diff --git a/docs/reference/faq.md b/docs/reference/faq.md
index 905b0ab4a..30d925a70 100644
--- a/docs/reference/faq.md
+++ b/docs/reference/faq.md
@@ -14,7 +14,7 @@ Tests are strongly encouraged and you must have at least one test for every serv
 
 # How To: Upload and download a dataset from HuggingFace
 
-The huggingface client requires that your credentials are in `env.yaml`, along with some other pertinent details needed to upload to the designated place.
+The huggingface client requires that your credentials are in `env.yaml`, along with some other pertinent details needed to upload to the designated place.
 ```yaml
 hf_token: {your huggingface token}
 hf_organization: {your huggingface org}
@@ -22,19 +22,19 @@ hf_collection_name: {your collection}
 hf_collection_slug: {your collection slug} # alphanumeric string found at the end of a collection URI
 # optional:
-hf_dataset_prefix: str # field to override the default value "NeMo-Gym" prepended to the dataset name
+hf_dataset_prefix: str # field to override the default value "Nemotron-RL" prepended to the dataset name
 ```
 
 Naming convention for Huggingface datasets is as follows.
-`{hf_organization}/{hf_dataset_prefix}-{domain}–{resource_server_name}-{your dataset name}`
+`{hf_organization}/{hf_dataset_prefix}-{domain}-{resource_server OR dataset_name}`
 
 E.g.:
-`NVIDIA/Nemo-Gym-Math-math_with_judge-dapo17k`
+`nvidia/Nemotron-RL-math-OpenMathReasoning`
 
-You will only need to manually input the `{your dataset name}` portion of the above when inputting the `dataset_name` flag in the upload command (refer to the command below). Everything preceding it will be automatically populated using your config prior to upload.
+You will only need to manually input the `{dataset_name}` portion of the above when supplying the `dataset_name` flag in the upload command (refer to the command below). Everything preceding it will be automatically populated from your config prior to upload.
+Note that `dataset_name` is optional; when provided, it overrides the `resource_server` name.
 
 To upload to Huggingface, use the below command:
 ```bash
@@ -47,6 +47,45 @@ ng_upload_dataset_to_hf \
 
 Because of the required dataset nomenclature, the resource server config path is required when uploading. Specifically, `domain` is used in the naming of a dataset in Huggingface.
 
+By default, the `split` parameter for uploading is set to `train`, which runs a check for the required field `{"responses_create_params"}`. Specifying `validation` or `test` bypasses this check:
+
+```bash
+resource_config_path="resources_servers/multineedle/configs/multineedle.yaml"
+ng_upload_dataset_to_hf \
+    +dataset_name={your dataset name} \
+    +input_jsonl_fpath=data/multineedle_benchmark_validation.jsonl \
+    +resource_config_path=${resource_config_path} \
+    +split=validation
+```
+
+## Uploading with Pull Request workflow
+When uploading to an organization repository where you don't have direct write access (e.g., `nvidia/`), use the `+create_pr=true` flag to create a Pull Request instead of pushing directly. You can also customize the commit message and description.
+
+If you want to specify the revision (branch name), add the `+revision={your branch name}` flag. Omitting `create_pr` (or setting it to `false`) commits to an existing branch; setting `create_pr=true` creates a brand-new branch for the Pull Request. Reuse the same `revision` across uploads to add multiple files to the same PR.
+
+```bash
+ng_upload_dataset_to_hf \
+    +dataset_name=OpenMathReasoning \
+    +input_jsonl_fpath=data/validation.jsonl \
+    +resource_config_path=${resource_config_path} \
+    +split=validation \
+    +create_pr=true \
+    +revision=my-branch-name \
+    +commit_message="Add validation set" \
+    +commit_description="Includes 545 examples"
+```
+
+The command will output a link to the created Pull Request:
+```bash
+[Nemo-Gym] - Pull Request created: https://huggingface.co/datasets/nvidia/Nemotron-RL-math-OpenMathReasoning/discussions/1
+```
+
+:::{note}
+The `commit_message` and `commit_description` parameters work for both direct pushes and Pull Requests. If not provided, HuggingFace auto-generates a commit message based on the filename.
+:::
+
+
+## Deleting Datasets from Gitlab
 You can optionally pass a `+delete_from_gitlab=true` flag to the above command, which will delete the model and all of its artifacts from Gitlab. By default, this is set to `False`.
 ```bash
 resource_config_path="resources_servers/multineedle/configs/multineedle.yaml"
@@ -59,7 +98,7 @@ ng_upload_dataset_to_hf \
 
 There will be a confirmation dialog to confirm the deletion:
 ```bash
-[Nemo-Gym] - Dataset uploaded successful
+[Nemo-Gym] - Dataset upload successful
 [Nemo-Gym] - Found model 'fs-test' in the registry. Are you sure you want to delete it from Gitlab? [y/N]:
 ```
@@ -83,6 +122,8 @@ ng_delete_dataset_from_gitlab \
 Gitlab model names are case sensitive. There can be models named 'My_Model' and 'my_model' living simultaneously in the registry. When uploading to Huggingface with the intention of deleting Gitlab artifacts, be sure the casing of your Huggingface dataset name matches that of Gitlab's.
 :::
+
+## Downloading Datasets from Huggingface
 Downloading a dataset from Huggingface is straightforward:
 
 **For structured datasets (with train/validation/test splits):**
diff --git a/nemo_gym/config_types.py b/nemo_gym/config_types.py
index 130c66f40..55e21e4b5 100644
--- a/nemo_gym/config_types.py
+++ b/nemo_gym/config_types.py
@@ -222,15 +222,33 @@ class BaseUploadJsonlDatasetHuggingFaceConfig(BaseNeMoGymCLIConfig):
     hf_organization: str = Field(description="HuggingFace organization name where dataset will be uploaded.")
     hf_collection_name: str = Field(description="HuggingFace collection name for organizing datasets.")
     hf_collection_slug: str = Field(description="Alphanumeric collection slug found at the end of collection URI.")
-    dataset_name: str = Field(
-        description="Name of the dataset (will be combined with domain and resource server name)."
+    dataset_name: Optional[str] = Field(
+        default=None, description="Name of the dataset (will be combined with domain and resource server name)."
     )
     input_jsonl_fpath: str = Field(description="Path to the local jsonl file to upload.")
     resource_config_path: str = Field(
         description="Path to resource server config file (used to extract domain for naming convention)."
     )
     hf_dataset_prefix: str = Field(
-        default="NeMo-Gym", description="Prefix prepended to dataset name (default: 'NeMo-Gym')."
+        default="Nemotron-RL", description="Prefix prepended to dataset name (default: 'Nemotron-RL')."
+    )
+    split: Literal["train", "validation", "test"] = Field(
+        default="train",
+        description="Dataset split type (e.g., 'train', 'validation', 'test'). Format validation only applies to 'train' splits.",
+    )
+    create_pr: bool = Field(
+        default=False,
+        description="Create a pull request instead of pushing directly. Required for repos where you do not have write access.",
+    )
+    revision: Optional[str] = Field(
+        default=None,
+        description="Git revision (branch name) to upload to. Use the same revision for multiple files to upload to the same PR. If not provided with create_pr=True, a new branch/PR will be created automatically.",
+    )
+    commit_message: Optional[str] = Field(
+        default=None, description="Custom commit message. If not provided, HuggingFace auto-generates one."
+    )
+    commit_description: Optional[str] = Field(
+        default=None, description="Optional commit description with additional context."
     )
diff --git a/nemo_gym/hf_utils.py b/nemo_gym/hf_utils.py
index 93c1b3caa..33647e215 100644
--- a/nemo_gym/hf_utils.py
+++ b/nemo_gym/hf_utils.py
@@ -33,7 +33,7 @@ def create_huggingface_client(token: str) -> HfApi:  # pragma: no cover
 
 def check_jsonl_format(file_path: str) -> bool:  # pragma: no cover
     """Check for the presence of the expected keys in the dataset"""
-    required_keys = {"responses_create_params", "reward_profiles", "expected_answer"}
+    required_keys = {"responses_create_params"}
     missing_keys_info = []
 
     try:
@@ -49,7 +49,7 @@ def check_jsonl_format(file_path: str) -> bool:  # pragma: no cover
             return False
 
     except (FileNotFoundError, json.JSONDecodeError) as e:
-        print(f"[Nemo-Gym] - Error reading or prasing the JSON file: {e}")
+        print(f"[Nemo-Gym] - Error reading or parsing the JSON file: {e}")
         return False
 
     return True
@@ -123,15 +123,19 @@ def upload_jsonl_dataset(
     with open(config.resource_config_path, "r") as f:
         data = yaml.safe_load(f)
 
-    domain = d.title() if (d := visit_resource_server(data).to_dict().get("domain")) else None
+    domain = d.lower() + "-" if (d := visit_resource_server(data).to_dict().get("domain")) else ""
     resource_server = config.resource_config_path.split("/")[1]
-    dataset_name = config.dataset_name
+    dataset_name = config.dataset_name or resource_server
     prefix = config.hf_dataset_prefix + "-" if config.hf_dataset_prefix else ""
-    repo_id = f"{config.hf_organization}/{prefix}{domain}-{resource_server}-{dataset_name}"
-    collection_id = f"{config.hf_organization}/{config.hf_collection_name}-{config.hf_collection_slug}"
+    collection_id = (
+        f"{config.hf_organization}/{config.hf_collection_name.lower().replace(' ', '-')}-{config.hf_collection_slug}"
+    )
 
-    # Dataset format check
-    if not check_jsonl_format(config.input_jsonl_fpath):
+    repo_id = f"{config.hf_organization}/{prefix}{domain}{dataset_name}"
+
+    # Dataset format check - only strict check for training data
+    is_training = config.split.lower() == "train"
+    if is_training and not check_jsonl_format(config.input_jsonl_fpath):
         print("[Nemo-Gym] - JSONL file format check failed.")
         return
@@ -140,8 +144,11 @@
         client.create_repo(repo_id=repo_id, token=config.hf_token, repo_type="dataset", private=True, exist_ok=True)
         print(f"[Nemo-Gym] - Repo '{repo_id}' is ready for use")
     except HfHubHTTPError as e:
-        print(f"[Nemo-Gym] - Error creating repo: {e}")
-        raise
+        if config.create_pr and "403" in str(e):
+            print(f"[Nemo-Gym] - Repo '{repo_id}' exists (no create permission, will create PR)")
+        else:
+            print(f"[Nemo-Gym] - Error creating repo: {e}")
+            raise
 
     # Collection id + addition
     try:
@@ -159,14 +166,21 @@ def upload_jsonl_dataset(
 
     # File upload
     try:
-        client.upload_file(
+        commit_info = client.upload_file(
             path_or_fileobj=config.input_jsonl_fpath,
             path_in_repo=Path(config.input_jsonl_fpath).name,
             repo_id=repo_id,
             token=config.hf_token,
             repo_type="dataset",
+            create_pr=config.create_pr,
+            revision=config.revision,
+            commit_message=config.commit_message,
+            commit_description=config.commit_description,
         )
-        print("[Nemo-Gym] - Dataset uploaded successful")
+        if config.create_pr:
+            print(f"[Nemo-Gym] - Pull Request created: {commit_info.pr_url}")
+        else:
+            print("[Nemo-Gym] - Dataset upload successful")
     except HfHubHTTPError as e:
         print(f"[Nemo-Gym] - Error uploading file: {e}")
         raise
diff --git a/resources_servers/math_with_judge/configs/math_with_judge.yaml b/resources_servers/math_with_judge/configs/math_with_judge.yaml
index 56ab515fe..9998bc15c 100644
--- a/resources_servers/math_with_judge/configs/math_with_judge.yaml
+++ b/resources_servers/math_with_judge/configs/math_with_judge.yaml
@@ -42,6 +42,9 @@ math_with_judge_simple_agent:
       dataset_name: aime24
       version: 0.0.1
       artifact_fpath: aime24.jsonl
+      huggingface_identifier:
+        repo_id: nvidia/Nemotron-RL-math-OpenMathReasoning
+        artifact_fpath: aime24_validation.jsonl
       license: Apache 2.0
     - name: example
       type: example
diff --git a/resources_servers/structured_outputs/configs/structured_outputs_json.yaml b/resources_servers/structured_outputs/configs/structured_outputs_json.yaml
index a57f45f77..2c43a5772 100644
--- a/resources_servers/structured_outputs/configs/structured_outputs_json.yaml
+++ b/resources_servers/structured_outputs/configs/structured_outputs_json.yaml
@@ -35,6 +35,9 @@ structured_outputs_simple_agent:
       dataset_name: structured_outputs_251027_nano_v3_sdg_json_val
       version: 0.0.2
       artifact_fpath: structured_outputs_251027_nano_v3_sdg_json_val.jsonl
+      huggingface_identifier:
+        repo_id: nvidia/Nemotron-RL-instruction_following-structured_outputs
+        artifact_fpath: structured_outputs_251027_nano_v3_sdg_json_val.jsonl
       license: Apache 2.0
     - name: example
       type: example
diff --git a/resources_servers/workplace_assistant/configs/workplace_assistant.yaml b/resources_servers/workplace_assistant/configs/workplace_assistant.yaml
index 3472e169c..900f27252 100644
--- a/resources_servers/workplace_assistant/configs/workplace_assistant.yaml
+++ b/resources_servers/workplace_assistant/configs/workplace_assistant.yaml
@@ -35,6 +35,9 @@ workplace_assistant_simple_agent:
       dataset_name: workplace_assistant
       version: 0.0.4
      artifact_fpath: validation.jsonl
+      huggingface_identifier:
+        repo_id: nvidia/Nemotron-RL-agent-workplace_assistant
+        artifact_fpath: validation.jsonl
       license: Apache 2.0
     - name: example
       type: example
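
Reviewer note: the new repo naming convention spread across `hf_utils.py` and `config_types.py` above can be sketched as a standalone function. `build_repo_id` is a hypothetical helper name introduced here for illustration; the rules themselves (lowercased domain with trailing hyphen, `dataset_name` falling back to the resource server directory name, optional prefix) mirror the diff.

```python
from typing import Optional


def build_repo_id(
    hf_organization: str,
    resource_config_path: str,
    dataset_name: Optional[str] = None,
    domain: Optional[str] = None,
    hf_dataset_prefix: str = "Nemotron-RL",
) -> str:
    # Lowercased domain gets a trailing hyphen; empty string when absent.
    domain_part = domain.lower() + "-" if domain else ""
    # The resource server name is the second path component, e.g.
    # "resources_servers/<server>/configs/<cfg>.yaml" -> "<server>".
    resource_server = resource_config_path.split("/")[1]
    # dataset_name is optional and falls back to the resource server name.
    name = dataset_name or resource_server
    prefix = hf_dataset_prefix + "-" if hf_dataset_prefix else ""
    return f"{hf_organization}/{prefix}{domain_part}{name}"


print(build_repo_id(
    "nvidia",
    "resources_servers/math_with_judge/configs/math_with_judge.yaml",
    dataset_name="OpenMathReasoning",
    domain="Math",
))
# nvidia/Nemotron-RL-math-OpenMathReasoning
```

This reproduces the examples in the docs and configs above, including the fallback case where `dataset_name` is omitted and the resource server name is used instead.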