53 changes: 47 additions & 6 deletions docs/reference/faq.md
@@ -14,27 +14,27 @@ Tests are strongly encouraged and you must have at least one test for every serv


# How To: Upload and download a dataset from HuggingFace
The HuggingFace client requires that your credentials are in `env.yaml`, along with the other details needed to upload to the designated location.
```yaml
hf_token: {your huggingface token}
hf_organization: {your huggingface org}
hf_collection_name: {your collection}
hf_collection_slug: {your collection slug} # alphanumeric string found at the end of a collection URI

# optional:
hf_dataset_prefix: str # field to override the default value "NeMo-Gym" prepended to the dataset name
hf_dataset_prefix: str # field to override the default value "Nemotron-RL" prepended to the dataset name
```

The naming convention for HuggingFace datasets is as follows:

`{hf_organization}/{hf_dataset_prefix}-{domain}-{resource_server_name}-{your dataset name}`
`{hf_organization}/{hf_dataset_prefix}-{domain}-{resource_server OR dataset_name}`

E.g.:

`NVIDIA/Nemo-Gym-Math-math_with_judge-dapo17k`
`nvidia/Nemotron-RL-math-OpenMathReasoning`


You will only need to manually input the `{your dataset name}` portion of the above when inputting the `dataset_name` flag in the upload command (refer to the command below). Everything preceding it will be automatically populated using your config prior to upload.
You will only need to manually input the `{dataset_name}` portion of the above when passing the `dataset_name` flag in the upload command (refer to the command below); everything preceding it is populated automatically from your config prior to upload. Note that `dataset_name` is optional and, when provided, overrides `resource_server`.

To upload to Huggingface, use the below command:
```bash
@@ -47,6 +47,45 @@ ng_upload_dataset_to_hf \

Because of the required dataset nomenclature, the resource server config path is required when uploading. Specifically, `domain` is used in the naming of a dataset in Huggingface.

By default, the `split` parameter for uploading is set to `train`, which runs a check for the required field `{"responses_create_params"}`. Specifying `validation` or `test` bypasses this check:

```bash
resource_config_path="resources_servers/multineedle/configs/multineedle.yaml"
ng_upload_dataset_to_hf \
+dataset_name={your dataset name} \
+input_jsonl_fpath=data/multineedle_benchmark_validation.jsonl \
+resource_config_path=${resource_config_path} \
+split=validation
```

## Uploading with Pull Request workflow
When uploading to an organization repository where you don't have direct write access (e.g., nvidia/), use the `+create_pr=true` flag to create a Pull Request instead of pushing directly. You can also customize the commit message and description.

If you want to specify the revision (branch name), add the `+revision={your branch name}` flag. Omitting `create_pr` (or setting it to `false`) assumes you are committing to an existing branch; including it assumes a brand-new branch will be created.

```bash
ng_upload_dataset_to_hf \
+dataset_name=OpenMathReasoning \
+input_jsonl_fpath=data/validation.jsonl \
+resource_config_path=${resource_config_path} \
+split=validation \
+create_pr=true \
+revision=my-branch-name \
+commit_message="Add validation set" \
+commit_description="Includes 545 examples"
```

The command will output a link to the created Pull Request:
```bash
[Nemo-Gym] - Pull Request created: https://huggingface.co/datasets/nvidia/Nemotron-RL-math-OpenMathReasoning/discussions/1
```

:::{note}
The `commit_message` and `commit_description` parameters work for both direct pushes and Pull Requests. If not provided, HuggingFace auto-generates a commit message based on the filename.
:::
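To show how the flags map onto the upload call without performing a live upload, here is a sketch of the keyword arguments forwarded to `HfApi.upload_file` (`create_pr`, `revision`, `commit_message`, and `commit_description` are real `huggingface_hub` parameters; the helper itself is illustrative):

```python
from pathlib import Path


def build_upload_kwargs(cfg: dict) -> dict:
    """Translate the CLI flags into HfApi.upload_file keyword arguments."""
    return {
        "path_or_fileobj": cfg["input_jsonl_fpath"],
        "path_in_repo": Path(cfg["input_jsonl_fpath"]).name,
        "repo_id": cfg["repo_id"],
        "repo_type": "dataset",
        "create_pr": cfg.get("create_pr", False),
        # None revision means the default branch (or an auto-created PR branch)
        "revision": cfg.get("revision"),
        "commit_message": cfg.get("commit_message"),
        "commit_description": cfg.get("commit_description"),
    }
```

Reusing the same `revision` across several invocations stacks multiple files onto the same Pull Request.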


## Deleting Datasets from Gitlab
You can optionally pass a `+delete_from_gitlab=true` flag to the above command, which will delete the model and all of its artifacts from Gitlab; by default the flag is `false`.
```bash
resource_config_path="resources_servers/multineedle/configs/multineedle.yaml"
@@ -59,7 +98,7 @@ ng_upload_dataset_to_hf \

There will be a confirmation dialog to confirm the deletion:
```bash
[Nemo-Gym] - Dataset uploaded successful
[Nemo-Gym] - Dataset upload successful
[Nemo-Gym] - Found model 'fs-test' in the registry. Are you sure you want to delete it from Gitlab? [y/N]:
```
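The dialog defaults to "No", so only an explicit yes proceeds. The behavior can be sketched as (a hypothetical helper, not the actual CLI code):

```python
def confirm_deletion(answer: str) -> bool:
    """Interpret a [y/N] response; anything other than an explicit yes aborts."""
    return answer.strip().lower() in {"y", "yes"}
```

An empty response (pressing Enter) therefore keeps the Gitlab artifacts.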

@@ -83,6 +122,8 @@ ng_delete_dataset_from_gitlab \
Gitlab model names are case-sensitive: models named 'My_Model' and 'my_model' can live simultaneously in the registry. When uploading to HuggingFace with the intention of deleting Gitlab artifacts, be sure the casing of your HuggingFace dataset name matches the Gitlab name.
:::


## Downloading Datasets from Huggingface
Downloading a dataset from Huggingface is straightforward:

**For structured datasets (with train/validation/test splits):**
24 changes: 21 additions & 3 deletions nemo_gym/config_types.py
@@ -222,15 +222,33 @@ class BaseUploadJsonlDatasetHuggingFaceConfig(BaseNeMoGymCLIConfig):
hf_organization: str = Field(description="HuggingFace organization name where dataset will be uploaded.")
hf_collection_name: str = Field(description="HuggingFace collection name for organizing datasets.")
hf_collection_slug: str = Field(description="Alphanumeric collection slug found at the end of collection URI.")
dataset_name: str = Field(
description="Name of the dataset (will be combined with domain and resource server name)."
dataset_name: Optional[str] = Field(
default=None, description="Name of the dataset (will be combined with domain and resource server name)."
)
input_jsonl_fpath: str = Field(description="Path to the local jsonl file to upload.")
resource_config_path: str = Field(
description="Path to resource server config file (used to extract domain for naming convention)."
)
hf_dataset_prefix: str = Field(
default="NeMo-Gym", description="Prefix prepended to dataset name (default: 'NeMo-Gym')."
default="Nemotron-RL", description="Prefix prepended to dataset name (default: 'Nemotron-RL')."
)
split: Literal["train", "validation", "test"] = Field(
default="train",
description="Dataset split type (e.g., 'train', 'validation', 'test'). Format validation only applies to 'train' splits.",
)
create_pr: bool = Field(
default=False,
description="Create a pull request instead of pushing directly. Required for repos where you do not have write access.",
)
revision: Optional[str] = Field(
default=None,
description="Git revision (branch name) to upload to. Use the same revision for multiple files to upload to the same PR. If not provided with create_pr=True, a new branch/PR will be created automatically.",
)
commit_message: Optional[str] = Field(
default=None, description="Custom commit message. If not provided, HuggingFace auto-generates one."
)
commit_description: Optional[str] = Field(
default=None, description="Optional commit description with additional context."
)


38 changes: 26 additions & 12 deletions nemo_gym/hf_utils.py
@@ -33,7 +33,7 @@ def create_huggingface_client(token: str) -> HfApi:  # pragma: no cover

def check_jsonl_format(file_path: str) -> bool: # pragma: no cover
"""Check for the presence of the expected keys in the dataset"""
required_keys = {"responses_create_params", "reward_profiles", "expected_answer"}
required_keys = {"responses_create_params"}
missing_keys_info = []

try:
Expand All @@ -49,7 +49,7 @@ def check_jsonl_format(file_path: str) -> bool: # pragma: no cover
return False

except (FileNotFoundError, json.JSONDecodeError) as e:
print(f"[Nemo-Gym] - Error reading or prasing the JSON file: {e}")
print(f"[Nemo-Gym] - Error reading or parsing the JSON file: {e}")
return False

return True
@@ -123,15 +123,19 @@ def upload_jsonl_dataset(
with open(config.resource_config_path, "r") as f:
data = yaml.safe_load(f)

domain = d.title() if (d := visit_resource_server(data).to_dict().get("domain")) else None
domain = d.lower() + "-" if (d := visit_resource_server(data).to_dict().get("domain")) else ""
resource_server = config.resource_config_path.split("/")[1]
dataset_name = config.dataset_name
dataset_name = config.dataset_name or resource_server
prefix = config.hf_dataset_prefix + "-" if config.hf_dataset_prefix else ""
repo_id = f"{config.hf_organization}/{prefix}{domain}-{resource_server}-{dataset_name}"
collection_id = f"{config.hf_organization}/{config.hf_collection_name}-{config.hf_collection_slug}"
collection_id = (
f"{config.hf_organization}/{config.hf_collection_name.lower().replace(' ', '-')}-{config.hf_collection_slug}"
)

# Dataset format check
if not check_jsonl_format(config.input_jsonl_fpath):
repo_id = f"{config.hf_organization}/{prefix}{domain}{dataset_name}"

# Dataset format check - only strict check for training data
is_training = config.split.lower() == "train"
if is_training and not check_jsonl_format(config.input_jsonl_fpath):
print("[Nemo-Gym] - JSONL file format check failed.")
return

@@ -140,8 +144,11 @@
client.create_repo(repo_id=repo_id, token=config.hf_token, repo_type="dataset", private=True, exist_ok=True)
print(f"[Nemo-Gym] - Repo '{repo_id}' is ready for use")
except HfHubHTTPError as e:
print(f"[Nemo-Gym] - Error creating repo: {e}")
raise
if config.create_pr and "403" in str(e):
print(f"[Nemo-Gym] - Repo '{repo_id}' exists (no create permission, will create PR)")
else:
print(f"[Nemo-Gym] - Error creating repo: {e}")
raise

# Collection id + addition
try:
@@ -159,14 +166,21 @@

# File upload
try:
client.upload_file(
commit_info = client.upload_file(
path_or_fileobj=config.input_jsonl_fpath,
path_in_repo=Path(config.input_jsonl_fpath).name,
repo_id=repo_id,
token=config.hf_token,
repo_type="dataset",
create_pr=config.create_pr,
revision=config.revision,
commit_message=config.commit_message,
commit_description=config.commit_description,
)
print("[Nemo-Gym] - Dataset uploaded successful")
if config.create_pr:
print(f"[Nemo-Gym] - Pull Request created: {commit_info.pr_url}")
else:
print("[Nemo-Gym] - Dataset upload successful")
except HfHubHTTPError as e:
print(f"[Nemo-Gym] - Error uploading file: {e}")
raise
@@ -42,6 +42,9 @@ math_with_judge_simple_agent:
dataset_name: aime24
version: 0.0.1
artifact_fpath: aime24.jsonl
huggingface_identifier:
repo_id: nvidia/Nemotron-RL-math-OpenMathReasoning
artifact_fpath: aime24_validation.jsonl
license: Apache 2.0
- name: example
type: example
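The new `huggingface_identifier` blocks pair a HuggingFace `repo_id` with the artifact file to fetch. A consumer might collect them like this (a sketch under the assumption that the YAML layout matches the diff above):

```python
import yaml


def extract_hf_identifiers(config_text: str) -> list[dict]:
    """Collect every huggingface_identifier mapping found in an agent config."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "huggingface_identifier":
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(yaml.safe_load(config_text))
    return found
```

Each returned mapping carries the `repo_id` and `artifact_fpath` needed to download the file from HuggingFace.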
@@ -35,6 +35,9 @@ structured_outputs_simple_agent:
dataset_name: structured_outputs_251027_nano_v3_sdg_json_val
version: 0.0.2
artifact_fpath: structured_outputs_251027_nano_v3_sdg_json_val.jsonl
huggingface_identifier:
repo_id: nvidia/Nemotron-RL-instruction_following-structured_outputs
artifact_fpath: structured_outputs_251027_nano_v3_sdg_json_val.jsonl
license: Apache 2.0
- name: example
type: example
@@ -35,6 +35,9 @@ workplace_assistant_simple_agent:
dataset_name: workplace_assistant
version: 0.0.4
artifact_fpath: validation.jsonl
huggingface_identifier:
repo_id: nvidia/Nemotron-RL-agent-workplace_assistant
artifact_fpath: validation.jsonl
license: Apache 2.0
- name: example
type: example