53 changes: 47 additions & 6 deletions docs/reference/faq.md
@@ -14,27 +14,27 @@ Tests are strongly encouraged and you must have at least one test for every serv


# How To: Upload and download a dataset from HuggingFace
The HuggingFace client requires that your credentials are in `env.yaml`, along with the other details needed to upload to the designated location.
```yaml
hf_token: {your huggingface token}
hf_organization: {your huggingface org}
hf_collection_name: {your collection}
hf_collection_slug: {your collection slug} # alphanumeric string found at the end of a collection URI

# optional:
hf_dataset_prefix: str # field to override the default value "NeMo-Gym" prepended to the dataset name
hf_dataset_prefix: str # field to override the default value "Nemotron-RL" prepended to the dataset name
```

The naming convention for HuggingFace datasets is as follows:

`{hf_organization}/{hf_dataset_prefix}-{domain}-{resource_server_name}-{your dataset name}`
`{hf_organization}/{hf_dataset_prefix}-{domain}-{resource_server OR dataset_name}`

E.g.:

`NVIDIA/Nemo-Gym-Math-math_with_judge-dapo17k`
`nvidia/Nemotron-RL-math-OpenMathReasoning`


You will only need to manually input the `{your dataset name}` portion of the above when inputting the `dataset_name` flag in the upload command (refer to the command below). Everything preceding it will be automatically populated using your config prior to upload.
You will only need to manually input the `{dataset_name}` portion of the above when passing the `dataset_name` flag in the upload command (refer to the command below); everything preceding it is populated automatically from your config prior to upload. Note that `dataset_name` is optional and, when provided, overrides `resource_server`.

To upload to Huggingface, use the below command:
```bash
@@ -47,6 +47,45 @@ ng_upload_dataset_to_hf \

Because of the required dataset nomenclature, the resource server config path is required when uploading. Specifically, `domain` is used in the naming of a dataset in Huggingface.

By default, the `split` parameter for uploading is set to `train`, which runs a check for the required field `{"responses_create_params"}`. Specifying `validation` or `test` bypasses this check:

```bash
resource_config_path="resources_servers/multineedle/configs/multineedle.yaml"
ng_upload_dataset_to_hf \
+dataset_name={your dataset name} \
+input_jsonl_fpath=data/multineedle_benchmark_validation.jsonl \
+resource_config_path=${resource_config_path} \
+split=validation
```

## Uploading with Pull Request workflow
When uploading to an organization repository where you don't have direct write access (e.g., nvidia/), use the `+create_pr=true` flag to create a Pull Request instead of pushing directly. You can also customize the commit message and description.

If you want to specify the revision (branch name), add the `+revision={your branch name}` flag. Omitting `create_pr` (or setting it to `false`) assumes you are committing to an existing branch; including it assumes a brand-new branch will be created.

```bash
ng_upload_dataset_to_hf \
+dataset_name=OpenMathReasoning \
+input_jsonl_fpath=data/validation.jsonl \
+resource_config_path=${resource_config_path} \
+split=validation \
+create_pr=true \
+revision=my-branch-name \
+commit_message="Add validation set" \
+commit_description="Includes 545 examples"
```

The command will output a link to the created Pull Request:
```bash
[Nemo-Gym] - Pull Request created: https://huggingface.co/datasets/nvidia/Nemotron-RL-math-OpenMathReasoning/discussions/1
```

:::{note}
The `commit_message` and `commit_description` parameters work for both direct pushes and Pull Requests. If not provided, HuggingFace auto-generates a commit message based on the filename.
:::
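To show how the flags map onto the upload call without performing a live upload, here is a sketch of the keyword arguments forwarded to `HfApi.upload_file` (`create_pr`, `revision`, `commit_message`, and `commit_description` are real `huggingface_hub` parameters; the helper itself is illustrative):

```python
from pathlib import Path


def build_upload_kwargs(cfg: dict) -> dict:
    """Translate the CLI flags into HfApi.upload_file keyword arguments."""
    return {
        "path_or_fileobj": cfg["input_jsonl_fpath"],
        "path_in_repo": Path(cfg["input_jsonl_fpath"]).name,
        "repo_id": cfg["repo_id"],
        "repo_type": "dataset",
        "create_pr": cfg.get("create_pr", False),
        # None revision means the default branch (or an auto-created PR branch)
        "revision": cfg.get("revision"),
        "commit_message": cfg.get("commit_message"),
        "commit_description": cfg.get("commit_description"),
    }
```

Reusing the same `revision` across several invocations stacks multiple files onto the same Pull Request.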


## Deleting Datasets from Gitlab
You can optionally pass a `+delete_from_gitlab=true` flag to the above command, which will delete the model and all of its artifacts from Gitlab; by default the flag is `false`.
```bash
resource_config_path="resources_servers/multineedle/configs/multineedle.yaml"
@@ -59,7 +98,7 @@ ng_upload_dataset_to_hf \

There will be a confirmation dialog to confirm the deletion:
```bash
[Nemo-Gym] - Dataset uploaded successful
[Nemo-Gym] - Dataset upload successful
[Nemo-Gym] - Found model 'fs-test' in the registry. Are you sure you want to delete it from Gitlab? [y/N]:
```
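The dialog defaults to "No", so only an explicit yes proceeds. The behavior can be sketched as (a hypothetical helper, not the actual CLI code):

```python
def confirm_deletion(answer: str) -> bool:
    """Interpret a [y/N] response; anything other than an explicit yes aborts."""
    return answer.strip().lower() in {"y", "yes"}
```

An empty response (pressing Enter) therefore keeps the Gitlab artifacts.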

@@ -83,6 +122,8 @@ ng_delete_dataset_from_gitlab \
Gitlab model names are case-sensitive: models named 'My_Model' and 'my_model' can live simultaneously in the registry. When uploading to HuggingFace with the intention of deleting Gitlab artifacts, be sure the casing of your HuggingFace dataset name matches the Gitlab name.
:::


## Downloading Datasets from Huggingface
Downloading a dataset from Huggingface is straightforward:

**For structured datasets (with train/validation/test splits):**
24 changes: 21 additions & 3 deletions nemo_gym/config_types.py
@@ -222,15 +222,33 @@ class BaseUploadJsonlDatasetHuggingFaceConfig(BaseNeMoGymCLIConfig):
hf_organization: str = Field(description="HuggingFace organization name where dataset will be uploaded.")
hf_collection_name: str = Field(description="HuggingFace collection name for organizing datasets.")
hf_collection_slug: str = Field(description="Alphanumeric collection slug found at the end of collection URI.")
dataset_name: str = Field(
description="Name of the dataset (will be combined with domain and resource server name)."
dataset_name: Optional[str] = Field(
default=None, description="Name of the dataset (will be combined with domain and resource server name)."
)
input_jsonl_fpath: str = Field(description="Path to the local jsonl file to upload.")
resource_config_path: str = Field(
description="Path to resource server config file (used to extract domain for naming convention)."
)
hf_dataset_prefix: str = Field(
default="NeMo-Gym", description="Prefix prepended to dataset name (default: 'NeMo-Gym')."
default="Nemotron-RL", description="Prefix prepended to dataset name (default: 'Nemotron-RL')."
)
split: Literal["train", "validation", "test"] = Field(
default="train",
description="Dataset split type (e.g., 'train', 'validation', 'test'). Format validation only applies to 'train' splits.",
)
create_pr: bool = Field(
default=False,
description="Create a pull request instead of pushing directly. Required for repos where you do not have write access.",
)
revision: Optional[str] = Field(
default=None,
description="Git revision (branch name) to upload to. Use the same revision for multiple files to upload to the same PR. If not provided with create_pr=True, a new branch/PR will be created automatically.",
)
commit_message: Optional[str] = Field(
default=None, description="Custom commit message. If not provided, HuggingFace auto-generates one."
)
commit_description: Optional[str] = Field(
default=None, description="Optional commit description with additional context."
)


38 changes: 26 additions & 12 deletions nemo_gym/hf_utils.py
@@ -33,7 +33,7 @@ def create_huggingface_client(token: str) -> HfApi:  # pragma: no cover

def check_jsonl_format(file_path: str) -> bool: # pragma: no cover
"""Check for the presence of the expected keys in the dataset"""
required_keys = {"responses_create_params", "reward_profiles", "expected_answer"}
required_keys = {"responses_create_params"}
missing_keys_info = []

try:
Expand All @@ -49,7 +49,7 @@ def check_jsonl_format(file_path: str) -> bool: # pragma: no cover
return False

except (FileNotFoundError, json.JSONDecodeError) as e:
print(f"[Nemo-Gym] - Error reading or prasing the JSON file: {e}")
print(f"[Nemo-Gym] - Error reading or parsing the JSON file: {e}")
return False

return True
@@ -123,15 +123,19 @@ def upload_jsonl_dataset(
with open(config.resource_config_path, "r") as f:
data = yaml.safe_load(f)

domain = d.title() if (d := visit_resource_server(data).to_dict().get("domain")) else None
domain = d.lower() + "-" if (d := visit_resource_server(data).to_dict().get("domain")) else ""
resource_server = config.resource_config_path.split("/")[1]
dataset_name = config.dataset_name
dataset_name = config.dataset_name or resource_server
prefix = config.hf_dataset_prefix + "-" if config.hf_dataset_prefix else ""
repo_id = f"{config.hf_organization}/{prefix}{domain}-{resource_server}-{dataset_name}"
collection_id = f"{config.hf_organization}/{config.hf_collection_name}-{config.hf_collection_slug}"
collection_id = (
f"{config.hf_organization}/{config.hf_collection_name.lower().replace(' ', '-')}-{config.hf_collection_slug}"
)

# Dataset format check
if not check_jsonl_format(config.input_jsonl_fpath):
repo_id = f"{config.hf_organization}/{prefix}{domain}{dataset_name}"

# Dataset format check - only strict check for training data
is_training = config.split.lower() == "train"
if is_training and not check_jsonl_format(config.input_jsonl_fpath):
print("[Nemo-Gym] - JSONL file format check failed.")
return

@@ -140,8 +144,11 @@
client.create_repo(repo_id=repo_id, token=config.hf_token, repo_type="dataset", private=True, exist_ok=True)
print(f"[Nemo-Gym] - Repo '{repo_id}' is ready for use")
except HfHubHTTPError as e:
print(f"[Nemo-Gym] - Error creating repo: {e}")
raise
if config.create_pr and "403" in str(e):
print(f"[Nemo-Gym] - Repo '{repo_id}' exists (no create permission, will create PR)")
else:
print(f"[Nemo-Gym] - Error creating repo: {e}")
raise

# Collection id + addition
try:
@@ -159,14 +166,21 @@

# File upload
try:
client.upload_file(
commit_info = client.upload_file(
path_or_fileobj=config.input_jsonl_fpath,
path_in_repo=Path(config.input_jsonl_fpath).name,
repo_id=repo_id,
token=config.hf_token,
repo_type="dataset",
create_pr=config.create_pr,
revision=config.revision,
commit_message=config.commit_message,
commit_description=config.commit_description,
)
print("[Nemo-Gym] - Dataset uploaded successful")
if config.create_pr:
print(f"[Nemo-Gym] - Pull Request created: {commit_info.pr_url}")
else:
print("[Nemo-Gym] - Dataset upload successful")
except HfHubHTTPError as e:
print(f"[Nemo-Gym] - Error uploading file: {e}")
raise
@@ -42,6 +42,9 @@ math_with_judge_simple_agent:
dataset_name: aime24
version: 0.0.1
artifact_fpath: aime24.jsonl
huggingface_identifier:
repo_id: nvidia/Nemotron-RL-math-OpenMathReasoning
artifact_fpath: aime24_validation.jsonl
license: Apache 2.0
- name: example
type: example
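The new `huggingface_identifier` blocks pair a HuggingFace `repo_id` with the artifact file to fetch. A consumer might collect them like this (a sketch under the assumption that the YAML layout matches the diff above):

```python
import yaml


def extract_hf_identifiers(config_text: str) -> list[dict]:
    """Collect every huggingface_identifier mapping found in an agent config."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "huggingface_identifier":
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(yaml.safe_load(config_text))
    return found
```

Each returned mapping carries the `repo_id` and `artifact_fpath` needed to download the file from HuggingFace.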
@@ -35,6 +35,9 @@ structured_outputs_simple_agent:
dataset_name: structured_outputs_251027_nano_v3_sdg_json_val
version: 0.0.2
artifact_fpath: structured_outputs_251027_nano_v3_sdg_json_val.jsonl
huggingface_identifier:
repo_id: nvidia/Nemotron-RL-instruction_following-structured_outputs
artifact_fpath: structured_outputs_251027_nano_v3_sdg_json_val.jsonl
license: Apache 2.0
- name: example
type: example
@@ -35,6 +35,9 @@ workplace_assistant_simple_agent:
dataset_name: workplace_assistant
version: 0.0.4
artifact_fpath: validation.jsonl
huggingface_identifier:
repo_id: nvidia/Nemotron-RL-agent-workplace_assistant
artifact_fpath: validation.jsonl
license: Apache 2.0
- name: example
type: example