106 changes: 88 additions & 18 deletions docs/data/download-huggingface.md

::::

::::{tab-item} Python Script
Downloads using the `datasets` library directly with streaming support.

**Use when**: You need custom preprocessing, streaming for large datasets, or specific split handling.

```python
import json
from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check the dataset page for available splits

# Stream rows one at a time so the full dataset never sits in memory
with open(output_file, "w", encoding="utf-8") as f:
    for row in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(row) + "\n")
```

Save the script as `download.py` and run it:

```bash
uv run download.py
```

Verify the download:

```bash
wc -l train.jsonl
# Expected: 1000000 train.jsonl
```

**Streaming benefits**:
- Memory-efficient for large datasets (millions of rows)
- Progress visible during download

:::{note}
For gated or private datasets, authenticate first:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
```

Or use `huggingface-cli login` before running the script.
:::

::::

:::::

---

## NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
|---------|-----------|--------|
| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` | Math |
| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` | Code |
| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant` | Agent |
| Structured Outputs | `nvidia/Nemotron-RL-instruction_following-structured_outputs` | Instruction |
| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` | Knowledge |

---

## Troubleshooting

::::{dropdown} Authentication Failed (401)
Avoid passing tokens on the command line—they appear in shell history.
**Recommended** — Use environment variable:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+output_dirpath=./data/
```
:::


:::{dropdown} Automatic Downloads During Data Preparation
:icon: download

To clear a cached dataset: `rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>`
| Behavior | Source |
|----------|--------|
| Auto-download | `nemo_gym/train_data_utils.py:476-494` |
:::

## Next Steps

::::{grid} 1 2 2 2
:gutter: 3

:::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Prepare and Validate
:link: prepare-validate
:link-type: doc

Preprocess raw data, run `ng_prepare_data`, and add `agent_ref` routing.
:::

:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Collect Rollouts
:link: /get-started/rollout-collection
:link-type: doc

Generate training examples by running your agent on prepared data.
:::

:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Train with NeMo RL
:link: /tutorials/nemo-rl-grpo/index
:link-type: doc

Use validated data with NeMo RL for GRPO training.
:::

::::
34 changes: 33 additions & 1 deletion docs/data/index.md
(data-index)=
# Data

NeMo Gym datasets use JSONL format for reinforcement learning (RL) training. Each dataset connects to an **agent server** (orchestrates agent-environment interactions) which routes requests to a **resources server** (provides tools and computes rewards).

## Prerequisites

Additional fields like `expected_answer` vary by resources server—the component that provides tools and computes rewards.

**Source**: `nemo_gym/base_resources_server.py:35-36`

### Required Fields

| Field | Added By | Description |
|-------|----------|-------------|
| `responses_create_params` | User | Input to the model during training. Contains `input` (messages) and optional `tools`, `temperature`, etc. |
| `agent_ref` | `ng_prepare_data` | Routes each row to its resource server. Auto-generated during data preparation. |

### Optional Fields

| Field | Description |
|-------|-------------|
| `expected_answer` | Ground truth for verification (task-specific). |
| `question` | Original question text (for reference). |
| `id` | Tracking identifier. |

:::{tip}
Check `resources_servers/<name>/README.md` for fields required by each resource server's `verify()` method.
:::
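The rules in the tables above can be checked programmatically. The sketch below is illustrative only (the field names follow this section; the actual validation lives in `ng_prepare_data`):

```python
import json

def check_row(line: str) -> list[str]:
    """Return a list of problems found in one JSONL row."""
    problems = []
    row = json.loads(line)
    params = row.get("responses_create_params")
    if not isinstance(params, dict):
        problems.append("missing responses_create_params")
    elif not isinstance(params.get("input"), list):
        problems.append("responses_create_params.input must be a list of messages")
    return problems

line = '{"responses_create_params": {"input": [{"role": "user", "content": "2+2?"}]}}'
print(check_row(line))  # → []
```

An empty list means the row passes these basic checks; server-specific fields still need to match the relevant `verify()` method.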

### The `agent_ref` Field

The `agent_ref` field maps each row to a specific resource server. A training dataset can blend multiple resource servers in a single file—`agent_ref` tells NeMo Gym which server handles each row.

```json
{
"responses_create_params": {"input": [{"role": "user", "content": "..."}]},
"agent_ref": {"type": "responses_api_agents", "name": "math_with_judge_simple_agent"}
}
```

**You don't create `agent_ref` manually.** The `ng_prepare_data` tool adds it automatically based on your config file. The tool matches the agent type (`responses_api_agents`) with the agent name from the config.
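Conceptually, the tool does the equivalent of the following (a simplified sketch, not the actual `ng_prepare_data` implementation; the agent type and name come from your config file):

```python
import json

# Values taken from the config file — shown here as literals for illustration
agent_type = "responses_api_agents"
agent_name = "math_with_judge_simple_agent"

row = {"responses_create_params": {"input": [{"role": "user", "content": "..."}]}}
row["agent_ref"] = {"type": agent_type, "name": agent_name}
print(json.dumps(row))
```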

### Example Data
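A representative row (illustrative values only—the optional fields depend on your resources server):

```json
{
  "responses_create_params": {
    "input": [
      {"role": "developer", "content": "Solve the problem. Put the answer inside \\boxed{}."},
      {"role": "user", "content": "What is 2 + 2?"}
    ]
  },
  "expected_answer": "4",
  "agent_ref": {"type": "responses_api_agents", "name": "math_with_judge_simple_agent"}
}
```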

140 changes: 137 additions & 3 deletions docs/data/prepare-validate.md
Success output:

```
####################################################################################################
```

This generates two types of output:
- **Per-dataset metrics**: `resources_servers/example_multi_step/data/example_metrics.json` (alongside source JSONL)
- **Aggregated metrics**: `data/test/example_metrics.json` (in output directory)

---

Check `resources_servers/<name>/README.md` for required fields specific to each resources server.

---

## Preprocess Raw Datasets

If your dataset doesn't have `responses_create_params`, you need to preprocess it before using `ng_prepare_data`.

**When to preprocess**:
- Downloaded datasets without NeMo Gym format
- Custom data needing system prompts
- Need to split into train/validation sets

### Add `responses_create_params`

The `responses_create_params` field wraps your input in the Responses API format. This typically includes a system prompt and the user content.

::::{dropdown} Preprocessing script (preprocess.py)
:icon: code
:open:

Save this script as `preprocess.py`. It reads a raw JSONL file, adds `responses_create_params`, and splits into train/validation:

```python
import json
import os

# Configuration — customize these for your dataset
INPUT_FIELD = "problem" # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999 # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."
with open(FILENAME, "r", encoding="utf-8") as fin, \
open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:

lines = list(fin)
split_idx = int(len(lines) * TRAIN_RATIO)

for i, line in enumerate(lines):
if not line.strip():
continue
row = json.loads(line)

# Remove fields not needed for training (optional)
row.pop("generated_solution", None)
row.pop("problem_source", None)

# Add responses_create_params
row["responses_create_params"] = {
"input": [
{"role": "developer", "content": SYSTEM_PROMPT},
{"role": "user", "content": row.get(INPUT_FIELD, "")},
]
}

out = json.dumps(row) + "\n"
(ftrain if i < split_idx else fval).write(out)
```

:::{important}
You must customize these variables for your dataset:
- `INPUT_FIELD`: The field name containing your input text. Common values: `"problem"` (math), `"question"` (QA), `"prompt"` (general), `"instruction"` (instruction-following)
- `SYSTEM_PROMPT`: Task-specific instructions for the model
- `TRAIN_RATIO`: Train/validation split ratio
:::

::::

Run and verify:

```bash
uv run preprocess.py
wc -l train.jsonl validation.jsonl
```

### Create Config for Custom Data

After preprocessing, create a config file to point `ng_prepare_data` at your local files.

::::{dropdown} Example config: custom_data.yaml
:icon: file-code

```yaml
custom_resources_server:
resources_servers:
custom_server:
entrypoint: app.py
domain: math # math | coding | agent | knowledge | other
description: Custom math dataset
verified: false

custom_simple_agent:
responses_api_agents:
simple_agent:
entrypoint: app.py
resources_server:
type: resources_servers
name: custom_resources_server
model_server:
type: responses_api_models
name: policy_model
datasets:
- name: train
type: train
jsonl_fpath: train.jsonl
license: Creative Commons Attribution 4.0 International
- name: validation
type: validation
jsonl_fpath: validation.jsonl
license: Creative Commons Attribution 4.0 International
```

::::

Run data preparation:

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,custom_data.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
```

This validates your data and adds the `agent_ref` field to each row, routing samples to your resource server.
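You can spot-check the result with a small helper like the one below (a sketch under the assumption that prepared files land in your `output_dirpath`; adjust the path to match):

```python
import json

def rows_missing_agent_ref(path: str) -> int:
    """Count rows in a prepared JSONL file that lack the agent_ref routing field."""
    missing = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip() and "agent_ref" not in json.loads(line):
                missing += 1
    return missing
```

For a correctly prepared file, `rows_missing_agent_ref("data/train.jsonl")` should return `0`.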

---

## Validation Modes

| Mode | Purpose | Validates |

| Error | Result | Fix |
|-------|--------|-----|
| Invalid role | Sample skipped | Use `user`, `assistant`, `system`, or `developer` |
| Missing dataset file | `AssertionError` | Create file or set `+should_download=true` |

:::{warning}
Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
:::

::::{dropdown} Find invalid samples
:icon: code
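A minimal sketch of such a check (the exact checks `ng_prepare_data` applies may differ; this reports rows that fail to parse or lack `responses_create_params`):

```python
import json

def find_invalid(path: str):
    """Yield (line_number, reason) for rows that would be skipped."""
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                yield n, f"invalid JSON: {exc}"
                continue
            if "responses_create_params" not in row:
                yield n, "missing responses_create_params"
```

Run `list(find_invalid("your_data.jsonl"))` and compare the count against the metrics file.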
4. **Compute metrics** — Aggregate statistics
5. **Collate** — Combine samples with agent references

### Output Locations

Metrics files are written to two locations:
- **Per-dataset**: `{dataset_jsonl_path}_metrics.json` — alongside each source JSONL file
- **Aggregated**: `{output_dirpath}/{type}_metrics.json` — combined metrics per dataset type

### Re-Running

- **Output files** (`train.jsonl`, `validation.jsonl`) are overwritten in `output_dirpath`
- **Metrics files** (`*_metrics.json`) are compared — delete them if your data changed

### Generated Metrics