4 changes: 2 additions & 2 deletions docs/design-docs/fsdp2-parallel-plan.md
@@ -20,11 +20,11 @@ The Hugging Face tensor parallel plan is the default. It's available for most mo

## Custom Parallel Plan Example

A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel.py`.
A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel/custom_parallel.py`.

To implement the custom parallel plan, either update the value of `custom_parallel_plan` in the `yaml` file directly, or pass the override via the command line. For example:

```bash
uv run examples/run_grpo_math.py \
policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel_plan
policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel.custom_parallel_plan
```
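The override value is a dotted Python path: everything before the last dot names a module, and the final component names an object inside that module. How NeMo RL resolves the path internally is not shown in this diff. The following is only a minimal sketch of the general technique using `importlib`; the helper name `resolve_dotted_path` is hypothetical, and a standard-library path stands in for the plan.

```python
import importlib


def resolve_dotted_path(dotted_path: str):
    """Import `pkg.module.attr` and return `attr` (hypothetical helper)."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.attr', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Stand-in target from the standard library. A real run would pass a path such as
# "examples.custom_parallel.custom_parallel.custom_parallel_plan".
ordered_dict_cls = resolve_dotted_path("collections.OrderedDict")
```

Under this reading, moving the example file to `examples/custom_parallel/custom_parallel.py` adds one module component to the path, which matches the longer override value.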
42 changes: 42 additions & 0 deletions docs/guides/environments.md
@@ -44,6 +44,48 @@ code_env = CodeEnvironment.remote(env_config)

We’re tracking an end-to-end example of this environment in [#858](https://github.com/NVIDIA-NeMo/RL/issues/858). Add a 👍 to show your interest.

## Code Jaccard Environment

The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable.

### How it works
- Extracts the assistant’s response text from each conversation.
- Computes a Jaccard similarity score between the response and ground truth:
  - Tokenizes both texts (whitespace), computes intersection/union, then applies a length ratio penalty.
  - Scores are in [0, 1]. Observations label responses as “aligned/misaligned” using a 0.5 threshold.
- Returns:
  - observations: environment feedback strings
  - rewards: tensor of similarity scores
  - terminateds: all ones (single-step episodes)
  - answers: optional, the response text when requested
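
The scoring steps above can be sketched as follows. This is an illustration, not the environment's actual code: the function names are hypothetical, and the exact form of the length ratio penalty and the behavior at exactly 0.5 are assumptions (here the penalty is the shorter token count divided by the longer one, and a score of 0.5 counts as aligned).

```python
def jaccard_with_length_penalty(response: str, ground_truth: str) -> float:
    """Whitespace-token Jaccard similarity scaled by a length ratio.

    Assumption: the penalty is min(token count) / max(token count); the
    environment's actual formula may differ.
    """
    response_tokens = response.split()
    truth_tokens = ground_truth.split()
    if not response_tokens or not truth_tokens:
        return 0.0  # assumed convention for empty inputs
    response_set, truth_set = set(response_tokens), set(truth_tokens)
    jaccard = len(response_set & truth_set) / len(response_set | truth_set)
    length_ratio = min(len(response_tokens), len(truth_tokens)) / max(
        len(response_tokens), len(truth_tokens)
    )
    return jaccard * length_ratio


def label(score: float, threshold: float = 0.5) -> str:
    """Map a score to the aligned/misaligned label using the 0.5 threshold."""
    return "aligned" if score >= threshold else "misaligned"


identical = jaccard_with_length_penalty("return a + b", "return a + b")
partial = jaccard_with_length_penalty("return a", "return a + b")
```

Because both factors lie in [0, 1], their product does too, which is consistent with the stated score range.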

### Usage
```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment

env_config = {
    "num_workers": 2,
    # Optional default stop strings (unused in scoring but available for consistency)
    "stop_strings": None,
}

code_jaccard_env = CodeJaccardEnvironment.remote(env_config)
```

### Configuration
- `num_workers` (int): Number of parallel verification workers.
- `stop_strings` (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring).

### Sample GRPO config
```yaml
env:
  code_jaccard:
    num_workers: 2
    stop_strings: null
data:
  env_name: code_jaccard

## Reward Model Environment

The Reward Model Environment uses pre-trained reward models to score conversation quality.
96 changes: 88 additions & 8 deletions docs/guides/grpo.md
@@ -12,7 +12,7 @@ We recommend launching the job using `uv`:
uv run examples/run_grpo_math.py --config <PATH TO YAML CONFIG> {overrides}
```

If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo_math_1B.yaml).
If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](../../examples/configs/grpo_math_1B.yaml).

**Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.

@@ -85,6 +85,37 @@ def my_data_processor(

We have an example of this as `math_data_processor` in [processors.py](../../nemo_rl/data/processors.py).

### Task–Dataset Mapping

- task_name (unique task identifier):
  - Determines which processor, env, prompts, and dataset to use for this task.
  - Currently we support a single dataset and a single environment. Therefore, task_name equals the dataset_name in config (i.e., config.data.dataset_name).
- task_spec (TaskDataSpec):
  - Specifies per-task system prompt and prompt (with defaults applied from a global spec when unspecified).
- task_data_processors:
  - Dict mapping: task_name -> (task_spec, processor_fn).
  - Typical flow: provide a default mapping via defaultdict, then explicitly register the dataset-provided processor under the resolved task_name.

Example (simplified):

```python
default_task_spec = TaskDataSpec(
    task_name="math_default",
    prompt_file=data_config["prompt_file"],
    system_prompt_file=data_config["system_prompt_file"],
)

task_data_processors: dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]] = defaultdict(
    lambda: (default_task_spec, math_hf_data_processor)
)

# Resolve task_name from dataset or spec
task_spec = data.task_spec
task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name
assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)
```
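
The fallback behavior of this mapping comes from `collections.defaultdict`. The sketch below isolates that behavior; plain strings stand in for `TaskDataSpec` objects and processor callables, and the task names are placeholders chosen for illustration.

```python
from collections import defaultdict

# Plain strings stand in for TaskDataSpec objects and processor callables.
default_entry = ("default_spec", "math_hf_data_processor")
task_data_processors: dict[str, tuple[str, str]] = defaultdict(lambda: default_entry)

# Explicit registration under the resolved task name.
task_data_processors["OpenMathInstruct-2"] = ("dataset_spec", "dataset_processor")

registered = task_data_processors["OpenMathInstruct-2"]
# Looking up an unregistered task returns the default and also inserts the key.
fallback = task_data_processors["unseen_task"]
```

Note that indexing an unknown task silently inserts it with the default entry. `task_data_processors.get(name)` probes without inserting and returns `None` for unknown names.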

#### Putting It All Together

GRPO expects datasets to have the following form:
@@ -96,21 +127,51 @@ GRPO expects datasets to have the following form:
Then, you can set the data up as follows:

```python
base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"]
tokenizer = get_tokenizer(tokenizer_config)

task_data_processors = defaultdict(lambda: (math_task_spec, math_data_processor))
task_data_processors["math"] = (math_task_spec, math_data_processor)
# 1) Select environment from data config
env_name = data_config["env_name"]
env = get_env(env_name=env_name, env_configs=env_configs)

math_env = MathEnvironment.remote(env_configs["math"]) # ray remote actor
# 2) Build default TaskDataSpec from config (prompts loaded from files if present)
default_task_spec = TaskDataSpec(
    task_name="math_default",
    prompt_file=data_config["prompt_file"],
    system_prompt_file=data_config["system_prompt_file"],
)

# 3) Define default processor mapping
task_data_processors: dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]] = defaultdict(
    lambda: (default_task_spec, math_hf_data_processor)
)

# 4) Load dataset via helper (built-ins or local/HF datasets)
data = load_response_dataset(data_config, seed)

# 5) Resolve task spec/name and ensure dataset provides a processor
task_spec = data.task_spec
task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name
assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)

# 6) Construct processed datasets (train and optional validation)
dataset = AllTaskProcessedDataset(
    base_dataset,
    data.formatted_ds["train"],
    tokenizer,
    math_task_spec,
    default_task_spec,
    task_data_processors,
    max_seq_length=data_config["max_input_seq_length"],
)
val_dataset = (
    AllTaskProcessedDataset(
        data.formatted_ds["validation"],
        tokenizer,
        default_task_spec,
        task_data_processors,
        max_seq_length=data_config["max_input_seq_length"],
    )
    if data.formatted_ds["validation"]
    else None
)
```

Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples.
@@ -121,6 +182,25 @@ GRPO supports various types of environments for different tasks, including **[Ma

For more information about environments, see the [Environments Guide](environments.md).

### Env–Task Mapping

- env:
  - The environment actor for reward/evaluation, constructed via `get_env(env_name=..., env_configs=...)`.
  - The environment to use is declared under the data section of the config (e.g., `data.env_name` states which env the dataset uses).
- task_to_env:
  - Dict mapping: task_name -> env. In the current single-task setup this typically points all tasks to the same env, but this structure enables different envs per task in future multi-task scenarios.

Example (simplified):

```python
env_name = data_config["env_name"] # declared under config.data
env = get_env(env_name=env_name, env_configs=env_configs)

task_to_env: dict[str, EnvironmentInterface] = defaultdict(lambda: env)
task_to_env[task_name] = env
val_task_to_env = task_to_env # validation usually mirrors training mapping
```

## Policy Model

We define a {py:class}`PolicyInterface <nemo_rl.models.interfaces>` that contains everything you need to train a Policy model.
7 changes: 2 additions & 5 deletions examples/configs/grpo_math_1B.yaml
@@ -247,7 +247,8 @@ data:
  system_prompt_file: null
  shuffle: true
  num_workers: 1

  processor: "math_hf_data_processor"
  env_name: "math"
  dataset_name: "OpenMathInstruct-2"
  # You can use custom response datasets for training and validation. For example:
  # data:
@@ -264,10 +265,6 @@ env:
  math:
    num_workers: 8
    math_verify_impl: "hf_math_verify"
  ## unused in this config but needed for DAPO recipe
  dapo:
    num_workers: 8
    math_verify_impl: "dapo_math_verify"

logger:
  log_dir: "logs" # Base directory for all logs
6 changes: 5 additions & 1 deletion examples/configs/grpo_rm_1B.yaml
@@ -1,16 +1,20 @@
# GRPO Algorithm Configuration
defaults: "grpo_math_1B.yaml"

data:
  env_name: "reward_model"

env:
  reward_model:
    enabled: true
    processor: "math_hf_data_processor"
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
    tokenizer:
      name: ${env.reward_model.model_name}
    precision: "bfloat16"
    batch_size: ${policy.train_micro_batch_size}
    checkpoint_path: null
    max_model_len: 2048
    offload_optimizer_for_logprob: false
    resources:
      gpus_per_node: 1
      num_nodes: 1
2 changes: 0 additions & 2 deletions examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
@@ -85,8 +85,6 @@ data:
  prompt_file: null
  dataset_name: DAPOMath17K
env:
  dapo:
    num_workers: 16
  math:
    num_workers: 16
    math_verify_impl: "dapo_math_verify"
@@ -0,0 +1,67 @@
defaults: ../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  max_num_steps: 10
checkpointing:
  checkpoint_dir: results/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
  metric_name: val:reward
policy:
  # This model name is currently unusable because the model has not been updated on Hugging Face yet.
  # ISSUE: https://github.com/NVIDIA-NeMo/RL/issues/1571
  model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
  max_total_sequence_length: 32768
  train_global_batch_size: 64
  train_micro_batch_size: 1
  logprob_batch_size: 1
  dtensor_cfg:
    activation_checkpointing: true
    context_parallel_size: 4
    cpu_offload: true
    tensor_parallel_size: 8
    custom_parallel_plan: examples.custom_parallel.llama_nemotron_super_49b_custom_plan.custom_parallel_plan
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  optimizer:
    kwargs:
      lr: 3.0e-07
  scheduler:
    - name: torch.optim.lr_scheduler.LinearLR
      kwargs:
        start_factor: 0.1
        end_factor: 1.0
        total_iters: 13
    - name: torch.optim.lr_scheduler.ConstantLR
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones:
        - 13
  generation:
    vllm_cfg:
      tensor_parallel_size: 4
data:
  # Training with HelpSteer3 will lead to high logprob error.
  # ISSUE: https://github.com/NVIDIA-NeMo/RL/issues/1570
  prompt_file: null
  dataset_name: HelpSteer3
  split: preference
  env_name: "code_jaccard"
  processor: helpsteer3_data_processor
env:
  code_jaccard:
    num_workers: 8
logger:
  wandb_enabled: true
  wandb:
    project: grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
    name: grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-tp${policy.dtensor_cfg.tensor_parallel_size}
  tensorboard:
    log_dir: tb_logs-grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
  mlflow:
    experiment_name: grpo-helpsteer3
    run_name: grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
cluster:
  gpus_per_node: 8
  num_nodes: 16

This file was deleted.
