NVIDIA-NeMo · yuki-97 · Nov 3, 2025 · Nov 3, 2025 · Nov 4, 2025 · Nov 9, 2025
@@ -20,11 +20,11 @@ The Hugging Face tensor parallel plan is the default. It's available for most mo
 
 ## Custom Parallel Plan Example
 
-A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel.py`.
+A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel/custom_parallel.py`.
 
 To implement the custom parallel plan, either update the value of `custom_parallel_plan` in the `yaml` file directly, or pass the override via the command line. For example:
 
 ```bash
 uv run examples/run_grpo_math.py \
-    policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel_plan
+    policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel.custom_parallel_plan
 ```
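The override value is a dotted Python path: everything before the last dot names a module (`examples.custom_parallel.custom_parallel`) and the final segment names an attribute in it (`custom_parallel_plan`). This is why the path gains an extra segment when the file moves into the `custom_parallel/` directory. The source does not show how NeMo RL resolves this string, so the sketch below is only one common way to do it. `resolve_dotted_path` is a hypothetical helper, not a NeMo RL function.

```python
import importlib
from typing import Any


def resolve_dotted_path(path: str) -> Any:
    """Import `pkg.module.attr` and return `attr`.

    Hypothetical helper for illustration; not part of NeMo RL.
    """
    # Split "a.b.c" into module "a.b" and attribute "c".
    module_path, _, attr_name = path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.attribute', got {path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Demonstrated with a standard-library object so the snippet runs anywhere.
ordered_dict_cls = resolve_dotted_path("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

Under this reading, the old value `examples.custom_parallel.custom_parallel_plan` would look for the attribute directly in `examples/custom_parallel`, which no longer matches the new file layout.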
@@ -44,6 +44,48 @@ code_env = CodeEnvironment.remote(env_config)
 
 We’re tracking an end-to-end example of this environment in [#858](https://github.com/NVIDIA-NeMo/RL/issues/858). Add a 👍 to show your interest.
 
+## Code Jaccard Environment
+
+The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable.
+
+### How it works
+- Extracts the assistant’s response text from each conversation.
+- Computes a Jaccard similarity score between the response and ground truth:
+  - Tokenizes both texts (whitespace), computes intersection/union, then applies a length ratio penalty.
+  - Scores are in [0, 1]. Observations label responses as “aligned/misaligned” using a 0.5 threshold.
+- Returns:
+  - observations: environment feedback strings
+  - rewards: tensor of similarity scores
+  - terminateds: all ones (single-step episodes)
+  - answers: optional, the response text when requested
+
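The scoring steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual implementation: the exact form of the length ratio penalty is not given here, so the sketch assumes it is the shorter token count divided by the longer one, and it assumes a score equal to the threshold counts as "aligned". The function names are hypothetical.

```python
def jaccard_with_length_penalty(response: str, ground_truth: str) -> float:
    """Whitespace-token Jaccard similarity scaled by a length ratio (assumed form)."""
    response_tokens = response.split()
    truth_tokens = ground_truth.split()
    if not response_tokens or not truth_tokens:
        return 0.0

    response_set, truth_set = set(response_tokens), set(truth_tokens)
    jaccard = len(response_set & truth_set) / len(response_set | truth_set)

    # Assumption: penalty = shorter token count / longer token count.
    shorter = min(len(response_tokens), len(truth_tokens))
    longer = max(len(response_tokens), len(truth_tokens))
    return jaccard * (shorter / longer)


def label(score: float, threshold: float = 0.5) -> str:
    """Map a score to the observation label (tie handling is an assumption)."""
    return "aligned" if score >= threshold else "misaligned"


exact = jaccard_with_length_penalty("def add(a, b): return a + b",
                                    "def add(a, b): return a + b")
partial = jaccard_with_length_penalty("return a", "return a + b")
print(exact, label(exact))
print(partial, label(partial))
```

Because both factors lie in [0, 1], their product stays in [0, 1], consistent with the score range stated above.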
+### Usage
+```python
+from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment
+
+env_config = {
+    "num_workers": 2,
+    # Optional default stop strings (unused in scoring but available for consistency)
+    "stop_strings": None,
+}
+
+code_jaccard_env = CodeJaccardEnvironment.remote(env_config)
+```
+
+### Configuration
+- `num_workers` (int): Number of parallel verification workers.
+- `stop_strings` (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring).
+
+### Sample GRPO config
+```yaml
+env:
+  code_jaccard:
+    num_workers: 2
+    stop_strings: null
+data:
+  env_name: code_jaccard
+```
+
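In the sample config, `data.env_name` names the environment and the matching key under `env` holds that environment's settings. The sketch below shows this lookup on plain dictionaries. It assumes the two names must match exactly, and `select_env_config` is a hypothetical helper rather than a NeMo RL function.

```python
from typing import Any


def select_env_config(config: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Return (env_name, env settings) for the env named in the data section.

    Hypothetical helper; assumes `data.env_name` must match a key under `env`.
    """
    env_name = config["data"]["env_name"]
    env_configs = config.get("env", {})
    if env_name not in env_configs:
        raise KeyError(
            f"data.env_name={env_name!r} has no matching entry under 'env' "
            f"(available: {sorted(env_configs)})"
        )
    return env_name, env_configs[env_name]


# The sample GRPO config above, written as a Python dict.
config = {
    "env": {"code_jaccard": {"num_workers": 2, "stop_strings": None}},
    "data": {"env_name": "code_jaccard"},
}

env_name, env_config = select_env_config(config)
print(env_name, env_config)
```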
 ## Reward Model Environment
 
 The Reward Model Environment uses pre-trained reward models to score conversation quality. 

@@ -12,7 +12,7 @@ We recommend launching the job using `uv`:
 uv run examples/run_grpo_math.py --config <PATH TO YAML CONFIG> {overrides}
 ```
 
-If not specified, `config` will default to [examples/configs/grpo.yaml](../../examples/configs/grpo_math_1B.yaml).
+If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](../../examples/configs/grpo_math_1B.yaml).
 
 **Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
 
@@ -85,6 +85,37 @@ def my_data_processor(
 
 We have an example of this as `math_data_processor` in [processors.py](../../nemo_rl/data/processors.py).
 
+### Task–Dataset Mapping
+
+- task_name (unique task identifier):
+  - Determines which processor, env, prompts, and dataset to use for this task.
+  - Currently we support a single dataset and a single environment. Therefore, task_name equals the dataset_name in config (i.e., config.data.dataset_name).
+- task_spec (TaskDataSpec):
+  - Specifies per-task system prompt and prompt (with defaults applied from a global spec when unspecified).
+- task_data_processors:
+  - Dict mapping: task_name -> (task_spec, processor_fn).
+  - Typical flow: provide a default mapping via defaultdict, then explicitly register the dataset-provided processor under the resolved task_name.
+
+Example (simplified):
+
+```python
+default_task_spec = TaskDataSpec(
+    task_name="math_default",
+    prompt_file=data_config["prompt_file"],
+    system_prompt_file=data_config["system_prompt_file"],
+)
+
+task_data_processors: dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]] = defaultdict(
+    lambda: (default_task_spec, math_hf_data_processor)
+)
+
+# Resolve task_name from dataset or spec
+task_spec = data.task_spec
+task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name
+assert hasattr(data, "processor"), "Dataset must have a processor attribute"
+task_data_processors[task_name] = (task_spec, data.processor)
+```
+
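The fallback behavior of the `defaultdict` mapping can be seen in isolation with stand-in types. In the sketch below, `StubTaskSpec`, `default_processor`, and `dataset_processor` are hypothetical placeholders for `TaskDataSpec` and the real processor functions. Only the lookup pattern is meant to carry over.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubTaskSpec:
    """Stand-in for TaskDataSpec (illustration only)."""
    task_name: str


def default_processor(sample: dict) -> dict:
    return {**sample, "processed_by": "default"}


def dataset_processor(sample: dict) -> dict:
    return {**sample, "processed_by": "dataset"}


default_spec = StubTaskSpec(task_name="math_default")

# Any task name without an explicit entry falls back to the default pair.
processors: dict[str, tuple[StubTaskSpec, Callable[[dict], dict]]] = defaultdict(
    lambda: (default_spec, default_processor)
)

# Explicit registration under the resolved task name.
processors["OpenMathInstruct-2"] = (
    StubTaskSpec(task_name="OpenMathInstruct-2"),
    dataset_processor,
)

_, registered_fn = processors["OpenMathInstruct-2"]
_, fallback_fn = processors["unregistered_task"]

print(registered_fn({"problem": "1+1"}))
print(fallback_fn({"problem": "1+1"}))
```

Note that indexing a `defaultdict` with a missing key also inserts that key, so `"unregistered_task"` is present in `processors` after the lookup. Use `.get()` or an `in` check when a lookup should not modify the mapping.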
 #### Putting It All Together
 
 GRPO expects datasets to have the following form:
@@ -96,21 +127,51 @@ GRPO expects datasets to have the following form:
 Then, you can set the data up as follows:
 
 ```python
-base_dataset = load_dataset("json", data_files=data_config["dataset_name"])["train"]
-tokenizer = get_tokenizer(tokenizer_config)
 
-task_data_processors = defaultdict(lambda: (math_task_spec, math_data_processor))
-task_data_processors["math"] = (math_task_spec, math_data_processor)
+# 1) Select environment from data config
+env_name = data_config["env_name"]
+env = get_env(env_name=env_name, env_configs=env_configs)
 
-math_env = MathEnvironment.remote(env_configs["math"]) # ray remote actor
+# 2) Build default TaskDataSpec from config (prompts loaded from files if present)
+default_task_spec = TaskDataSpec(
+    task_name="math_default",
+    prompt_file=data_config["prompt_file"],
+    system_prompt_file=data_config["system_prompt_file"],
+)
 
+# 3) Define default processor mapping
+task_data_processors: dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]] = defaultdict(
+    lambda: (default_task_spec, math_hf_data_processor)
+)
+
+# 4) Load dataset via helper (built-ins or local/HF datasets)
+data = load_response_dataset(data_config, seed)
+
+# 5) Resolve task spec/name and ensure dataset provides a processor
+task_spec = data.task_spec
+task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name
+assert hasattr(data, "processor"), "Dataset must have a processor attribute"
+task_data_processors[task_name] = (task_spec, data.processor)
+
+# 6) Construct processed datasets (train and optional validation)
 dataset = AllTaskProcessedDataset(
-    base_dataset,
+    data.formatted_ds["train"],
     tokenizer,
-    math_task_spec,
+    default_task_spec,
     task_data_processors,
     max_seq_length=data_config["max_input_seq_length"],
 )
+val_dataset = (
+    AllTaskProcessedDataset(
+        data.formatted_ds["validation"],
+        tokenizer,
+        default_task_spec,
+        task_data_processors,
+        max_seq_length=data_config["max_input_seq_length"],
+    )
+    if data.formatted_ds["validation"]
+    else None
+)
 ```
 
 Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples.
@@ -121,6 +182,25 @@ GRPO supports various types of environments for different tasks, including **[Ma
 
 For more information about environments, see the [Environments Guide](environments.md).
 
+### Env–Task Mapping
+
+- env:
+  - The environment actor for reward/evaluation, constructed via `get_env(env_name=..., env_configs=...)`.
+  - The environment to use is declared under the data section of the config (e.g., `data.env_name` states which env the dataset uses).
+- task_to_env:
+  - Dict mapping: task_name -> env. In the current single-task setup this typically points all tasks to the same env, but this structure enables different envs per task in future multi-task scenarios.
+
+Example (simplified):
+
+```python
+env_name = data_config["env_name"]  # declared under config.data
+env = get_env(env_name=env_name, env_configs=env_configs)
+
+task_to_env: dict[str, EnvironmentInterface] = defaultdict(lambda: env)
+task_to_env[task_name] = env
+val_task_to_env = task_to_env  # validation usually mirrors training mapping
+```
+
 ## Policy Model
 
 We define a {py:class}`PolicyInterface <nemo_rl.models.interfaces>` that contains everything you need to train a Policy model.

@@ -247,7 +247,8 @@ data:
   system_prompt_file: null
   shuffle: true
   num_workers: 1
-
+  processor: "math_hf_data_processor"
+  env_name: "math"
   dataset_name: "OpenMathInstruct-2"
   # You can use custom response datasets for training and validation. For example:
   #   data:
@@ -264,10 +265,6 @@ env:
   math:
     num_workers: 8
     math_verify_impl: "hf_math_verify"
-  ## unused in this config but needed for DAPO recipe
-  dapo:
-    num_workers: 8
-    math_verify_impl: "dapo_math_verify"
 
 logger:
   log_dir: "logs"  # Base directory for all logs

@@ -1,16 +1,20 @@
 # GRPO Algorithm Configuration
 defaults: "grpo_math_1B.yaml"
 
+data:
+  env_name: "reward_model"
+
 env:
   reward_model:  
-    enabled: true
+    processor: "math_hf_data_processor"
     model_name: "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
     tokenizer:
       name: ${env.reward_model.model_name}
     precision: "bfloat16"
     batch_size: ${policy.train_micro_batch_size}
     checkpoint_path: null
     max_model_len: 2048
+    offload_optimizer_for_logprob: false
     resources:
       gpus_per_node: 1
       num_nodes: 1

@@ -85,8 +85,6 @@ data:
   prompt_file: null
   dataset_name: DAPOMath17K
 env:
-  dapo:
-    num_workers: 16
   math:
     num_workers: 16
     math_verify_impl: "dapo_math_verify"

@@ -0,0 +1,67 @@
+defaults: ../../grpo_math_1B.yaml
+grpo:
+  num_prompts_per_step: 64
+  max_num_steps: 10
+checkpointing:
+  checkpoint_dir: results/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
+  metric_name: val:reward
+policy:
+  # This model name is unusable because the model has not been updated on Hugging Face yet.
+  # ISSUE: https://github.com/NVIDIA-NeMo/RL/issues/1571
+  model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 
+  max_total_sequence_length: 32768
+  train_global_batch_size: 64
+  train_micro_batch_size: 1
+  logprob_batch_size: 1
+  dtensor_cfg:
+    activation_checkpointing: true
+    context_parallel_size: 4
+    cpu_offload: true
+    tensor_parallel_size: 8
+    custom_parallel_plan: examples.custom_parallel.llama_nemotron_super_49b_custom_plan.custom_parallel_plan
+  dynamic_batching:
+    enabled: true
+  sequence_packing:
+    enabled: false
+  optimizer:
+    kwargs:
+      lr: 3.0e-07
+  scheduler:
+  - name: torch.optim.lr_scheduler.LinearLR
+    kwargs:
+      start_factor: 0.1
+      end_factor: 1.0
+      total_iters: 13
+  - name: torch.optim.lr_scheduler.ConstantLR
+    kwargs:
+      factor: 1.0
+      total_iters: 10000000000
+  - milestones:
+    - 13
+  generation:
+    vllm_cfg:
+      tensor_parallel_size: 4
+data:
+  # Training with HelpSteer3 will lead to high logprob error.
+  # ISSUE: https://github.com/NVIDIA-NeMo/RL/issues/1570
+  prompt_file: null
+  dataset_name: HelpSteer3
+  split: preference
+  env_name: "code_jaccard"
+  processor: helpsteer3_data_processor
+env:
+  code_jaccard:
+    num_workers: 8
+logger:
+  wandb_enabled: true
+  wandb:
+    project: grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
+    name: grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-tp${policy.dtensor_cfg.tensor_parallel_size}
+  tensorboard:
+    log_dir: tb_logs-grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
+  mlflow:
+    experiment_name: grpo-helpsteer3
+    run_name: grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5
+cluster:
+  gpus_per_node: 8
+  num_nodes: 16
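The `scheduler` list in the recipe above reads as a linear warmup followed by a constant phase. The sketch below computes the resulting learning rate in plain Python, without PyTorch. It assumes the two entries are chained at the `milestones` step (as `torch.optim.lr_scheduler.SequentialLR` would do) and uses the closed-form `LinearLR` factor. `lr_multiplier` is a hypothetical helper, and the default values are copied from the config.

```python
def lr_multiplier(
    step: int,
    start_factor: float = 0.1,
    end_factor: float = 1.0,
    warmup_iters: int = 13,
    milestone: int = 13,
) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Assumes LinearLR until `milestone`, then ConstantLR with factor 1.0.
    """
    if step >= milestone:
        # Constant phase: factor 1.0 leaves the base learning rate unchanged.
        return 1.0
    # Linear warmup: interpolate from start_factor to end_factor.
    progress = min(step, warmup_iters) / warmup_iters
    return start_factor + (end_factor - start_factor) * progress


BASE_LR = 3.0e-07  # policy.optimizer.kwargs.lr in the recipe

for step in (0, 6, 13, 100):
    print(step, BASE_LR * lr_multiplier(step))
```

Under these assumptions the learning rate starts at one tenth of `3.0e-07`, rises linearly over the first 13 steps, and stays at `3.0e-07` afterwards. With `max_num_steps: 10`, a run of this recipe would end while still in the warmup phase.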