Merged
Commits
74 commits
80c2bb2
init prorl
hijkzzz Jan 22, 2026
7fc1869
fix paper link
hijkzzz Jan 22, 2026
4112ef3
fix code
hijkzzz Jan 22, 2026
69db96a
fix
hijkzzz Jan 22, 2026
88c9231
fix async
hijkzzz Jan 22, 2026
ba604c2
fix log
hijkzzz Jan 22, 2026
49152a6
fix
hijkzzz Jan 22, 2026
8579b5a
fix
hijkzzz Jan 23, 2026
cc3ceca
refactor
hijkzzz Jan 23, 2026
3e0bf44
refactor
hijkzzz Jan 23, 2026
cbbc279
fix
hijkzzz Jan 23, 2026
de32998
fix
hijkzzz Jan 23, 2026
877778b
fix length penalty
hijkzzz Jan 23, 2026
8cc3924
fix
hijkzzz Jan 23, 2026
7912118
refactor
hijkzzz Jan 23, 2026
74aed6b
fix
hijkzzz Jan 23, 2026
177abd5
fix
hijkzzz Jan 23, 2026
ab95476
fix
hijkzzz Jan 23, 2026
d0e3c21
Fix comments for tis
hijkzzz Jan 23, 2026
4de5015
update
hijkzzz Jan 23, 2026
0d93255
update
hijkzzz Jan 23, 2026
d4b09ba
update
hijkzzz Jan 23, 2026
34e670e
update
hijkzzz Jan 24, 2026
ce6c814
update
hijkzzz Jan 24, 2026
8bdd906
update
hijkzzz Jan 24, 2026
00c0fee
update
hijkzzz Jan 24, 2026
eb834d5
Update nemo_rl/algorithms/advantage_estimator.py
yfw Jan 26, 2026
2dca4a7
fix: address yfw's code review comments
hijkzzz Jan 27, 2026
073bf0c
fix: address reviewer comments for stop_properly_penalty and adv_esti…
hijkzzz Jan 27, 2026
348d2f2
fix
hijkzzz Jan 27, 2026
3615d8b
fix
hijkzzz Jan 27, 2026
45693a0
fix
hijkzzz Jan 27, 2026
bcb6025
update
hijkzzz Jan 27, 2026
eb4aafc
update
hijkzzz Jan 27, 2026
e1008ed
pre-commit
hijkzzz Jan 27, 2026
22cb5b9
update
hijkzzz Jan 27, 2026
8a2559a
fix CI bugs
hijkzzz Jan 29, 2026
8e52658
add docs
hijkzzz Jan 30, 2026
dbc2131
update
hijkzzz Jan 30, 2026
4736d94
update docs
hijkzzz Jan 30, 2026
58e7935
Merge branch 'main' into jianh/prorl
hijkzzz Jan 31, 2026
6329fc4
fix test case
hijkzzz Feb 1, 2026
1ba2f98
Add ProRLv2 functional and nightly tests
hijkzzz Feb 2, 2026
3f5ffd6
Merge branch 'main' into jianh/prorl
hijkzzz Feb 2, 2026
403fbce
update
hijkzzz Feb 2, 2026
858d112
fix
hijkzzz Feb 2, 2026
4c331ed
Merge branch 'main' into jianh/prorl
hijkzzz Feb 2, 2026
dfc3624
fix
hijkzzz Feb 2, 2026
26d83b6
fix
hijkzzz Feb 2, 2026
371b9c3
fix
hijkzzz Feb 2, 2026
4fc1e12
Update nemo_rl/models/generation/interfaces.py
hijkzzz Feb 2, 2026
7d463cb
Merge origin/main into jianh/prorl
hijkzzz Feb 3, 2026
87d7b08
fix
hijkzzz Feb 3, 2026
95e29c1
fix
hijkzzz Feb 3, 2026
693d635
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
c0fd6e7
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
1f1a2c6
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
bb6a0b3
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
6138bd7
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
6ee66bc
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
790240a
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
ffbc1a4
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
dd6e831
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
2b9afb3
Update docs/guides/prorlv2.md
hijkzzz Feb 4, 2026
0ed9b8b
fix: disable dynamic sampling in prorlv2 L1 test and fix TIS defaults
hijkzzz Feb 4, 2026
8321cb6
Merge branch 'main' into jianh/prorl
hijkzzz Feb 4, 2026
f66ecd7
Apply suggestion from @jgerh
hijkzzz Feb 4, 2026
0152ffa
Merge branch 'main' into jianh/prorl
hijkzzz Feb 5, 2026
7b7957d
fix test
hijkzzz Feb 5, 2026
abc73a2
Merge branch 'main' into jianh/prorl
hijkzzz Feb 6, 2026
3593dcf
fix configs
hijkzzz Feb 6, 2026
1a514fa
Merge remote-tracking branch 'origin/main' into jianh/prorl
hijkzzz Feb 6, 2026
1abf129
Merge remote-tracking branch 'origin/main' into jianh/prorl
hijkzzz Feb 7, 2026
3bb1d0f
fix: bump nightly GPU hours threshold to 1300
hijkzzz Feb 7, 2026
205 changes: 205 additions & 0 deletions docs/guides/prorlv2.md
@@ -0,0 +1,205 @@
# An In-Depth Walkthrough of ProRLv2 in NeMo RL

This guide covers the ProRLv2 configuration pattern in NeMo RL, based on the example config [`examples/configs/prorlv2.yaml`](../../examples/configs/prorlv2.yaml).

ProRLv2 (as used in this repo) is best thought of as **GRPO plus a bundle of stability/efficiency techniques** commonly used for long-horizon RL fine-tuning:

- **DAPO dynamic sampling**: skip prompt-groups with zero reward variance
- **Decoupled (asymmetric) clipping**: `ratio_clip_max > ratio_clip_min`
- **Token-level policy gradient loss**
- **Importance sampling correction with TIS/ICE-POP** (especially helpful for MoE/backend-mismatch scenarios)
- **Reinforce++: Decoupled local/global advantage normalization** (`reinforce_plus_plus`)
- **“Stop properly” penalty** for truncated responses

This document focuses on ProRLv2-specific knobs and gotchas. For foundational concepts on GRPO (data, environments, generation backends, loss/metrics), see the [NeMo RL GRPO Guide](grpo.md). For the original DAPO motivation behind dynamic sampling/overlong shaping, see the [NeMo RL DAPO Guide](dapo.md).

## Quickstart: Launch a ProRLv2 Run

Use the example configuration [`examples/configs/prorlv2.yaml`](../../examples/configs/prorlv2.yaml):

```bash
uv run examples/run_grpo_math.py --config examples/configs/prorlv2.yaml {overrides}
```

`prorlv2.yaml` inherits from [`examples/configs/grpo_math_1B.yaml`](../../examples/configs/grpo_math_1B.yaml) and only overrides a small set of fields under `grpo` and `loss_fn`, plus output directories.

**Reminder**: Don’t forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You’ll need to do a `huggingface-cli login` as well for gated models.

## DAPO: Dynamic Sampling

Standard GRPO will train on all generated responses, even when a prompt’s `num_generations_per_prompt` responses all receive the same reward (no per-prompt learning signal). **Dynamic sampling** filters to keep only prompt-groups with diverse rewards (`std > 0`), and can accumulate across multiple generation batches until it reaches the target rollout batch size.

- **Config**: enable with `grpo.use_dynamic_sampling: true` and tune:
  - `grpo.batch_multiplier`: how many extra prompts to generate to compensate for filtering
  - `grpo.dynamic_sampling_max_gen_batches`: upper bound on generation rounds before raising an error
- **Implementation**: see `dynamic_sampling()` in [`nemo_rl/algorithms/grpo.py`](../../nemo_rl/algorithms/grpo.py).
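
The filtering step can be sketched as follows. This is a minimal standalone sketch assuming rewards are already grouped by prompt; `filter_prompt_groups` is a hypothetical helper, and the real `dynamic_sampling()` additionally accumulates kept groups across generation batches until the target rollout batch size is reached.

```python
from statistics import pstdev


def filter_prompt_groups(grouped_rewards):
    """Keep only prompt groups whose rewards have non-zero variance."""
    return {
        prompt_id: rewards
        for prompt_id, rewards in grouped_rewards.items()
        if pstdev(rewards) > 0.0  # zero std => no per-prompt learning signal
    }


groups = {
    "p0": [1.0, 1.0, 1.0, 1.0],  # all correct -> filtered out
    "p1": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes -> kept
    "p2": [0.0, 0.0, 0.0, 0.0],  # all wrong -> filtered out
}
print(sorted(filter_prompt_groups(groups)))  # ['p1']
```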

## Advantage Estimator: Reinforce++

The ProRLv2 recipe uses **Reinforce++** advantage estimation instead of the standard GRPO-style group baseline.

Quick intuition:

- Reinforce++ uses **decoupled local + global normalization**.
- Compared to GRPO-style **local-only normalization**, this decoupling can be **more stable** in longer runs (less sensitivity to per-batch scale/variance shifts).

Computation (as implemented in this repo, with the ProRLv2 example defaults):

```text
Defaults in examples/configs/prorlv2.yaml:
grpo.adv_estimator.minus_baseline = true
loss_fn.use_kl_in_reward = false

Steps:
1) Per prompt-group, compute mean reward, then subtract it:
a_i = r_i - mean_{j in same prompt} r_j

2) Global normalize across *all valid response tokens* in the batch:
A <- (A - mean(A)) / sqrt(max(var(A), 1e-8))
```
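
The two steps above can be sketched as a standalone function (hypothetical name, not the repo's `ReinforcePlusPlusAdvantageEstimator` API; note the actual implementation normalizes over all valid response tokens rather than per-sample scalars):

```python
import math


def reinforce_pp_advantages(rewards, prompt_ids, eps=1e-8):
    # 1) subtract the per-prompt mean reward (minus_baseline)
    groups = {}
    for r, p in zip(rewards, prompt_ids):
        groups.setdefault(p, []).append(r)
    baseline = {p: sum(v) / len(v) for p, v in groups.items()}
    adv = [r - baseline[p] for r, p in zip(rewards, prompt_ids)]
    # 2) normalize globally across the whole batch
    mu = sum(adv) / len(adv)
    var = sum((a - mu) ** 2 for a in adv) / len(adv)
    return [(a - mu) / math.sqrt(max(var, eps)) for a in adv]
```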

```yaml
grpo:
  adv_estimator:
    name: "reinforce_plus_plus"
    normalize_rewards: true
    use_leave_one_out_baseline: false
    minus_baseline: true
```

- **Config**: `grpo.adv_estimator.name: "reinforce_plus_plus"`
- **Implementation**: the training loop wires this via `ReinforcePlusPlusAdvantageEstimator` in [`nemo_rl/algorithms/grpo.py`](../../nemo_rl/algorithms/grpo.py).
- **Reference**: [REINFORCE++ paper](https://arxiv.org/abs/2501.03262)

## Reward Shaping: “Stop properly” Penalty (Truncation Penalty)

When a generation hits the max length without emitting EOS, many pipelines mark it as **truncated**. The “stop properly” penalty scales the reward for truncated samples:

- `stop_properly_penalty_coef = 0.0`: truncated samples get **zero reward**
- `stop_properly_penalty_coef = 1.0`: **no penalty** (keep original rewards)
- Any value in \([0, 1]\) interpolates between the two.

In the example config:

```yaml
grpo:
  reward_shaping:
    enabled: true
    stop_properly_penalty_coef: 0.0
```

- **Implementation**: `apply_reward_shaping()` in [`nemo_rl/algorithms/reward_functions.py`](../../nemo_rl/algorithms/reward_functions.py).

:::{important}
In the current implementation, if `stop_properly_penalty_coef` is set (not `null`), `apply_reward_shaping()` **returns early** after applying truncation scaling. That means you **cannot** apply DAPO "overlong reward shaping" in the same run unless you set `stop_properly_penalty_coef: null` and provide the DAPO overlong parameters (`overlong_buffer_length`, `overlong_buffer_penalty`, `max_response_length`).
:::
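
The scaling itself is a one-liner; here is a hedged sketch with a hypothetical helper name (see `apply_reward_shaping()` for the real logic, including the early-return behavior):

```python
def stop_properly_penalty(rewards, truncated, coef):
    # truncated samples keep coef * reward; properly stopped samples are untouched
    return [r * coef if t else r for r, t in zip(rewards, truncated)]


print(stop_properly_penalty([1.0, 1.0], [True, False], coef=0.0))  # [0.0, 1.0]
```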

## Loss: Decoupled (Asymmetric) Clipping

ProRLv2 uses DAPO’s “decoupled clipping” idea by setting different lower/upper clip bounds:

```yaml
loss_fn:
  ratio_clip_min: 0.2
  ratio_clip_max: 0.27
```

This keeps PPO/GRPO-style clipping behavior but allows a larger expansion region than the contraction region, which can help exploration and reduce early collapse.

- **Implementation**: `ClippedPGLossFn` documents decoupled clipping in [`nemo_rl/algorithms/loss_functions.py`](../../nemo_rl/algorithms/loss_functions.py).
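
The asymmetric clip can be sketched for a single (ratio, advantage) pair (a hypothetical standalone function, not `ClippedPGLossFn` itself; the value below is the PPO surrogate objective to be maximized):

```python
def clipped_objective(ratio, advantage, clip_min=0.2, clip_max=0.27):
    # decoupled clipping: the ratio may expand up to 1 + clip_max but
    # may only contract down to 1 - clip_min
    clipped_ratio = min(max(ratio, 1.0 - clip_min), 1.0 + clip_max)
    return min(ratio * advantage, clipped_ratio * advantage)
```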

## Loss: Token-level Policy Gradient

ProRLv2 enables token-level loss:

```yaml
loss_fn:
  token_level_loss: true
```

This computes the policy gradient loss per token (under masking) instead of aggregating per sequence, which is often helpful for long CoT/variable-length rollouts.
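
The two aggregation modes can be sketched over a batch of per-token losses with a 0/1 validity mask (hypothetical standalone helpers, not the repo's loss code):

```python
def token_level_loss(tok_losses, mask):
    # average over every valid token in the batch:
    # long responses contribute proportionally more tokens
    total = sum(
        l * m
        for row_l, row_m in zip(tok_losses, mask)
        for l, m in zip(row_l, row_m)
    )
    n_tokens = sum(m for row in mask for m in row)
    return total / n_tokens


def sequence_level_loss(tok_losses, mask):
    # average within each sequence first, then across sequences:
    # every response contributes equally regardless of length
    per_seq = [
        sum(l * m for l, m in zip(row_l, row_m)) / max(sum(row_m), 1)
        for row_l, row_m in zip(tok_losses, mask)
    ]
    return sum(per_seq) / len(per_seq)
```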

## Truncated Importance Sampling

When training and generation backends differ (e.g., numerics, precision, MoE routing, or vLLM vs training framework), you may see a mismatch between:

- `generation_logprobs` (logprobs under the generation backend that produced samples)
- `prev_logprobs` (logprobs under the training framework policy)

NeMo RL supports **importance sampling correction**, and ProRLv2’s example config turns it on together with **truncated importance sampling**.

Quick intuition:

- This is mainly useful for **MoE/backend mismatch** cases, where the generation backend and the training policy can disagree on logprobs.
- We compute an importance weight from `prev_logprobs` (training policy) vs `generation_logprobs` (generator). **ICE-POP** drops outliers by zeroing weights outside \([min, max]\).
- In the common setup of **one policy update per rollout batch** (i.e., minibatch equals the per-step rollout batch; no PPO multi-epoch reuse), the PPO/GRPO likelihood ratio term is effectively **1.0** at update time, so the main stability issue is the MoE/backend-mismatch importance weights.
- “Online ICE-POP” here just means applying that ICE-POP filtering **during loss computation** on the current training batch.

- **Reference**: [The Online IcePop Solution for MoE models](https://hijkzzz.notion.site/online-ice-pop)

```yaml
loss_fn:
  use_importance_sampling_correction: true
  truncated_importance_sampling_ratio: 5.0
  truncated_importance_sampling_ratio_min: 0.5
  truncated_importance_sampling_type: "icepop"
```

- **`use_importance_sampling_correction`**: enable token-level importance weights (must be `true` for truncated IS)
- **`truncated_importance_sampling_ratio`**: upper bound on the importance weight (clamp value for `"tis"`, upper threshold for `"icepop"`)
- **`truncated_importance_sampling_ratio_min`**: lower bound used by ICE-POP filtering
- **`truncated_importance_sampling_type`**:
  - `"tis"`: clamp weights to `<= truncated_importance_sampling_ratio`
  - `"icepop"`: set weights outside \([min, max]\) to zero (filter outliers)

- **Implementation**: see `ClippedPGLossFn` init-time checks and logic in [`nemo_rl/algorithms/loss_functions.py`](../../nemo_rl/algorithms/loss_functions.py).
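
Both truncation modes can be sketched over per-token importance weights, where each weight is `exp(prev_logprob - generation_logprob)` (a hypothetical standalone function; the repo applies this inside `ClippedPGLossFn`):

```python
import math


def truncate_is_weights(prev_logprobs, gen_logprobs, mode, w_max=5.0, w_min=0.5):
    weights = [math.exp(p - g) for p, g in zip(prev_logprobs, gen_logprobs)]
    if mode == "tis":
        return [min(w, w_max) for w in weights]  # clamp the upper tail only
    if mode == "icepop":
        # zero out (filter) any weight outside [w_min, w_max]
        return [w if w_min <= w <= w_max else 0.0 for w in weights]
    raise ValueError(f"unknown mode: {mode}")
```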

## Full Example Config (Annotated)

The ProRLv2 example config is intentionally small and relies on defaults from `grpo_math_1B.yaml`.

- **Example config**: [`examples/configs/prorlv2.yaml`](../../examples/configs/prorlv2.yaml)
- **Base defaults**: [`examples/configs/grpo_math_1B.yaml`](../../examples/configs/grpo_math_1B.yaml)

## Practical Overrides

A few common overrides when launching:

```bash
uv run examples/run_grpo_math.py \
  --config examples/configs/prorlv2.yaml \
  policy.model_name="Qwen/Qwen2.5-1.5B" \
  logger.wandb_enabled=true \
  logger.wandb.project="prorlv2-dev" \
  checkpointing.checkpoint_dir="results/prorlv2" \
  logger.log_dir="logs/prorlv2"
```

If you want to enable DAPO overlong reward shaping instead of stop-properly:

```bash
uv run examples/run_grpo_math.py \
  --config examples/configs/prorlv2.yaml \
  grpo.reward_shaping.stop_properly_penalty_coef=null \
  grpo.reward_shaping.overlong_buffer_length=4096 \
  grpo.reward_shaping.overlong_buffer_penalty=1.0 \
  grpo.reward_shaping.max_response_length=20480
```
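
For reference, DAPO's soft overlong punishment can be sketched as follows. This is a hedged sketch following the DAPO paper's piecewise-linear penalty, not necessarily the exact formula in `apply_reward_shaping()`; the defaults mirror the override values above.

```python
def overlong_penalty(length, max_response_length=20480,
                     buffer_length=4096, buffer_penalty=1.0):
    # responses shorter than (max - buffer) are unpenalized
    expected = max_response_length - buffer_length
    if length <= expected:
        return 0.0
    # linear ramp from 0 (at expected) to -buffer_penalty (at max length),
    # clipped at the full penalty beyond max length
    return max((expected - length) / buffer_length, -1.0) * buffer_penalty
```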

## What to Monitor

In addition to task rewards/accuracy, a few stability signals are particularly useful with ProRLv2-style runs:

- **Dynamic sampling efficiency**: if enabled, watch how often batches need multiple generation rounds (see `dapo.md` for detailed guidance).
- **Training–generation mismatch**: `token_mult_prob_error`, `gen_kl_error`, `policy_kl_error`, `js_divergence_error` are computed in `ClippedPGLossFn` (see the [GRPO metrics section](grpo.md#metrics)).
- **Truncation rate**: if high, either increase `policy.max_total_sequence_length`/`policy.generation.max_model_len` or relax truncation penalty (`stop_properly_penalty_coef`).

## References

- **ProRLv2 blog**: [Scaling LLM Reinforcement Learning with Prolonged Training using ProRL v2](https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/)
- **DAPO**: [Decoupled Clip and Dynamic Sampling Policy Optimization](https://arxiv.org/pdf/2503.14476)
- **GRPO**: [Group Relative Policy Optimization](https://arxiv.org/abs/2402.03300)
- **REINFORCE++**: [REINFORCE++](https://arxiv.org/abs/2501.03262)
- **DLER (stop properly penalty explanation)**: [DLER](https://arxiv.org/pdf/2510.15110)
- **[NeMo RL GRPO Guide](grpo.md)**
- **[NeMo RL DAPO Guide](dapo.md)**
1 change: 1 addition & 0 deletions docs/index.md
@@ -209,6 +209,7 @@ adding-new-models.md
guides/sft.md
guides/dpo.md
guides/dapo.md
guides/prorlv2.md
guides/grpo.md
guides/grpo-deepscaler.md
guides/grpo-sliding-puzzle.md
12 changes: 12 additions & 0 deletions examples/configs/grpo_math_1B.yaml
@@ -22,6 +22,15 @@ grpo:
    overlong_buffer_length: 128
    overlong_buffer_penalty: 1
    max_response_length: ${policy.max_total_sequence_length}
    stop_properly_penalty_coef: null

  # Advantage Estimator Configuration
  # Options: "grpo" (default) or "reinforce_plus_plus"
  adv_estimator:
    name: "grpo" # Use "reinforce_plus_plus" for Reinforce++ estimator
    normalize_rewards: true
    use_leave_one_out_baseline: false
    minus_baseline: true # Reinforce++-baseline specific: subtract per-prompt mean baseline
  reward_scaling:
    enabled: false
    source_min: 0.0
@@ -52,9 +61,12 @@ loss_fn:
  # Set to true when async_grpo.enabled is true
  use_importance_sampling_correction: false
  truncated_importance_sampling_ratio: null
  truncated_importance_sampling_ratio_min: null # Lower bound for ICE-POP
  truncated_importance_sampling_type: tis # "tis" (clamp to max) or "icepop" (filter outside [min, max])
  sequence_level_importance_ratios: false
  token_level_loss: true
  force_on_policy_ratio: false # Set to true to force ratio=1.0 (requires train_global_batch_size == num_prompts_per_step * num_generations_per_prompt)
  use_kl_in_reward: false # Reinforce++: add KL penalty to reward instead of loss

checkpointing:
  enabled: true
106 changes: 106 additions & 0 deletions examples/configs/prorlv2.yaml
@@ -0,0 +1,106 @@
# ProRLv2 Algorithm Configuration
#
# This configuration implements ProRLv2 with the following techniques:
# - Dynamic Sampling: Filter prompts with zero reward variance
# - Decoupled Clipping: Asymmetric ratio clipping (clip_max > clip_min)
# - Token-level Loss: Fine-grained policy gradient
# - Truncated Importance Sampling (TIS) / IcePop for MoE models
# - REINFORCE++: Decoupled local and global advantage normalization estimator
# - Stop properly penalty: Reward scale coefficient for truncated responses
#
# Inherits from grpo_math_1B.yaml
#
# Usage:
# python examples/run_grpo_math.py --config examples/configs/prorlv2.yaml
#
# Reference papers and blogs:
# ProRLv2: https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/
# REINFORCE++: https://arxiv.org/abs/2501.03262
# The Online IcePop Solution for MoE models: https://hijkzzz.notion.site/online-ice-pop
# DLER (for Stop properly penalty): https://arxiv.org/pdf/2510.15110

defaults: "grpo_math_1B.yaml"

grpo:
  # ============================================================================
  # DAPO: Dynamic Sampling
  # Filter out prompts where all generations have the same reward (std=0)
  # This focuses training on "learnable" examples with mixed outcomes
  # ============================================================================
  use_dynamic_sampling: true
  dynamic_sampling_max_gen_batches: 10 # Max generation rounds before raising an error
  batch_multiplier: 1.5 # Generate more prompts to account for filtering

  # ============================================================================
  # Advantage Estimator
  # Options: "grpo" (default) or "reinforce_plus_plus"
  # ============================================================================
  adv_estimator:
    name: "reinforce_plus_plus" # Use "grpo" for standard GRPO
    # Global normalization of rewards
    normalize_rewards: true
    use_leave_one_out_baseline: false
    # Reinforce++-Baseline specific
    minus_baseline: true

  # ============================================================================
  # Reward Shaping
  # Applied to rewards before advantage calculation
  # Includes DAPO overlong penalty and stop properly penalty
  # ============================================================================
  reward_shaping:
    enabled: true
    # Stop properly penalty: scale factor for truncated responses (0-1)
    # 0 = zero reward for truncated (default), 1 = no penalty
    stop_properly_penalty_coef: 0.0 # Set to e.g., 0.5 to halve truncated rewards

# ============================================================================
# Loss Function Configuration
# ============================================================================
loss_fn:
  # KL regularization
  reference_policy_kl_penalty: 0.0001
  reference_policy_kl_type: "k2"
  kl_input_clamp_value: 20.0
  kl_output_clamp_value: 10.0

  # ============================================================================
  # DAPO: Decoupled (Asymmetric) Clipping
  # ratio_clip_max > ratio_clip_min allows more exploration
  # Standard PPO uses symmetric clipping (both = 0.2)
  # ============================================================================
  ratio_clip_min: 0.2
  ratio_clip_max: 0.27 # Slightly larger for exploration

  # Dual-clipping (set to e.g., 3.0 to enable, null to disable)
  ratio_clip_c: null

  # ============================================================================
  # DAPO: Token-level Loss
  # Compute loss per-token instead of per-sequence
  # ============================================================================
  token_level_loss: true

  # ============================================================================
  # Truncated Importance Sampling (TIS / ICE-POP)
  # Requires use_importance_sampling_correction: true
  # ============================================================================
  use_importance_sampling_correction: true
  truncated_importance_sampling_ratio: 5.0 # Upper bound
  truncated_importance_sampling_ratio_min: 0.5 # Lower bound (ICE-POP only)
  # Type: "tis" (clamp to max) or "icepop" (filter outside [min, max])
  truncated_importance_sampling_type: "icepop"

  # Reinforce++: add KL penalty to reward instead of loss
  # Set to false to use external KL loss (reference_policy_kl_penalty) for better stability
  use_kl_in_reward: false

# ============================================================================
# Output directories
# ============================================================================
checkpointing:
  checkpoint_dir: "results/prorl"

logger:
  log_dir: "logs/prorl"
@@ -0,0 +1,29 @@
defaults: ../../prorlv2.yaml
grpo:
  max_num_steps: 450
checkpointing:
  checkpoint_dir: results/prorlv2-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1
policy:
  model_name: Qwen/Qwen2.5-Math-1.5B-Instruct
  tokenizer:
    name: Qwen/Qwen2.5-Math-1.5B-Instruct
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 512
    vllm_cfg:
      max_model_len: 512
data:
  max_input_seq_length: 512
logger:
  log_dir: logs/prorlv2-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: prorlv2-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1
cluster:
  gpus_per_node: 8
7 changes: 7 additions & 0 deletions examples/configs/vlm_grpo_3B.yaml
@@ -22,6 +22,13 @@ grpo:
    overlong_buffer_length: 512
    overlong_buffer_penalty: 1
    max_response_length: ${policy.max_total_sequence_length}
  # Advantage Estimator Configuration
  # Options: "grpo" (default) or "reinforce_plus_plus"
  adv_estimator:
    name: "grpo" # Use "reinforce_plus_plus" for Reinforce++ estimator
    normalize_rewards: true
    use_leave_one_out_baseline: false
    minus_baseline: true # Reinforce++-baseline specific: subtract per-prompt mean baseline
  reward_scaling:
    enabled: false
    source_min: 0.0