feat: Implement ProRLv2 recipe #1809
Conversation
📝 Walkthrough

This PR introduces ProRL support with Reinforce++ advantage estimation and ICE-POP importance sampling filtering. It adds a new configuration file, implements configurable advantage estimators, refactors grpo.py to defer advantage computation until logprobs are available, and extends loss functions with truncated importance sampling options.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Sampler as Dynamic Sampler
    participant AdvEst as Advantage Estimator
    participant Reward as Reward Shaping
    participant Loss as Loss Function
    Sampler->>Sampler: Filter non-informative prompts<br/>(max 10 batches, 1.5x multiplier)
    activate Sampler
    Sampler->>AdvEst: Pass prompt_ids, rewards, mask
    deactivate Sampler
    activate AdvEst
    alt Reinforce++ Path
        AdvEst->>AdvEst: Subtract per-prompt baseline (optional)
        AdvEst->>AdvEst: Integrate KL penalty into reward<br/>(if use_kl_in_reward enabled)
    else GRPO Path
        AdvEst->>AdvEst: Calculate leave-one-out baseline
        AdvEst->>AdvEst: Normalize rewards by std (optional)
    end
    AdvEst->>AdvEst: Apply global batch normalization
    AdvEst->>Reward: Return standardized advantages
    deactivate AdvEst
    activate Reward
    Reward->>Reward: Apply stop-penalty shaping
    Reward->>Loss: Pass shaped rewards
    deactivate Reward
    activate Loss
    Loss->>Loss: Compute per-token loss with<br/>asymmetric clipping (0.2–0.27)
    Loss->>Loss: Apply Truncated Importance Sampling<br/>(TIS: clamp | ICE-POP: filter)
    Loss->>Loss: Optionally adjust with KL penalties
    Loss->>Loss: Output final loss
    deactivate Loss
```
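The TIS/ICE-POP step in the diagram reduces, per token, to a correction on the importance ratio. Below is a minimal, dependency-free sketch of the two modes; the function name and default thresholds are illustrative, not the PR's actual API:

```python
# Illustrative sketch of the two truncated-importance-sampling modes:
# "tis" clamps the importance ratio from above, while "icepop" zero-weights
# tokens whose ratio falls outside [ratio_min, ratio_max].
def correct_ratio(ratio: float, mode: str,
                  ratio_min: float = 0.5, ratio_max: float = 2.0) -> float:
    if mode == "tis":
        return min(ratio, ratio_max)  # clamp max
    if mode == "icepop":
        # filter: drop (zero-weight) tokens with out-of-range ratios
        return ratio if ratio_min <= ratio <= ratio_max else 0.0
    raise ValueError(f"unknown truncated_importance_sampling_type: {mode}")

ratios = [0.3, 0.9, 1.5, 4.0]
print([correct_ratio(r, "tis") for r in ratios])     # 4.0 is clamped to 2.0
print([correct_ratio(r, "icepop") for r in ratios])  # 0.3 and 4.0 are dropped
```

In both modes the correction is applied token-wise before the loss is aggregated; TIS preserves a bounded gradient for high-ratio tokens, while ICE-POP removes them from the update entirely.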
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
⚠️ Review ran into problems — Git: Failed to clone repository.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@examples/configs/prorl.yaml`:
- Line 40: The YAML value for `use_leave_one_out_baseline` is misspelled as "fasle", which YAML parses as a plain string rather than a boolean; change it to the literal `false` so the config loads correctly and the flag behaves as expected.
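For reference, the corrected line would look like this (a minimal fragment; the surrounding keys of prorl.yaml are not reproduced here):

```yaml
# examples/configs/prorl.yaml — line 40 (surrounding context assumed)
use_leave_one_out_baseline: false  # was "fasle", which YAML treats as a string, not a boolean
```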
In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around lines 20-21: The class docstring for `ReinforcePlusPlusAdvantageEstimator` contains the duplicated phrase "KL penalty in reward and KL penalty in reward"; remove the duplicate and make the wording concise (e.g., "Reinforce++ with optional baseline subtraction (minus_baseline) and KL penalty in reward"), ensuring the docstring accurately describes the minus_baseline and KL penalty behavior.
- Around lines 91-98: The code adds a KL penalty without validating the KL config. Update the guard that uses `use_kl_in_reward` so `calculate_kl` is only called when `self.kl_coef` and `self.kl_type` are both non-None (in addition to `self.use_kl_in_reward` being set and logprobs being non-None). If either is missing, either raise a clear `ValueError` mentioning `self.kl_coef`/`self.kl_type` or skip applying the KL term to `adv`.
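The suggested guard can be sketched in plain Python. The `Estimator` class and the toy `calculate_kl` below are stand-ins for illustration, not the repository's actual implementation:

```python
# Hedged sketch of the validation guard suggested above. calculate_kl is a
# toy stand-in (simple logprob difference), not NeMo RL's estimator.
def calculate_kl(logprobs, ref_logprobs, kl_type):
    return [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]

class Estimator:
    def __init__(self, use_kl_in_reward, kl_coef, kl_type):
        self.use_kl_in_reward = use_kl_in_reward
        self.kl_coef = kl_coef
        self.kl_type = kl_type

    def apply_kl(self, adv, logprobs, ref_logprobs):
        if self.use_kl_in_reward and logprobs is not None:
            # validate the KL config before use, raising a clear error if missing
            if self.kl_coef is None or self.kl_type is None:
                raise ValueError(
                    "use_kl_in_reward=True requires kl_coef and kl_type to be set"
                )
            kl = calculate_kl(logprobs, ref_logprobs, self.kl_type)
            adv = [a - self.kl_coef * k for a, k in zip(adv, kl)]
        return adv

est = Estimator(use_kl_in_reward=True, kl_coef=0.01, kl_type="k1")
print(est.apply_kl([1.0, 1.0], [-1.0, -2.0], [-1.5, -1.5]))
```

With this guard, a misconfigured estimator fails fast with a descriptive error instead of crashing inside `calculate_kl` with a `TypeError` on a `None` coefficient.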
🧹 Nitpick comments (6)
nemo_rl/algorithms/loss_functions.py (1)

141-148: Default values in code contradict coding guidelines. Per coding guidelines: "YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values." The defaults `"tis"` and `0.5` should be defined in YAML configs, not here.

Proposed fix:

```diff
-        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
-        self.truncated_importance_sampling_type = cfg.get(
-            "truncated_importance_sampling_type", "tis"
-        )
-        # Lower bound for ICE-POP filtering (default 0.5)
-        self.truncated_importance_sampling_ratio_min = cfg.get(
-            "truncated_importance_sampling_ratio_min", 0.5
-        )
+        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
+        self.truncated_importance_sampling_type = cfg.get(
+            "truncated_importance_sampling_type"
+        )
+        # Lower bound for ICE-POP filtering
+        self.truncated_importance_sampling_ratio_min = cfg.get(
+            "truncated_importance_sampling_ratio_min"
+        )
```

Then ensure all YAML configs that use `truncated_importance_sampling_ratio` also specify `truncated_importance_sampling_type` and `truncated_importance_sampling_ratio_min`. Based on coding guidelines.

nemo_rl/algorithms/advantage_estimator.py (2)
67-72: Default values in code contradict coding guidelines. Similar to loss_functions.py, defaults like `True`, `0.01`, and `"k3"` should come from YAML configs per coding guidelines.

Proposed fix:

```diff
     def __init__(self, estimator_config: dict, loss_config: dict):
-        self.minus_baseline = estimator_config.get("minus_baseline", True)
-        self.use_kl_in_reward = loss_config.get("use_kl_in_reward", False)
-        self.kl_coef = loss_config.get("reference_policy_kl_penalty", 0.01)
-        self.kl_type = loss_config.get("reference_policy_kl_type", "k3")
+        self.minus_baseline = estimator_config.get("minus_baseline")
+        self.use_kl_in_reward = loss_config.get("use_kl_in_reward")
+        self.kl_coef = loss_config.get("reference_policy_kl_penalty")
+        self.kl_type = loss_config.get("reference_policy_kl_type")
```

Based on coding guidelines.
100-106: Global normalization may have numerical issues with small masks. `mask.sum()` could theoretically be zero or very small, leading to division issues. Consider adding a minimum clamp similar to the variance clamp.

Proposed fix:

```diff
         # global normalization across the batch
-        adv_mean = (adv * mask).sum() / mask.sum()
-        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
+        mask_sum = mask.sum().clamp(min=1)
+        adv_mean = (adv * mask).sum() / mask_sum
+        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask_sum
         adv_rstd = adv_var.clamp(min=1e-8).rsqrt()
         adv = (adv - adv_mean) * adv_rstd
```

nemo_rl/algorithms/grpo.py (2)
1113-1129: Duplicated adv_estimator initialization code. The same initialization logic appears in both `grpo_train` (lines 1113-1129) and `async_grpo_train` (lines 2120-2136). Extract it to a helper function to follow the DRY principle.

Proposed refactor:

```python
def _create_advantage_estimator(master_config: MasterConfig):
    """Create advantage estimator based on configuration."""
    adv_estimator_config = master_config["grpo"].get("adv_estimator", {})
    adv_estimator_config.setdefault("name", "grpo")
    adv_estimator_config.setdefault(
        "use_leave_one_out_baseline",
        master_config["grpo"]["use_leave_one_out_baseline"],
    )
    adv_estimator_config.setdefault(
        "normalize_rewards", master_config["grpo"]["normalize_rewards"]
    )
    loss_config = master_config["loss_fn"]
    adv_estimator_name = adv_estimator_config["name"]
    if adv_estimator_name == "grpo":
        print("  ✓ Using GRPO advantage estimator")
        return GRPOAdvantageEstimator(adv_estimator_config, loss_config)
    elif adv_estimator_name == "reinforce_plus_plus":
        print("  ✓ Using Reinforce++ advantage estimator")
        return ReinforcePlusPlusAdvantageEstimator(adv_estimator_config, loss_config)
    else:
        raise ValueError(f"Invalid adv_estimator name: {adv_estimator_name}")
```
1397-1406: Placeholder advantages with shape mismatch. The placeholder `advantages` has shape `(batch_size, 1)` but later gets expanded to `message["token_ids"].shape` in lines 1437-1438. This works, but the intermediate `del advantages` on line 1440 happens before the actual advantages are computed; the real advantages are then computed in lines 1511-1517. The flow is correct but the variable reuse is confusing. Consider renaming the placeholder to `placeholder_advantages` for clarity, or restructuring so the placeholder isn't deleted before the real computation.

examples/configs/prorl.yaml (1)
89-90: `use_kl_in_reward: false` but Reinforce++ selected. With `adv_estimator.name: "reinforce_plus_plus"` and `use_kl_in_reward: false`, the KL penalty will be in the loss (not the reward). This is valid, but worth noting that the "full" Reinforce++ experience typically uses KL in reward. Consider adding a comment clarifying this choice.
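The masked-normalization concern raised above can be illustrated without torch. This dependency-free sketch mirrors the suggested fix; names and values are illustrative:

```python
# Torch-free sketch of masked global normalization with the suggested clamps:
# an all-zero mask would otherwise raise ZeroDivisionError, and zero variance
# would produce an infinite reciprocal standard deviation.
def masked_normalize(adv, mask, eps=1e-8):
    mask_sum = max(sum(mask), 1)  # mirrors mask.sum().clamp(min=1)
    mean = sum(a * m for a, m in zip(adv, mask)) / mask_sum
    var = sum(((a - mean) ** 2) * m for a, m in zip(adv, mask)) / mask_sum
    rstd = max(var, eps) ** -0.5  # mirrors clamp(min=1e-8).rsqrt()
    return [(a - mean) * rstd for a in adv]

# Masked-out entries are excluded from the statistics:
print(masked_normalize([1.0, 3.0, 100.0], [1, 1, 0]))  # first two map to -1.0, 1.0
# A fully masked batch no longer divides by zero:
masked_normalize([1.0, 2.0], [0, 0])
```

Note that, as in the proposed diff, normalization is still applied to every entry (including masked ones); masked positions are simply excluded from the mean/variance and ignored downstream by the loss mask.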
|
@hijkzzz Thanks for the contribution. I think @yuki-97 had a draft PR with similar contents before. Is this an improvement on it or fresh-start work? If you are resuming her work, can you add her as Co-authored-By in the initial commit? |
|
@joyang-nv @yuki-97 doesn’t have a complete implementation of ProRL v2… and it’s not based on the latest codebase. |
|
@hijkzzz You are right. @yuki-97 couldn't finish what she planned due to priorities from the management team, and it looks like you are resuming the rest. :) |
|
@joyang-nv This implementation wasn’t developed based on @yuki-97 ’s MR. This was rewritten from scratch and is still being debugged. |
|
Hi @hijkzzz: Please note that you are the author of ProRL, and @yuki-97 has also been contributing to ProRL and landing it in NeMo RL. Yuki was pulled into this contribution at my request last year and spent over a month of dedicated effort on it; unfortunately, I pulled her onto other high-priority tasks this year. I still want to recognize her efforts on bringing ProRL to NeMo RL, and I thank you for your contributions to ProRL and NeMo RL as well. Here is the list of work Yuki has done and the pending tasks: Pending:
Dev branches:
|
|
@joyang-nv This MR was redeveloped from scratch and wasn’t based on the previous code. Most of the time went into debugging rather than writing code; the code itself was completed in about an hour using Cursor. |
Signed-off-by: jianh <jianh@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: hijkzzz <janhu9527@gmail.com>
- Disable dynamic sampling in prorlv2.sh to avoid sampling failures with small models (small models tend to produce uniform rewards, causing std=0 for most prompt groups)
- Set batch_multiplier=1 (required when dynamic_sampling is disabled)
- Remove default values for TIS config options to use None as default

Signed-off-by: jianh <jianh@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
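The std=0 failure mode mentioned in this commit is easy to reproduce with a toy example. Here dynamic sampling is reduced to its core idea of keeping only prompt groups whose rewards carry signal; the prompt names and filtering rule are illustrative, not the recipe's actual code:

```python
# Hedged sketch: dynamic sampling keeps only prompt groups whose rewards have
# non-zero std. Small models often emit uniform rewards per group, so most
# groups get filtered out and the batch cannot be filled.
from statistics import pstdev

groups = {
    "prompt_a": [0.0, 0.0, 0.0, 0.0],  # all wrong  -> std 0 -> filtered
    "prompt_b": [1.0, 1.0, 1.0, 1.0],  # all right  -> std 0 -> filtered
    "prompt_c": [0.0, 1.0, 0.0, 1.0],  # informative -> kept
}
kept = [prompt for prompt, rewards in groups.items() if pstdev(rewards) > 0]
print(kept)  # only the informative group survives
```

When nearly every group looks like `prompt_a` or `prompt_b`, repeated resampling still cannot produce a full batch, which is why the commit disables dynamic sampling for small models instead.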
|
hi @hijkzzz . there was a functional test failure: |
So we need to use a default value here. |
Signed-off-by: jianh <jianh@nvidia.com>
|
@terrykong please merge it |
|
Okay, the good news is that CI is all passing. @hijkzzz, there was a merge conflict due to a recently landed PR for spec decoding. Are you able to resolve it? |
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	nemo_rl/algorithms/grpo.py
|
@terrykong just merge it without L1 tests. |
|
This one previously failed on L1 tests, so it is prudent to run them again. |
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	tests/unit/test_recipes_and_test_suites.py
|
@hijkzzz I understand that dealing with conflicts and having to update code is inconvenient, but it is also the nature of software development. If we skip the tests now and they later fail in our nightlies, it becomes someone else's problem to fix, despite us knowing something is wrong in this PR. I've asked other developers to pause PRs that may conflict. Please fix the known failures and let's land this PR. Thanks again for working through this |
|
OK, I will fix it this weekend. |
Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: jianh <jianh@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>




Summary by CodeRabbit