feat: Implement ProRLv2 recipe #1809
Conversation
📝 Walkthrough

This PR introduces ProRL support with Reinforce++ advantage estimation and ICE-POP importance sampling filtering. It adds a new configuration file, implements configurable advantage estimators, refactors grpo.py to defer advantage computation until logprobs are available, and extends loss functions with truncated importance sampling options.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Sampler as Dynamic Sampler
    participant AdvEst as Advantage Estimator
    participant Reward as Reward Shaping
    participant Loss as Loss Function
    Sampler->>Sampler: Filter non-informative prompts<br/>(max 10 batches, 1.5x multiplier)
    activate Sampler
    Sampler->>AdvEst: Pass prompt_ids, rewards, mask
    deactivate Sampler
    activate AdvEst
    alt Reinforce++ Path
        AdvEst->>AdvEst: Subtract per-prompt baseline (optional)
        AdvEst->>AdvEst: Integrate KL penalty into reward<br/>(if use_kl_in_reward enabled)
    else GRPO Path
        AdvEst->>AdvEst: Calculate leave-one-out baseline
        AdvEst->>AdvEst: Normalize rewards by std (optional)
    end
    AdvEst->>AdvEst: Apply global batch normalization
    AdvEst->>Reward: Return standardized advantages
    deactivate AdvEst
    activate Reward
    Reward->>Reward: Apply stop-penalty shaping
    Reward->>Loss: Pass shaped rewards
    deactivate Reward
    activate Loss
    Loss->>Loss: Compute per-token loss with<br/>asymmetric clipping (0.2–0.27)
    Loss->>Loss: Apply Truncated Importance Sampling<br/>(TIS: clamp | ICE-POP: filter)
    Loss->>Loss: Optionally adjust with KL penalties
    Loss->>Loss: Output final loss
    deactivate Loss
```
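The TIS/ICE-POP step in the diagram reduces, per token, to a correction on the importance ratio. Below is a minimal, dependency-free sketch of the two modes; the function name and default thresholds are illustrative, not the PR's actual API:

```python
# Illustrative sketch of the two truncated-importance-sampling modes:
# "tis" clamps the importance ratio from above, while "icepop" zero-weights
# tokens whose ratio falls outside [ratio_min, ratio_max].
def correct_ratio(ratio: float, mode: str,
                  ratio_min: float = 0.5, ratio_max: float = 2.0) -> float:
    if mode == "tis":
        return min(ratio, ratio_max)  # clamp max
    if mode == "icepop":
        # filter: drop (zero-weight) tokens with out-of-range ratios
        return ratio if ratio_min <= ratio <= ratio_max else 0.0
    raise ValueError(f"unknown truncated_importance_sampling_type: {mode}")

ratios = [0.3, 0.9, 1.5, 4.0]
print([correct_ratio(r, "tis") for r in ratios])     # 4.0 is clamped to 2.0
print([correct_ratio(r, "icepop") for r in ratios])  # 0.3 and 4.0 are dropped
```

In both modes the correction is applied token-wise before the loss is aggregated; TIS preserves a bounded gradient for high-ratio tokens, while ICE-POP removes them from the update entirely.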
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
⚠️ Review ran into problems — Git: Failed to clone repository.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@examples/configs/prorl.yaml`:
- Line 40: The YAML value for `use_leave_one_out_baseline` is misspelled as "fasle", which YAML parses as a plain string rather than a boolean; change it to the literal `false` so the config loads correctly and the flag behaves as expected.
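For reference, the corrected line would look like this (a minimal fragment; the surrounding keys of prorl.yaml are not reproduced here):

```yaml
# examples/configs/prorl.yaml — line 40 (surrounding context assumed)
use_leave_one_out_baseline: false  # was "fasle", which YAML treats as a string, not a boolean
```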
In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around lines 20-21: The class docstring for `ReinforcePlusPlusAdvantageEstimator` contains the duplicated phrase "KL penalty in reward and KL penalty in reward"; remove the duplicate and make the wording concise (e.g., "Reinforce++ with optional baseline subtraction (minus_baseline) and KL penalty in reward"), ensuring the docstring accurately describes the minus_baseline and KL penalty behavior.
- Around lines 91-98: The code adds a KL penalty without validating the KL config. Update the guard that uses `use_kl_in_reward` so `calculate_kl` is only called when `self.kl_coef` and `self.kl_type` are both non-None (in addition to `self.use_kl_in_reward` being set and logprobs being non-None). If either is missing, either raise a clear `ValueError` mentioning `self.kl_coef`/`self.kl_type` or skip applying the KL term to `adv`.
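The suggested guard can be sketched in plain Python. The `Estimator` class and the toy `calculate_kl` below are stand-ins for illustration, not the repository's actual implementation:

```python
# Hedged sketch of the validation guard suggested above. calculate_kl is a
# toy stand-in (simple logprob difference), not NeMo RL's estimator.
def calculate_kl(logprobs, ref_logprobs, kl_type):
    return [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]

class Estimator:
    def __init__(self, use_kl_in_reward, kl_coef, kl_type):
        self.use_kl_in_reward = use_kl_in_reward
        self.kl_coef = kl_coef
        self.kl_type = kl_type

    def apply_kl(self, adv, logprobs, ref_logprobs):
        if self.use_kl_in_reward and logprobs is not None:
            # validate the KL config before use, raising a clear error if missing
            if self.kl_coef is None or self.kl_type is None:
                raise ValueError(
                    "use_kl_in_reward=True requires kl_coef and kl_type to be set"
                )
            kl = calculate_kl(logprobs, ref_logprobs, self.kl_type)
            adv = [a - self.kl_coef * k for a, k in zip(adv, kl)]
        return adv

est = Estimator(use_kl_in_reward=True, kl_coef=0.01, kl_type="k1")
print(est.apply_kl([1.0, 1.0], [-1.0, -2.0], [-1.5, -1.5]))
```

With this guard, a misconfigured estimator fails fast with a descriptive error instead of crashing inside `calculate_kl` with a `TypeError` on a `None` coefficient.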
🧹 Nitpick comments (6)
nemo_rl/algorithms/loss_functions.py (1)

141-148: Default values in code contradict coding guidelines. Per coding guidelines: "YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values." The defaults `"tis"` and `0.5` should be defined in YAML configs, not here.

Proposed fix:

```diff
-        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
-        self.truncated_importance_sampling_type = cfg.get(
-            "truncated_importance_sampling_type", "tis"
-        )
-        # Lower bound for ICE-POP filtering (default 0.5)
-        self.truncated_importance_sampling_ratio_min = cfg.get(
-            "truncated_importance_sampling_ratio_min", 0.5
-        )
+        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
+        self.truncated_importance_sampling_type = cfg.get(
+            "truncated_importance_sampling_type"
+        )
+        # Lower bound for ICE-POP filtering
+        self.truncated_importance_sampling_ratio_min = cfg.get(
+            "truncated_importance_sampling_ratio_min"
+        )
```

Then ensure all YAML configs that use `truncated_importance_sampling_ratio` also specify `truncated_importance_sampling_type` and `truncated_importance_sampling_ratio_min`. Based on coding guidelines.

nemo_rl/algorithms/advantage_estimator.py (2)
67-72: Default values in code contradict coding guidelines. Similar to loss_functions.py, defaults like `True`, `0.01`, and `"k3"` should come from YAML configs per coding guidelines.

Proposed fix:

```diff
     def __init__(self, estimator_config: dict, loss_config: dict):
-        self.minus_baseline = estimator_config.get("minus_baseline", True)
-        self.use_kl_in_reward = loss_config.get("use_kl_in_reward", False)
-        self.kl_coef = loss_config.get("reference_policy_kl_penalty", 0.01)
-        self.kl_type = loss_config.get("reference_policy_kl_type", "k3")
+        self.minus_baseline = estimator_config.get("minus_baseline")
+        self.use_kl_in_reward = loss_config.get("use_kl_in_reward")
+        self.kl_coef = loss_config.get("reference_policy_kl_penalty")
+        self.kl_type = loss_config.get("reference_policy_kl_type")
```

Based on coding guidelines.
100-106: Global normalization may have numerical issues with small masks. `mask.sum()` could theoretically be zero or very small, leading to division issues. Consider adding a minimum clamp similar to the variance clamp.

Proposed fix:

```diff
         # global normalization across the batch
-        adv_mean = (adv * mask).sum() / mask.sum()
-        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
+        mask_sum = mask.sum().clamp(min=1)
+        adv_mean = (adv * mask).sum() / mask_sum
+        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask_sum
         adv_rstd = adv_var.clamp(min=1e-8).rsqrt()
         adv = (adv - adv_mean) * adv_rstd
```

nemo_rl/algorithms/grpo.py (2)
1113-1129: Duplicated adv_estimator initialization code. The same initialization logic appears in both `grpo_train` (lines 1113-1129) and `async_grpo_train` (lines 2120-2136). Extract it to a helper function to follow the DRY principle.

Proposed refactor:

```python
def _create_advantage_estimator(master_config: MasterConfig):
    """Create advantage estimator based on configuration."""
    adv_estimator_config = master_config["grpo"].get("adv_estimator", {})
    adv_estimator_config.setdefault("name", "grpo")
    adv_estimator_config.setdefault(
        "use_leave_one_out_baseline",
        master_config["grpo"]["use_leave_one_out_baseline"],
    )
    adv_estimator_config.setdefault(
        "normalize_rewards", master_config["grpo"]["normalize_rewards"]
    )
    loss_config = master_config["loss_fn"]
    adv_estimator_name = adv_estimator_config["name"]
    if adv_estimator_name == "grpo":
        print("  ✓ Using GRPO advantage estimator")
        return GRPOAdvantageEstimator(adv_estimator_config, loss_config)
    elif adv_estimator_name == "reinforce_plus_plus":
        print("  ✓ Using Reinforce++ advantage estimator")
        return ReinforcePlusPlusAdvantageEstimator(adv_estimator_config, loss_config)
    else:
        raise ValueError(f"Invalid adv_estimator name: {adv_estimator_name}")
```
1397-1406: Placeholder advantages with shape mismatch. The placeholder `advantages` has shape `(batch_size, 1)` but later gets expanded to `message["token_ids"].shape` in lines 1437-1438. This works, but the intermediate `del advantages` on line 1440 happens before the actual advantages are computed; the real advantages are then computed in lines 1511-1517. The flow is correct but the variable reuse is confusing. Consider renaming the placeholder to `placeholder_advantages` for clarity, or restructuring so the placeholder isn't deleted before the real computation.

examples/configs/prorl.yaml (1)
89-90: `use_kl_in_reward: false` but Reinforce++ selected. With `adv_estimator.name: "reinforce_plus_plus"` and `use_kl_in_reward: false`, the KL penalty will be in the loss (not the reward). This is valid, but worth noting that the "full" Reinforce++ experience typically uses KL in reward. Consider adding a comment clarifying this choice.
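The masked-normalization concern raised above can be illustrated without torch. This dependency-free sketch mirrors the suggested fix; names and values are illustrative:

```python
# Torch-free sketch of masked global normalization with the suggested clamps:
# an all-zero mask would otherwise raise ZeroDivisionError, and zero variance
# would produce an infinite reciprocal standard deviation.
def masked_normalize(adv, mask, eps=1e-8):
    mask_sum = max(sum(mask), 1)  # mirrors mask.sum().clamp(min=1)
    mean = sum(a * m for a, m in zip(adv, mask)) / mask_sum
    var = sum(((a - mean) ** 2) * m for a, m in zip(adv, mask)) / mask_sum
    rstd = max(var, eps) ** -0.5  # mirrors clamp(min=1e-8).rsqrt()
    return [(a - mean) * rstd for a in adv]

# Masked-out entries are excluded from the statistics:
print(masked_normalize([1.0, 3.0, 100.0], [1, 1, 0]))  # first two map to -1.0, 1.0
# A fully masked batch no longer divides by zero:
masked_normalize([1.0, 2.0], [0, 0])
```

Note that, as in the proposed diff, normalization is still applied to every entry (including masked ones); masked positions are simply excluded from the mean/variance and ignored downstream by the loss mask.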
|
@hijkzzz Thanks for the contribution. I think @yuki-97 had a draft PR with similar contents before. Is this an improvement on it or fresh-start work? If you are resuming her work, can you add her as Co-authored-By in the initial commit? |
|
@joyang-nv @yuki-97 doesn’t have a complete implementation of ProRL v2… and it’s not based on the latest codebase. |
|
@hijkzzz You are right. @yuki-97 couldn't finish what she planned due to priorities from the management team, and it looks like you are resuming the rest. :) |
|
@joyang-nv This implementation wasn’t developed based on @yuki-97 ’s MR. This was rewritten from scratch and is still being debugged. |
|
Hi @hijkzzz: Please note that you are the author of ProRL, and @yuki-97 has also been contributing to ProRL and landing it in NeMo RL. Yuki was pulled into this contribution at my request last year and spent over a month of dedicated effort on it; unfortunately, I pulled her onto other high-priority tasks this year. I still want to recognize her efforts on bringing ProRL to NeMo RL, and I thank you for your contributions to ProRL and NeMo RL as well. Here is the list of work Yuki has done and the pending tasks: Pending:
Dev branches:
|
|
@joyang-nv This MR was redeveloped from scratch and wasn’t based on the previous code. Most of the time went into debugging rather than writing code; the code itself was completed in about an hour using Cursor. |
Signed-off-by: jianh <jianh@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: hijkzzz <janhu9527@gmail.com>
- Disable dynamic sampling in prorlv2.sh to avoid sampling failures with small models (small models tend to produce uniform rewards, causing std=0 for most prompt groups)
- Set batch_multiplier=1 (required when dynamic_sampling is disabled)
- Remove default values for TIS config options to use None as default

Signed-off-by: jianh <jianh@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
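The std=0 failure mode mentioned in this commit is easy to reproduce with a toy example. Here dynamic sampling is reduced to its core idea of keeping only prompt groups whose rewards carry signal; the prompt names and filtering rule are illustrative, not the recipe's actual code:

```python
# Hedged sketch: dynamic sampling keeps only prompt groups whose rewards have
# non-zero std. Small models often emit uniform rewards per group, so most
# groups get filtered out and the batch cannot be filled.
from statistics import pstdev

groups = {
    "prompt_a": [0.0, 0.0, 0.0, 0.0],  # all wrong  -> std 0 -> filtered
    "prompt_b": [1.0, 1.0, 1.0, 1.0],  # all right  -> std 0 -> filtered
    "prompt_c": [0.0, 1.0, 0.0, 1.0],  # informative -> kept
}
kept = [prompt for prompt, rewards in groups.items() if pstdev(rewards) > 0]
print(kept)  # only the informative group survives
```

When nearly every group looks like `prompt_a` or `prompt_b`, repeated resampling still cannot produce a full batch, which is why the commit disables dynamic sampling for small models instead.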
|
hi @hijkzzz . there was a functional test failure: |
So we need to use a default value here. |
Signed-off-by: jianh <jianh@nvidia.com>
|
@terrykong please merge it |
|
Okay, the good news is that CI is all passing. @hijkzzz, there was a merge conflict due to a recently landed PR for spec decoding. Are you able to resolve it? |
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	nemo_rl/algorithms/grpo.py
|
@terrykong just merge it without L1 tests. |
|
This one previously failed on L1 tests, so it is prudent to run them again. |
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	tests/unit/test_recipes_and_test_suites.py
|
@hijkzzz I understand that dealing with conflicts and having to update code is inconvenient, but it is also the nature of software development. If we skip the tests now and they later fail in our nightlies, it becomes someone else's problem to fix, despite us knowing something is wrong in this PR. I've asked other developers to pause PRs that may conflict. Please fix the known failures and let's land this PR. Thanks again for working through this |
|
OK, I will fix it this weekend. |
Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: jianh <jianh@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>




Summary by CodeRabbit