
feat: enhance advantages tracking and normalization stability in GRPO#1423

Merged
terrykong merged 5 commits into main from ffrujeri/grpo_improvements
Nov 13, 2025

Conversation


@ffrujeri ffrujeri commented Oct 24, 2025

What does this PR do ?

Improves advantages tracking and normalization in GRPO algorithms for better training stability and debugging.

Issues

Key Changes

1. Enhanced Advantage Tracking

  • Added metrics tracking for advantages/mean, advantages/max, and advantages/min
  • These metrics help with debugging unstable training runs by providing visibility into the advantage distribution

2. Improved Advantage Normalization

  • New function: normalize_advantages_with_epsilon(), which adds a small epsilon (1e-6) to the denominator to avoid division by zero, instead of masking
  • Replaces: the previous zero-standard-deviation masking approach in both sync and async GRPO
  • Benefits: more numerically stable and avoids potential issues with masked advantages
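A minimal sketch of what such an epsilon-based normalizer might look like. The function name matches the PR; the exact signature, tensor shapes, and broadcasting are assumptions based on the review comments, not the actual implementation:

```python
import torch

def normalize_advantages_with_epsilon(
    advantages: torch.Tensor,  # assumed shape (batch, seq_len)
    std: torch.Tensor,         # assumed per-sample std, shape (batch,)
    epsilon: float = 1e-6,
) -> torch.Tensor:
    # Adding epsilon to the denominator keeps the division finite even when
    # a prompt group's rewards are identical (std == 0), so no masking is needed.
    return advantages / (std.unsqueeze(-1) + epsilon)
```

With this shape of implementation, a zero-std group simply produces large but finite values rather than NaN/inf or masked-out entries.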

3. Enhanced Standard Deviation Calculation

  • Fixed: Potential numerical issues in calculate_baseline_and_std_per_prompt()
  • Added: More robust standard deviation computation with proper handling of edge cases
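For illustration, a per-prompt baseline/std computation has roughly this shape. This is a sketch only; the real calculate_baseline_and_std_per_prompt in nemo_rl/algorithms/utils.py differs in signature and details:

```python
import torch

def baseline_and_std_per_prompt(
    rewards: torch.Tensor,     # shape (num_samples,)
    prompt_ids: torch.Tensor,  # shape (num_samples,), groups samples by prompt
) -> tuple[torch.Tensor, torch.Tensor]:
    baseline = torch.zeros_like(rewards)
    std = torch.zeros_like(rewards)
    for pid in prompt_ids.unique():
        mask = prompt_ids == pid
        group = rewards[mask]
        baseline[mask] = group.mean()
        # Population std (unbiased=False) yields 0 rather than NaN for a
        # single-generation group, one of the edge cases mentioned above.
        std[mask] = group.std(unbiased=False)
    return baseline, std
```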

Usage

The advantage tracking metrics will automatically appear in your training logs when using GRPO. For example:

# These metrics will now be logged during training:
metrics = {
    "advantages/mean": 0.125,    # Mean advantage value
    "advantages/max": 2.456,     # Maximum advantage value  
    "advantages/min": -1.234,    # Minimum advantage value
    # ... other metrics
}
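For illustration, the logged values could be computed from the advantages tensor roughly like this (the tensor shape and .item() conversion are assumptions, not taken from the PR's code):

```python
import torch

advantages = torch.randn(8, 128)  # hypothetical (batch, seq_len) advantages tensor
metrics = {
    "advantages/mean": torch.mean(advantages).item(),
    "advantages/max": torch.max(advantages).item(),
    "advantages/min": torch.min(advantages).item(),
}
```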

Before your PR is "Ready for review"

Pre checks:

  • [x] Make sure you read and followed the Contributor guidelines
  • [x] Did you write any new necessary tests?
  • [x] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • [x] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

  • Debugging: The new advantages metrics are particularly helpful for debugging unstable runs in math training recipes
  • Numerical Stability: The epsilon-based normalization prevents division by zero errors that could occur with the previous masking approach
  • Backward Compatibility: All changes are backward compatible and don't affect existing configurations
  • Performance: No significant performance impact expected from the additional metric tracking

This improvement addresses stability issues that could affect tensor parallelism configurations in math training recipes, providing better observability and more robust normalization.


The comparison between baseline and treatment metrics can be seen in this report

In terms of perf time_per_step: [image]

In terms of training_rewards: [image]

Summary by CodeRabbit

  • Bug Fixes

    • Improved numerical stability in advantage normalization during training by introducing epsilon-based handling to prevent division-by-zero errors in edge cases.
  • Tests

    • Added comprehensive test coverage for advantage normalization functionality and baseline/standard deviation computations across various scenarios and tensor shapes.

@ffrujeri ffrujeri requested review from a team as code owners October 24, 2025 16:40
@ffrujeri ffrujeri marked this pull request as draft October 24, 2025 16:40
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from github-actions bot Oct 27, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from 5d83308 to 8558320 Compare October 27, 2025 23:11
@ffrujeri ffrujeri marked this pull request as ready for review October 27, 2025 23:11
@ffrujeri ffrujeri requested a review from a team as a code owner October 27, 2025 23:11
@ffrujeri ffrujeri changed the title Add grpo_improvements. feat: enhance advantage tracking and normalization stability in GRPO Oct 27, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from 8558320 to fc5f44b Compare October 27, 2025 23:36
@ffrujeri ffrujeri added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Oct 28, 2025
@ffrujeri ffrujeri changed the title feat: enhance advantage tracking and normalization stability in GRPO feat: enhance advantages tracking and normalization stability in GRPO Oct 28, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from fc5f44b to dee2d6f Compare October 28, 2025 22:58
@ffrujeri ffrujeri added CI:L0 Run doctests and unit tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Oct 30, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch 2 times, most recently from a01c57f to 0dcc5b8 Compare October 31, 2025 22:24

coderabbitai bot commented Oct 31, 2025

📝 Walkthrough

The PR adds a new normalize_advantages_with_epsilon utility function for stable advantage normalization, refactors std calculation in utils, extends advantage normalization across multiple training paths, and adds logging metrics for advantages statistics (mean, max, min) to support debugging of RL training stability.

Changes

Cohort / File(s) and Summary:

  • Core normalization and advantage tracking (nemo_rl/algorithms/grpo.py): Added the normalize_advantages_with_epsilon function for numerically stable advantage normalization; integrated it into grpo_train, async_grpo_train, and dynamic reward processing paths; added logging metrics for advantages statistics (mean, max, min) at multiple training checkpoints.
  • Baseline and std computation (nemo_rl/algorithms/utils.py): Modified std tensor initialization and computation; added per-prompt std calculation within the prompt loop, then globally recomputed std (overwriting per-prompt calculations); logic flow remains unchanged.
  • Test coverage for normalization (tests/unit/algorithms/test_grpo.py): Added five comprehensive test functions for normalize_advantages_with_epsilon: basic normalization, zero std handling, all-zero std, tensor shape variants, and negative advantages.
  • Test coverage for baseline utils (tests/unit/algorithms/test_utils.py): Added extensive tests for calculate_baseline_and_std_per_prompt covering multiple prompts, single generation, identical rewards, empty inputs, NaN handling, CUDA compatibility, and numerical precision edge cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • The per-prompt std calculation in utils.py is computed and then immediately overwritten by the global std calculation, indicating potential redundant or incomplete logic
    • Ensure advantage logging doesn't introduce performance overhead in high-frequency training loops
    • Verify epsilon value (1e-6) is appropriate across different reward scales and batch configurations
    • Cross-check that all training paths (grpo_train, async_grpo_train, dynamic processing) consistently apply the new normalization

Suggested reviewers

  • terrykong
  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes ⚠️ Warning
    Explanation: This PR introduces major changes that directly affect numerics in GRPO training: a new normalize_advantages_with_epsilon function replaces the previous masking logic, normalization is extended across multiple training paths (grpo_train, async_grpo_train, dynamic sampling), and the calculate_baseline_and_std_per_prompt computation is modified. Since these changes affect core numerical operations that impact training stability and convergence, the PR should document test results demonstrating no regression. The PR includes comprehensive unit test coverage with 122 total assertions across 13 new test functions covering edge cases and numerical precision; however, the PR description only states "local unit and functional tests run" without providing actual test results, CI logs, performance benchmarks, or convergence comparisons. The automated CI/CD (L0 unit tests) will execute, but the check requires that test results be documented in the PR description itself for changes affecting numerics.
    Resolution: To pass this check, update the PR description to include one of the following: (1) explicit test results or a link to the CI/CD execution logs showing all unit, functional, and convergence tests pass; (2) convergence benchmarks or training stability metrics demonstrating no regression from the numerical changes; or (3) performance comparison data before and after the changes in relevant configurations. Given that this PR modifies core numerical operations affecting training stability and convergence (as indicated in the linked issue #1395), documentation of test results within the PR description is necessary to meet the standard for major changes affecting numerics.
✅ Passed checks (5 passed)

  • Title Check ✅ Passed: The pull request title "feat: enhance advantages tracking and normalization stability in GRPO" directly and accurately reflects the main changes in the changeset. It succinctly captures the two primary objectives: adding advantages statistics tracking (mean, max, min) and improving normalization stability through the new normalize_advantages_with_epsilon function. The title is specific, concise, and avoids vague or generic language.
  • Linked Issues Check ✅ Passed: The code changes directly address the requirement from linked issue #1395 to track advantages on RL algorithms. The implementation adds logging metrics for advantages/mean, advantages/max, and advantages/min across multiple GRPO training paths (grpo_train, async_grpo_train, and dynamic sampling branches), providing the requested visibility into advantage distribution for debugging unstable runs. Additionally, the new normalize_advantages_with_epsilon function and enhanced standard deviation computation in utils.py address the normalization stability improvements mentioned in the PR objectives.
  • Out of Scope Changes Check ✅ Passed: All changes are directly aligned with the stated objectives. The modifications to nemo_rl/algorithms/grpo.py add the required metrics tracking and the new normalization function, the modifications to nemo_rl/algorithms/utils.py improve the underlying standard deviation computation for numerical robustness, and the test additions provide coverage for both new and modified functions. No out-of-scope changes are evident.
  • Docstring Coverage ✅ Passed: Docstring coverage is 95.00%, which exceeds the required threshold of 80.00%.
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 855151b and 0dcc5b8.

📒 Files selected for processing (4)
  • nemo_rl/algorithms/grpo.py (5 hunks)
  • nemo_rl/algorithms/utils.py (2 hunks)
  • tests/unit/algorithms/test_grpo.py (2 hunks)
  • tests/unit/algorithms/test_utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • tests/unit/algorithms/test_utils.py
  • tests/unit/algorithms/test_grpo.py
  • nemo_rl/algorithms/utils.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/utils.py
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (3)
  • mean (52-97)
  • max (30-32)
  • min (25-27)
tests/unit/algorithms/test_utils.py (1)
nemo_rl/algorithms/utils.py (1)
  • calculate_baseline_and_std_per_prompt (51-129)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
  • normalize_advantages_with_epsilon (535-551)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (6)
tests/unit/algorithms/test_utils.py (1)

397-595: Comprehensive test coverage for baseline and std calculations.

The test suite thoroughly covers calculate_baseline_and_std_per_prompt with:

  • Basic functionality with multiple generations per prompt
  • Edge cases: single generation, identical rewards, mixed prompt sizes
  • Empty input handling
  • NaN value handling
  • CUDA compatibility
  • Numerical precision with extreme values

Well done on the comprehensive coverage!

tests/unit/algorithms/test_grpo.py (1)

1204-1275: Good test coverage for normalize_advantages_with_epsilon.

The test suite covers key scenarios:

  • Basic normalization with non-zero std
  • Zero std handling with epsilon fallback
  • All-zero std edge case
  • Various tensor shapes and batch sizes
  • Negative advantages

The tests validate the epsilon-based division approach for numerical stability.

nemo_rl/algorithms/grpo.py (4)

535-551: Well-designed utility function for stable advantage normalization.

The normalize_advantages_with_epsilon function:

  • Uses epsilon (default 1e-6) to prevent division by zero
  • Properly handles tensor shapes with unsqueeze(-1) for broadcasting
  • Has clear documentation with parameter descriptions
  • Replaces previous masking-based approaches with a simpler, more stable method

1033-1036: Verify advantages calculation occurs before normalization.

The call to normalize_advantages_with_epsilon at line 1033 uses advantages computed at line 1030. Ensure that the advantages tensor at this point contains the correct unnormalized advantages (rewards - baseline) before normalization.

Looking at line 1030, advantages = (rewards - baseline).unsqueeze(-1); this is correct.


1154-1157: Advantages metrics tracked for debugging.

The addition of advantages/mean, advantages/max, and advantages/min metrics provides valuable visibility into advantage distribution, which aligns with the PR objective to aid debugging of unstable runs.

Note: These metrics are computed after normalization (if enabled), so they reflect normalized advantages when normalize_rewards=True. Consider logging both unnormalized and normalized advantage stats for more complete debugging information.

If you want to track both unnormalized and normalized advantages, consider adding metrics before line 1033:

# Before normalization
metrics_before_norm = {
    "advantages_unnormalized/mean": torch.mean(advantages).detach().item(),
    "advantages_unnormalized/max": torch.max(advantages).detach().item(),
    "advantages_unnormalized/min": torch.min(advantages).detach().item(),
}

1880-1883: Consistent advantages tracking in async GRPO path.

The async GRPO path correctly:

  • Uses normalize_advantages_with_epsilon for normalization (lines 1880-1883)
  • Logs advantages statistics (lines 2018-2021)

This ensures feature parity between sync and async GRPO implementations.

Also applies to: 2018-2021

@ffrujeri ffrujeri removed the CI:L0 Run doctests and unit tests label Nov 1, 2025
terrykong
terrykong previously approved these changes Nov 4, 2025
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 4, 2025
@terrykong terrykong enabled auto-merge (squash) November 4, 2025 06:18
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch 2 times, most recently from 66d66d5 to 020d98c Compare November 11, 2025 16:57
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from 020d98c to 2884e30 Compare November 11, 2025 17:46
@ffrujeri ffrujeri added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 12, 2025
@ffrujeri ffrujeri added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Nov 13, 2025
@terrykong terrykong merged commit 7124e44 into main Nov 13, 2025
63 of 64 checks passed
@terrykong terrykong deleted the ffrujeri/grpo_improvements branch November 13, 2025 08:59
chtruong814 pushed a commit that referenced this pull request Nov 13, 2025
…#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
zpqiu pushed a commit to sharonyu-115/NeMo-RL that referenced this pull request Nov 17, 2025
…NVIDIA-NeMo#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…NVIDIA-NeMo#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L0 Run doctests and unit tests CI:L2 Run doctests, unit tests, functional tests, and convergence tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track advantages on the RL algorithms

2 participants