
feat: enhance advantages tracking and normalization stability in GRPO#1423

Merged
terrykong merged 5 commits into main from ffrujeri/grpo_improvements
Nov 13, 2025

Conversation


@ffrujeri ffrujeri commented Oct 24, 2025

What does this PR do ?

Improves advantages tracking and normalization in GRPO algorithms for better training stability and debugging.

Issues

Key Changes

1. Enhanced Advantage Tracking

  • Added metrics tracking for advantages/mean, advantages/max, and advantages/min
  • These metrics help with debugging unstable training runs by providing visibility into the advantage distribution

2. Improved Advantage Normalization

  • New function: normalize_advantages_with_epsilon(), which adds a small epsilon (1e-6) to the denominator to avoid division by zero, instead of masking
  • Replaces: the previous zero-standard-deviation masking approach in both sync and async GRPO
  • Benefits: more numerically stable and avoids potential issues with masked advantages
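A minimal sketch of what such an epsilon-based normalizer might look like. The function name matches the PR; the exact signature, tensor shapes, and broadcasting are assumptions based on the review comments, not the actual implementation:

```python
import torch

def normalize_advantages_with_epsilon(
    advantages: torch.Tensor,  # assumed shape (batch, seq_len)
    std: torch.Tensor,         # assumed per-sample std, shape (batch,)
    epsilon: float = 1e-6,
) -> torch.Tensor:
    # Adding epsilon to the denominator keeps the division finite even when
    # a prompt group's rewards are identical (std == 0), so no masking is needed.
    return advantages / (std.unsqueeze(-1) + epsilon)
```

With this shape of implementation, a zero-std group simply produces large but finite values rather than NaN/inf or masked-out entries.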

3. Enhanced Standard Deviation Calculation

  • Fixed: Potential numerical issues in calculate_baseline_and_std_per_prompt()
  • Added: More robust standard deviation computation with proper handling of edge cases
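For illustration, a per-prompt baseline/std computation has roughly this shape. This is a sketch only; the real calculate_baseline_and_std_per_prompt in nemo_rl/algorithms/utils.py differs in signature and details:

```python
import torch

def baseline_and_std_per_prompt(
    rewards: torch.Tensor,     # shape (num_samples,)
    prompt_ids: torch.Tensor,  # shape (num_samples,), groups samples by prompt
) -> tuple[torch.Tensor, torch.Tensor]:
    baseline = torch.zeros_like(rewards)
    std = torch.zeros_like(rewards)
    for pid in prompt_ids.unique():
        mask = prompt_ids == pid
        group = rewards[mask]
        baseline[mask] = group.mean()
        # Population std (unbiased=False) yields 0 rather than NaN for a
        # single-generation group, one of the edge cases mentioned above.
        std[mask] = group.std(unbiased=False)
    return baseline, std
```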

Usage

The advantage tracking metrics will automatically appear in your training logs when using GRPO. For example:

# These metrics will now be logged during training:
metrics = {
    "advantages/mean": 0.125,    # Mean advantage value
    "advantages/max": 2.456,     # Maximum advantage value  
    "advantages/min": -1.234,    # Minimum advantage value
    # ... other metrics
}
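For illustration, the logged values could be computed from the advantages tensor roughly like this (the tensor shape and .item() conversion are assumptions, not taken from the PR's code):

```python
import torch

advantages = torch.randn(8, 128)  # hypothetical (batch, seq_len) advantages tensor
metrics = {
    "advantages/mean": torch.mean(advantages).item(),
    "advantages/max": torch.max(advantages).item(),
    "advantages/min": torch.min(advantages).item(),
}
```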

Before your PR is "Ready for review"

Pre checks:

  • [x] Make sure you read and followed the Contributor guidelines
  • [x] Did you write any new necessary tests?
  • [x] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • [x] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

  • Debugging: The new advantages metrics are particularly helpful for debugging unstable runs in math training recipes
  • Numerical Stability: The epsilon-based normalization prevents division by zero errors that could occur with the previous masking approach
  • Backward Compatibility: All changes are backward compatible and don't affect existing configurations
  • Performance: No significant performance impact expected from the additional metric tracking

This improvement addresses stability issues that could affect tensor parallelism configurations in math training recipes, providing better observability and more robust normalization.


The comparison between baseline and treatment metrics can be seen in this report

In terms of perf time_per_step: [image]

In terms of training_rewards: [image]

Summary by CodeRabbit

  • Bug Fixes

    • Improved numerical stability in advantage normalization during training by introducing epsilon-based handling to prevent division-by-zero errors in edge cases.
  • Tests

    • Added comprehensive test coverage for advantage normalization functionality and baseline/standard deviation computations across various scenarios and tensor shapes.

@ffrujeri ffrujeri requested review from a team as code owners October 24, 2025 16:40
@ffrujeri ffrujeri marked this pull request as draft October 24, 2025 16:40
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from github-actions bot Oct 27, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from 5d83308 to 8558320 Compare October 27, 2025 23:11
@ffrujeri ffrujeri marked this pull request as ready for review October 27, 2025 23:11
@ffrujeri ffrujeri requested a review from a team as a code owner October 27, 2025 23:11
@ffrujeri ffrujeri changed the title Add grpo_improvements. feat: enhance advantage tracking and normalization stability in GRPO Oct 27, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from 8558320 to fc5f44b Compare October 27, 2025 23:36
@ffrujeri ffrujeri added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Oct 28, 2025
@ffrujeri ffrujeri changed the title feat: enhance advantage tracking and normalization stability in GRPO feat: enhance advantages tracking and normalization stability in GRPO Oct 28, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from fc5f44b to dee2d6f Compare October 28, 2025 22:58
@ffrujeri ffrujeri added CI:L0 Run doctests and unit tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Oct 30, 2025
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch 2 times, most recently from a01c57f to 0dcc5b8 Compare October 31, 2025 22:24

coderabbitai bot commented Oct 31, 2025

📝 Walkthrough

The PR adds a new normalize_advantages_with_epsilon utility function for stable advantage normalization, refactors std calculation in utils, extends advantage normalization across multiple training paths, and adds logging metrics for advantages statistics (mean, max, min) to support debugging of RL training stability.

Changes

Cohort / File(s) and Summary:

  • Core normalization and advantage tracking (nemo_rl/algorithms/grpo.py): Added the normalize_advantages_with_epsilon function for numerically stable advantage normalization; integrated it into grpo_train, async_grpo_train, and dynamic reward processing paths; added logging metrics for advantages statistics (mean, max, min) at multiple training checkpoints.
  • Baseline and std computation (nemo_rl/algorithms/utils.py): Modified std tensor initialization and computation; added per-prompt std calculation within the prompt loop, then globally recomputed std (overwriting per-prompt calculations); logic flow remains unchanged.
  • Test coverage for normalization (tests/unit/algorithms/test_grpo.py): Added five comprehensive test functions for normalize_advantages_with_epsilon: basic normalization, zero std handling, all-zero std, tensor shape variants, and negative advantages.
  • Test coverage for baseline utils (tests/unit/algorithms/test_utils.py): Added extensive tests for calculate_baseline_and_std_per_prompt covering multiple prompts, single generation, identical rewards, empty inputs, NaN handling, CUDA compatibility, and numerical precision edge cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • The per-prompt std calculation in utils.py is computed and then immediately overwritten by the global std calculation, indicating potential redundant or incomplete logic
    • Ensure advantage logging doesn't introduce performance overhead in high-frequency training loops
    • Verify epsilon value (1e-6) is appropriate across different reward scales and batch configurations
    • Cross-check that all training paths (grpo_train, async_grpo_train, dynamic processing) consistently apply the new normalization

Suggested reviewers

  • terrykong
  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes ⚠️ Warning
    Explanation: This PR introduces major changes that directly affect numerics in GRPO training: a new normalize_advantages_with_epsilon function replaces the previous masking logic, normalization is extended across multiple training paths (grpo_train, async_grpo_train, dynamic sampling), and the calculate_baseline_and_std_per_prompt computation is modified. Since these changes affect core numerical operations that impact training stability and convergence, the PR should document test results demonstrating no regression. The PR includes comprehensive unit test coverage with 122 total assertions across 13 new test functions covering edge cases and numerical precision; however, the PR description only states "local unit and functional tests run" without providing actual test results, CI logs, performance benchmarks, or convergence comparisons. The automated CI/CD (L0 unit tests) will execute, but the check requires that test results be documented in the PR description itself for changes affecting numerics.
    Resolution: To pass this check, update the PR description to include one of the following: (1) explicit test results or a link to the CI/CD execution logs showing all unit, functional, and convergence tests pass; (2) convergence benchmarks or training stability metrics demonstrating no regression from the numerical changes; or (3) performance comparison data before and after the changes in relevant configurations. Given that this PR modifies core numerical operations affecting training stability and convergence (as indicated in the linked issue #1395), documentation of test results within the PR description is necessary to meet the standard for major changes affecting numerics.
✅ Passed checks (5 passed)

  • Title Check ✅ Passed: The pull request title "feat: enhance advantages tracking and normalization stability in GRPO" directly and accurately reflects the main changes in the changeset. It succinctly captures the two primary objectives: adding advantages statistics tracking (mean, max, min) and improving normalization stability through the new normalize_advantages_with_epsilon function. The title is specific, concise, and avoids vague or generic language.
  • Linked Issues Check ✅ Passed: The code changes directly address the requirement from linked issue #1395 to track advantages on RL algorithms. The implementation adds logging metrics for advantages/mean, advantages/max, and advantages/min across multiple GRPO training paths (grpo_train, async_grpo_train, and dynamic sampling branches), providing the requested visibility into advantage distribution for debugging unstable runs. Additionally, the new normalize_advantages_with_epsilon function and enhanced standard deviation computation in utils.py address the normalization stability improvements mentioned in the PR objectives.
  • Out of Scope Changes Check ✅ Passed: All changes are directly aligned with the stated objectives. The modifications to nemo_rl/algorithms/grpo.py add the required metrics tracking and the new normalization function, the modifications to nemo_rl/algorithms/utils.py improve the underlying standard deviation computation for numerical robustness, and the test additions provide coverage for both new and modified functions. No out-of-scope changes are evident.
  • Docstring Coverage ✅ Passed: Docstring coverage is 95.00%, which exceeds the required threshold of 80.00%.
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 855151b and 0dcc5b8.

📒 Files selected for processing (4)
  • nemo_rl/algorithms/grpo.py (5 hunks)
  • nemo_rl/algorithms/utils.py (2 hunks)
  • tests/unit/algorithms/test_grpo.py (2 hunks)
  • tests/unit/algorithms/test_utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • tests/unit/algorithms/test_utils.py
  • tests/unit/algorithms/test_grpo.py
  • nemo_rl/algorithms/utils.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/utils.py
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (3)
  • mean (52-97)
  • max (30-32)
  • min (25-27)
tests/unit/algorithms/test_utils.py (1)
nemo_rl/algorithms/utils.py (1)
  • calculate_baseline_and_std_per_prompt (51-129)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
  • normalize_advantages_with_epsilon (535-551)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (6)
tests/unit/algorithms/test_utils.py (1)

397-595: Comprehensive test coverage for baseline and std calculations.

The test suite thoroughly covers calculate_baseline_and_std_per_prompt with:

  • Basic functionality with multiple generations per prompt
  • Edge cases: single generation, identical rewards, mixed prompt sizes
  • Empty input handling
  • NaN value handling
  • CUDA compatibility
  • Numerical precision with extreme values

Well done on the comprehensive coverage!

tests/unit/algorithms/test_grpo.py (1)

1204-1275: Good test coverage for normalize_advantages_with_epsilon.

The test suite covers key scenarios:

  • Basic normalization with non-zero std
  • Zero std handling with epsilon fallback
  • All-zero std edge case
  • Various tensor shapes and batch sizes
  • Negative advantages

The tests validate the epsilon-based division approach for numerical stability.

nemo_rl/algorithms/grpo.py (4)

535-551: Well-designed utility function for stable advantage normalization.

The normalize_advantages_with_epsilon function:

  • Uses epsilon (default 1e-6) to prevent division by zero
  • Properly handles tensor shapes with unsqueeze(-1) for broadcasting
  • Has clear documentation with parameter descriptions
  • Replaces previous masking-based approaches with a simpler, more stable method

1033-1036: Verify advantages calculation occurs before normalization.

The call to normalize_advantages_with_epsilon at line 1033 uses advantages computed at line 1030. Ensure that the advantages tensor at this point contains the correct unnormalized advantages (rewards - baseline) before normalization.

Looking at line 1030, advantages = (rewards - baseline).unsqueeze(-1); this is correct.


1154-1157: Advantages metrics tracked for debugging.

The addition of advantages/mean, advantages/max, and advantages/min metrics provides valuable visibility into advantage distribution, which aligns with the PR objective to aid debugging of unstable runs.

Note: These metrics are computed after normalization (if enabled), so they reflect normalized advantages when normalize_rewards=True. Consider logging both unnormalized and normalized advantage stats for more complete debugging information.

If you want to track both unnormalized and normalized advantages, consider adding metrics before line 1033:

# Before normalization
metrics_before_norm = {
    "advantages_unnormalized/mean": torch.mean(advantages).detach().item(),
    "advantages_unnormalized/max": torch.max(advantages).detach().item(),
    "advantages_unnormalized/min": torch.min(advantages).detach().item(),
}

1880-1883: Consistent advantages tracking in async GRPO path.

The async GRPO path correctly:

  • Uses normalize_advantages_with_epsilon for normalization (lines 1880-1883)
  • Logs advantages statistics (lines 2018-2021)

This ensures feature parity between sync and async GRPO implementations.

Also applies to: 2018-2021

@ffrujeri ffrujeri removed the CI:L0 Run doctests and unit tests label Nov 1, 2025
terrykong
terrykong previously approved these changes Nov 4, 2025
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 4, 2025
@terrykong terrykong enabled auto-merge (squash) November 4, 2025 06:18
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch 2 times, most recently from 66d66d5 to 020d98c Compare November 11, 2025 16:57
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/grpo_improvements branch from 020d98c to 2884e30 Compare November 11, 2025 17:46
@ffrujeri ffrujeri added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 12, 2025
@ffrujeri ffrujeri added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Nov 13, 2025
@terrykong terrykong merged commit 7124e44 into main Nov 13, 2025
63 of 64 checks passed
@terrykong terrykong deleted the ffrujeri/grpo_improvements branch November 13, 2025 08:59
chtruong814 pushed a commit that referenced this pull request Nov 13, 2025
…#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
zpqiu pushed a commit to sharonyu-115/NeMo-RL that referenced this pull request Nov 17, 2025
…NVIDIA-NeMo#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…NVIDIA-NeMo#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L0 Run doctests and unit tests CI:L2 Run doctests, unit tests, functional tests, and convergence tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track advantages on the RL algorithms

2 participants