feat: enhance advantages tracking and normalization stability in GRPO#1423
feat: enhance advantages tracking and normalization stability in GRPO#1423
Conversation
5d83308 to
8558320
Compare
8558320 to
fc5f44b
Compare
fc5f44b to
dee2d6f
Compare
a01c57f to
0dcc5b8
Compare
📝 WalkthroughWalkthroughThe PR adds a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
nemo_rl/algorithms/grpo.py(5 hunks)nemo_rl/algorithms/utils.py(2 hunks)tests/unit/algorithms/test_grpo.py(2 hunks)tests/unit/algorithms/test_utils.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/algorithms/grpo.pytests/unit/algorithms/test_utils.pytests/unit/algorithms/test_grpo.pynemo_rl/algorithms/utils.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/algorithms/grpo.pynemo_rl/algorithms/utils.py
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (3)
mean(52-97)max(30-32)min(25-27)
tests/unit/algorithms/test_utils.py (1)
nemo_rl/algorithms/utils.py (1)
calculate_baseline_and_std_per_prompt(51-129)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
normalize_advantages_with_epsilon(535-551)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (6)
tests/unit/algorithms/test_utils.py (1)
397-595: Comprehensive test coverage for baseline and std calculations.The test suite thoroughly covers
calculate_baseline_and_std_per_promptwith:
- Basic functionality with multiple generations per prompt
- Edge cases: single generation, identical rewards, mixed prompt sizes
- Empty input handling
- NaN value handling
- CUDA compatibility
- Numerical precision with extreme values
Well done on the comprehensive coverage!
tests/unit/algorithms/test_grpo.py (1)
1204-1275: Good test coverage for normalize_advantages_with_epsilon.The test suite covers key scenarios:
- Basic normalization with non-zero std
- Zero std handling with epsilon fallback
- All-zero std edge case
- Various tensor shapes and batch sizes
- Negative advantages
The tests validate the epsilon-based division approach for numerical stability.
nemo_rl/algorithms/grpo.py (4)
535-551: Well-designed utility function for stable advantage normalization.The
normalize_advantages_with_epsilonfunction:
- Uses epsilon (default 1e-6) to prevent division by zero
- Properly handles tensor shapes with
unsqueeze(-1)for broadcasting- Has clear documentation with parameter descriptions
- Replaces previous masking-based approaches with a simpler, more stable method
1033-1036: Verify advantages calculation occurs before normalization.The call to
normalize_advantages_with_epsilonat line 1033 usesadvantagescomputed at line 1030. Ensure that theadvantagestensor at this point contains the correct unnormalized advantages (rewards - baseline) before normalization.Looking at line 1030:
advantages = (rewards - baseline).unsqueeze(-1)- this is correct.
1154-1157: Advantages metrics tracked for debugging.The addition of
advantages/mean,advantages/max, andadvantages/minmetrics provides valuable visibility into advantage distribution, which aligns with the PR objective to aid debugging of unstable runs.Note: These metrics are computed after normalization (if enabled), so they reflect normalized advantages when
normalize_rewards=True. Consider logging both unnormalized and normalized advantage stats for more complete debugging information.If you want to track both unnormalized and normalized advantages, consider adding metrics before line 1033:
# Before normalization metrics_before_norm = { "advantages_unnormalized/mean": torch.mean(advantages).detach().item(), "advantages_unnormalized/max": torch.max(advantages).detach().item(), "advantages_unnormalized/min": torch.min(advantages).detach().item(), }
1880-1883: Consistent advantages tracking in async GRPO path.The async GRPO path correctly:
- Uses
normalize_advantages_with_epsilonfor normalization (lines 1880-1883)- Logs advantages statistics (lines 2018-2021)
This ensures feature parity between sync and async GRPO implementations.
Also applies to: 2018-2021
66d66d5 to
020d98c
Compare
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
020d98c to
2884e30
Compare
…#1423) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
…NVIDIA-NeMo#1423) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
…NVIDIA-NeMo#1423) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
…NVIDIA-NeMo#1423) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
…NVIDIA-NeMo#1423) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
Improves advantages tracking and normalization in GRPO algorithms for better training stability and debugging.
Issues
Key Changes
1. Enhanced Advantage Tracking
Added metrics tracking for
advantages/mean,advantages/max, andadvantages/min2. Improved Advantage Normalization
normalize_advantages_with_epsilon()that uses epsilon (1e-6) to avoid division by zero instead of masking3. Enhanced Standard Deviation Calculation
calculate_baseline_and_std_per_prompt()Usage
The advantage tracking metrics will automatically appear in your training logs when using GRPO. For example:
Before your PR is "Ready for review"
Pre checks:
Additional Information
This improvement addresses stability issues that could affect tensor parallelism configurations in math training recipes, providing better observability and more robust normalization.
The comparison between baseline and treatment metrics can be seen in this report
In terms of perf time_per_step:

In terms of training_rewards:

Summary by CodeRabbit
Bug Fixes
Tests