Skip to content

fix: Fix the logger error in non-colocated sync-grpo code path#1355

Merged
terrykong merged 1 commit intoNVIDIA-NeMo:mainfrom
youngeunkwon0405:fix-perf-log
Oct 16, 2025
Merged

fix: Fix the logger error in non-colocated sync-grpo code path#1355
terrykong merged 1 commit intoNVIDIA-NeMo:mainfrom
youngeunkwon0405:fix-perf-log

Conversation

@youngeunkwon0405
Copy link
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Oct 14, 2025

What does this PR do ?

There was a bug when using non-colocated but sync-grpo code path.
metric['per_worker_token_counts'] is a dict type which cannot be logged in wandb.
The existing code was deleting this only for async-grpo training loop.
This PR moves the deletion code to the perf metric function, which could affect all the use cases.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Improvements

    • Expanded logged metrics to include per-worker token counts across training, performance, and timing views.
    • Improved clarity and completeness of reported metrics during runs.
  • Bug Fixes

    • Prevented errors during metric visualization by automatically removing incompatible entries before rendering.
    • Increased robustness of logging with third-party integrations to reduce interruptions during runs.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 self-assigned this Oct 14, 2025
@youngeunkwon0405 youngeunkwon0405 requested a review from a team as a code owner October 14, 2025 17:53
@youngeunkwon0405 youngeunkwon0405 added the CI:L0 Run doctests and unit tests label Oct 14, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 14, 2025

📝 Walkthrough

Walkthrough

Removed metrics filtering in async_grpo_train so per_worker_token_counts remains in logged metrics. Added cleanup in visualize_performance_metrics to drop per_worker_token_counts from the metrics dict before WandB logging to prevent errors. No changes to public APIs.

Changes

Cohort / File(s) Summary
GRPO training metrics handling
nemo_rl/algorithms/grpo.py
Stopped removing per_worker_token_counts from metrics before logging; it now propagates through performance/train/timing logs.
Metrics visualization cleanup
nemo_rl/algorithms/utils.py
In visualize_performance_metrics, added a step to delete per_worker_token_counts from metrics (if present) before constructing performance_metrics to avoid WandB logging errors.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Trainer as async_grpo_train
  participant Logger as Logger/WandB
  participant Utils as visualize_performance_metrics

  Trainer->>Trainer: Build metrics (includes per_worker_token_counts)
  Note right of Trainer: No filtering in async_grpo_train
  Trainer->>Logger: Log performance/train/timing metrics

  Trainer->>Utils: visualize_performance_metrics(metrics)
  Utils->>Utils: Remove per_worker_token_counts from metrics
  Utils->>Logger: Log performance_metrics (WandB-safe)
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • feat: Adding perf metrics #1183 — Adjusts handling/logging of per_worker_token_counts in GRPO metrics pipeline; overlaps with this PR’s metrics cleanup and logging behavior.

Suggested labels

Performance

Suggested reviewers

  • parthchadha

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title concisely states the fix for the logger error in the non-colocated sync-GRPO code path, directly reflecting the primary change in this pull request. It is specific and clear enough for team members to understand the purpose without additional context.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed This PR only modifies the metrics logging to remove a dict‐valued entry to prevent wandb errors and does not introduce new features, change convergence, or affect performance numbers, so test results are not required for this minor bugfix.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_rl/algorithms/utils.py (1)

556-564: LGTM! Consider optional idiom improvement.

The cleanup correctly prevents WandB logging errors by removing the dict-valued metric after it's been processed.

Optional: As suggested by Ruff, use .pop() for slightly more idiomatic code:

-    # Clean up metrics to avoid wandb logging errors
-    # Dict structures cannot be logged to wandb
-    if "per_worker_token_counts" in metrics:
-        del metrics["per_worker_token_counts"]
+    # Clean up metrics to avoid wandb logging errors
+    # Dict structures cannot be logged to wandb
+    metrics.pop("per_worker_token_counts", None)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 355aa98 and 77ac82e.

📒 Files selected for processing (2)
  • nemo_rl/algorithms/grpo.py (0 hunks)
  • nemo_rl/algorithms/utils.py (1 hunks)
💤 Files with no reviewable changes (1)
  • nemo_rl/algorithms/grpo.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/utils.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/utils.py
🪛 Ruff (0.14.0)
nemo_rl/algorithms/utils.py

563-563: Use pop instead of key in dict followed by del dict[key]

Replace if statement with .pop(..., None)

(RUF051)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

@terrykong terrykong added r0.4.0 CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 14, 2025
@terrykong terrykong enabled auto-merge (squash) October 14, 2025 17:59
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 15, 2025
@terrykong terrykong merged commit 96656c3 into NVIDIA-NeMo:main Oct 16, 2025
98 of 104 checks passed
chtruong814 pushed a commit that referenced this pull request Oct 16, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…A-NeMo#1355)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants