
feat: NeMo Gym refresh 20260113 #1773

Merged
yfw merged 10 commits into main from bxyu/nemo-gym-refresh-20260113
Jan 18, 2026

Conversation

@bxyu-nvidia
Contributor

@bxyu-nvidia bxyu-nvidia commented Jan 14, 2026

What does this PR do ?

  1. Refactor Gym workspace from stub module to true Gym repo
    1. 3rdparty/Gym-workspace/is_nemo_gym_installed.py
    2. 3rdparty/Gym-workspace/pyproject.toml
    3. 3rdparty/Gym-workspace/setup.py
    4. pyproject.toml
    5. uv.lock
  2. Clean remaining "Penguin" references
    1. docs/design-docs/dependency-management.md
  3. Clean and update Gym config and launch files
    1. examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
    2. examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
    3. examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
    4. examples/nemo_gym/launch_nemo_gym_multinode_training.sh
    5. nemo_rl/environments/nemo_gym.py
  4. O(n log n) bin packing algorithm - significantly speeds up jobs with larger rollout batch sizes
    1. nemo_rl/data/packing/algorithms.py
  5. Fix off-by-one logging problem
    1. step_finished logic in nemo_rl/utils/logger.py and nemo_rl/algorithms/grpo.py
  6. Upload raylet.out and raylet.err files to W&B
    1. raylet logic in nemo_rl/utils/logger.py
  7. Support for logging responses to disk. Saves several minutes for jobs with larger rollout batch sizes
    1. log_string_list_as_jsonl logic in nemo_rl/utils/logger.py
  8. CPU and Ray memory tracking util
    1. nemo_rl/utils/memory_tracker.py
    2. nemo_rl/algorithms/grpo.py
  9. Opt-out NeMo Gym logging to W&B
    1. should_log_nemo_gym_responses logic in nemo_rl/algorithms/grpo.py
  10. Opt-in skip reference logprobs calculation when KL penalty is 0
    1. skip_reference_policy_logprobs_calculation logic in nemo_rl/algorithms/grpo.py
    2. nemo_rl/algorithms/loss_functions.py
  11. Support advantage calculation on GPU
    1. calculate_advantages_on_gpu in nemo_rl/algorithms/grpo.py
  12. Reduce CPU usage in GRPO loop
    1. del * variable deletions in nemo_rl/algorithms/grpo.py
  13. Log mixed reward
    1. baseline_reward/pct_0 related logic in nemo_rl/algorithms/grpo.py
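Several of the items above (advantage calculation on GPU, mixed-reward/baseline logging) revolve around the per-prompt baseline used by GRPO. As a rough illustration, a leave-one-out baseline for a group of rollouts from the same prompt can be sketched in plain Python; this is a hedged sketch of the math only, not the repo's `calculate_baseline_and_std_per_prompt` implementation.

```python
def leave_one_out_baseline(rewards: list[float]) -> list[float]:
    """For each sample in a prompt group, the baseline is the mean reward
    of the *other* samples in the group (leave-one-out). Sketch only."""
    n = len(rewards)
    total = sum(rewards)
    if n < 2:
        return [0.0] * n  # no peers to average over
    return [(total - r) / (n - 1) for r in rewards]

rewards = [1.0, 0.0, 0.5]
baselines = leave_one_out_baseline(rewards)
advantages = [r - b for r, b in zip(rewards, baselines)]
print(baselines)   # [0.25, 0.75, 0.5]
print(advantages)  # [0.75, -0.75, 0.0]
```

The same arithmetic runs identically on CPU or GPU tensors; moving it to GPU mainly avoids a device round-trip for large rollout batches.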

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • New Features

    • GRPO algorithm now supports GPU-based advantage calculations and optional response logging.
    • Added memory tracking instrumentation for training pipeline stages.
    • New JSONL logging capability for structured metrics output.
    • Enhanced multinode training script with SLURM node count configuration.
    • New Qwen3 30B GRPO configuration example.
  • Documentation

    • Updated dependency management guide.


@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 14, 2026
@terrykong terrykong requested a review from yfw January 14, 2026 01:09
@bxyu-nvidia bxyu-nvidia changed the title from "NeMo Gym refresh 20260113" to "feat: NeMo Gym refresh 20260113" Jan 14, 2026
@bxyu-nvidia bxyu-nvidia marked this pull request as ready for review January 15, 2026 23:36
@bxyu-nvidia bxyu-nvidia requested review from a team as code owners January 15, 2026 23:37
Signed-off-by: Brian Yu <bxyu@nvidia.com>
@bxyu-nvidia bxyu-nvidia force-pushed the bxyu/nemo-gym-refresh-20260113 branch from f7a293f to 1a3982f January 15, 2026 23:41
@coderabbitai
Contributor

coderabbitai bot commented Jan 15, 2026

📝 Walkthrough

Walkthrough

This PR removes the Gym-workspace build infrastructure (setup.py, pyproject.toml, installation check script), refactors GRPO training with memory tracking instrumentation and improved logging, introduces a MemoryTracker utility for memory profiling, optimizes FFD bin packing from O(n·m) to O(n log n), and updates workspace and example configurations accordingly.

Changes

Cohort / File(s) Summary
Gym Workspace Removal
3rdparty/Gym-workspace/is_nemo_gym_installed.py, 3rdparty/Gym-workspace/pyproject.toml, 3rdparty/Gym-workspace/setup.py
Deletes packaging infrastructure: removes install check script, project metadata configuration, and setuptools setup script with dependency validation logic.
GRPO Training Enhancement
nemo_rl/algorithms/grpo.py
Adds MemoryTracker integration throughout training/validation flow with per-stage memory snapshots. Introduces two new config fields (skip_reference_policy_logprobs_calculation, calculate_advantages_on_gpu), helper functions for conditional NeMo-Gym logging and mixed rewards/advantages logging, and optional GPU-based advantage calculation with memory instrumentation.
Loss Function Optimization
nemo_rl/algorithms/loss_functions.py
Guards extraction of reference_policy_logprobs to only retrieve when KL penalty is non-zero, preventing unnecessary data preparation.
Packing Algorithm Optimization
nemo_rl/data/packing/algorithms.py
Replaces the O(n·m) FFD leftovers packing with an O(n log n) binary-search insertion strategy using bisect, improving efficiency for the bin-packing leftovers phase.
Utilities & Infrastructure
nemo_rl/utils/memory_tracker.py, nemo_rl/utils/logger.py, nemo_rl/environments/nemo_gym.py
Introduces MemoryTracker module with memory profiling per-stage. Extends Logger with step_finished parameter and new log_string_list_as_jsonl() method for JSONL file logging. Adds runtime type assertion for nemo_gym_result validation.
Example Configurations
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml, examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml, examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml, examples/nemo_gym/launch_nemo_gym_multinode_training.sh
Replaces one GRPO config with a new Qwen-3-30B-A3B Instruct config. Updates workplace assistant config with code_gen mapping and trajectory collection flag. Enhances launch script with HF_TOKEN and NUM_SLURM_NODES parameters.
Configuration & Documentation
pyproject.toml, docs/design-docs/dependency-management.md
Updates workspace member path from 3rdparty/Gym-workspace to 3rdparty/Gym-workspace/Gym. Updates documentation references replacing "Penguin" with "NeMo Gym".

Sequence Diagram(s)

sequenceDiagram
    participant Training Loop
    participant MemoryTracker
    participant GRPO Logic
    participant Logger
    participant GPU Memory

    Training Loop->>MemoryTracker: snapshot_start_of_stage("Initial validation")
    MemoryTracker->>GPU Memory: Query current RSS memory
    MemoryTracker->>Logger: Print memory snapshot

    Training Loop->>GRPO Logic: Run validation
    
    Training Loop->>MemoryTracker: snapshot_start_of_stage("Preparing batch")
    MemoryTracker->>GPU Memory: Capture memory after validation
    MemoryTracker->>Logger: Log stage delta

    Training Loop->>GRPO Logic: Generate sequences
    Training Loop->>MemoryTracker: snapshot_start_of_stage("Generation")
    
    alt calculate_advantages_on_gpu enabled
        Training Loop->>GPU Memory: Calculate advantages on GPU
    else
        Training Loop->>GRPO Logic: Calculate advantages on CPU
    end

    Training Loop->>GRPO Logic: Compute logprobs
    
    alt skip_reference_policy_logprobs_calculation disabled
        GRPO Logic->>GRPO Logic: Extract reference policy logprobs
    end

    Training Loop->>GRPO Logic: Policy training
    Training Loop->>Logger: Log metrics with step_finished flag
    Logger->>Logger: Commit step if needed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • yfw
  • terrykong
  • chtruong814
🚥 Pre-merge checks: 2 passed ✅, 2 warnings ❌

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 62.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes ⚠️ Warning: The PR contains major algorithmic and feature changes but lacks documented test results, performance benchmarks, or convergence validation in the description. Resolution: add performance benchmarks, convergence verification, edge-case test coverage, and algorithm-equivalence validation to the PR description.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title 'feat: NeMo Gym refresh 20260113' accurately reflects the main objective of refreshing and refactoring the NeMo Gym integration, though it lacks specificity about the scope of changes.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pyproject.toml (1)

176-182: The workspace member path is incomplete—pyproject.toml is missing.

Line 181 references 3rdparty/Gym-workspace/Gym as a workspace member, but this directory does not contain a pyproject.toml file. uv workspace members require a pyproject.toml at their root to be resolved. This will cause workspace resolution to fail.

nemo_rl/utils/logger.py (1)

926-938: Update test mock log_metrics signature to include the step_finished parameter.

The test mock at tests/unit/utils/test_logger.py:746 has signature def log_metrics(self, metrics, step, prefix="", step_metric=None) but the implementation at line 938 calls logger.log_metrics(metrics, step, prefix, step_metric, step_finished) with five positional arguments. This mismatch will cause a TypeError at runtime when tests execute.

🤖 Fix all issues with AI agents
In `@examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml`:
- Around line 26-27: The comment above the token_level_loss setting contains a
hardcoded date ("Mon Oct 13") that will become stale; update the comment near
token_level_loss to use a general phrasing (e.g., "token-level loss not yet
supported in NeMo RL as of this writing" or remove the date entirely) so it
doesn't reference a specific calendar date and remains accurate as the code
evolves.

In `@nemo_rl/algorithms/grpo.py`:
- Around line 908-915: The helper _should_log_nemo_gym_responses currently
treats a missing env flag as False; change it to opt out only when the flag is
explicitly False and also require NeMo-Gym is in use: compute env_val =
env_config.get("should_log_nemo_gym_responses") and return
_should_use_nemo_gym(master_config) and (env_val is not False) so missing key
yields logging enabled; apply the same explicit False-check +
_should_use_nemo_gym(...) gating to the other occurrences of this pattern (the
similar blocks found elsewhere in this file).
- Around line 1292-1322: The GPU path hard-codes device_id = 0 which will fail
on CPU-only or different-GPU setups; update the block that calls
calculate_baseline_and_std_per_prompt (the branch gated by
master_config["grpo"].get("calculate_advantages_on_gpu")) to first check
torch.cuda.is_available() and otherwise fall back to CPU, and when CUDA is
available infer the target device from existing tensors (e.g., use
input_ids.device or rewards.device or set device = torch.device("cuda") if
available) and move tensors with .to(device) instead of using .cuda(0); keep the
leave_one_out_baseline logic and ensure baseline/std are moved back to CPU with
.cpu() as before if downstream expects CPU tensors.

In `@nemo_rl/data/packing/algorithms.py`:
- Around line 631-648: The current initialization of ffd_bins and ffd_bin_sizes
as [[]] and [0] causes an empty bin to be emitted when leftovers is empty;
change initialization to ffd_bins = [] and ffd_bin_sizes = [] and ensure the
loop that processes leftovers (iterating over sorted(leftovers, ...)) correctly
creates the first bin when needed (using the same new_bin/new_bin_size logic and
bisect insertion into ffd_bin_sizes) so no pre-existing empty bin is appended to
bins when leftovers is empty; update any code that assumed a starting bin to
handle empty lists accordingly (references: leftovers, ffd_bins, ffd_bin_sizes,
bisect insertion).
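The binary-search strategy the comment refers to can be illustrated with a best-fit-decreasing variant: keep the open bins' loads in a sorted list and use `bisect` to find, in O(log n), the fullest bin that still has room. This is a self-contained sketch, not the repo's `algorithms.py` code; note it also starts with empty lists, so empty `leftovers` yields no spurious empty bin.

```python
from bisect import bisect_right

def pack_leftovers(leftovers: list[int], bin_capacity: int) -> list[list[int]]:
    """Best-fit-decreasing over a sorted list of bin loads (sketch only).

    Instead of scanning every open bin for each item (O(n*m)), binary-search
    the sorted loads for the fullest bin that still fits the item, then
    re-insert the updated load, for O(n log n) overall."""
    bins: list[list[int]] = []
    loads: list[int] = []          # bin loads, kept sorted ascending
    index_of_load: list[int] = []  # bin index parallel to `loads`

    for item in sorted(leftovers, reverse=True):
        # Rightmost load that still leaves room for `item`.
        pos = bisect_right(loads, bin_capacity - item) - 1
        if pos >= 0:
            bin_idx = index_of_load.pop(pos)
            new_load = loads.pop(pos) + item
        else:  # no open bin fits: open a new one
            bin_idx = len(bins)
            bins.append([])
            new_load = item
        bins[bin_idx].append(item)
        # Re-insert, keeping `loads` sorted.
        pos2 = bisect_right(loads, new_load)
        loads.insert(pos2, new_load)
        index_of_load.insert(pos2, bin_idx)
    return bins

print(pack_leftovers([5, 3, 4, 2, 1], bin_capacity=6))  # [[5, 1], [4, 2], [3]]
```

`list.insert` is itself O(n), so strictly this is O(n²) worst case on the insert step; in practice the bisect lookup dominates, and a balanced structure would make it truly O(n log n).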

In `@nemo_rl/utils/memory_tracker.py`:
- Around line 30-40: The mem_used_diff_gb and new_variables properties can raise
TypeError when accessed while memory_used_after_stage_gb or
variables_after_stage are None; in the mem_used_diff_gb property (and in
new_variables) add defensive checks for None (e.g., if
self.memory_used_after_stage_gb is None or self.memory_used_before_stage_gb is
None) and either return a safe default (0.0 for mem diff, [] for new_variables)
or raise a clear ValueError explaining the field is unset; update the property
implementations in nemo_rl.utils.memory_tracker (mem_used_diff_gb and
new_variables) so callers of get_snapshot_str or direct property access get a
predictable error/return instead of a TypeError.
🧹 Nitpick comments (8)
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (3)

1-1: Missing shebang line.

Per Google Shell Style Guide, shell scripts should start with a shebang specifying the interpreter. This ensures consistent execution regardless of how the script is invoked.

Proposed fix
+#!/bin/bash
 # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

1-1: Update copyright year to 2026.

As per coding guidelines, the NVIDIA copyright header should include the current year.

Proposed fix
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

19-19: Quote variable to prevent word splitting.

If REPO_LOCATION contains spaces or special characters, this cd command will fail. Per Google Shell Style Guide, always quote variables unless you specifically need word splitting.

Proposed fix
-cd $REPO_LOCATION
+cd "$REPO_LOCATION"
nemo_rl/data/packing/algorithms.py (1)

616-628: Consider removing or properly documenting the old implementation.

The old O(n*m) implementation is preserved as a triple-quoted string, which is unconventional. Per coding guidelines, commented-out code should include a comment explaining why it's kept.

Consider either:

  1. Removing the old code entirely (it's in git history if needed)
  2. Converting to a proper comment with explanation of why it's preserved
nemo_rl/utils/memory_tracker.py (1)

54-59: Consider declaring _process as a Pydantic private attribute.

The _process attribute is assigned in model_post_init but not declared as a Pydantic PrivateAttr. While this works because Pydantic ignores underscore-prefixed attributes by default, explicitly declaring it improves clarity and ensures proper handling.

💡 Suggested improvement
+from pydantic import BaseModel, Field, PrivateAttr
+
 class MemoryTracker(BaseModel):
     data_points: List[MemoryTrackerDataPoint] = Field(default_factory=list)
+    _process: Process = PrivateAttr()

     def model_post_init(self, context):
         self._process = Process(os.getpid())
         return super().model_post_init(context)
nemo_rl/utils/logger.py (1)

975-991: Method name may be misleading - no JSON validation performed.

The method log_string_list_as_jsonl writes raw strings to a file without validating they are valid JSON. The caller is responsible for providing properly formatted JSON strings. Consider either:

  1. Adding JSON validation/serialization
  2. Renaming to log_string_list_to_file or adding a docstring note that strings must be pre-formatted as JSON
💡 Suggested docstring clarification
     def log_string_list_as_jsonl(self, to_log: list[str], filename: str) -> None:
         """Log a list of strings to a JSONL file.

         Args:
             to_log: list of strings to log
+                    Note: Each string should be a valid JSON object.
+                    No validation is performed.
             filename: Filename to log to (within the log directory)
         """
nemo_rl/algorithms/grpo.py (2)

128-145: Document new GRPOConfig fields + explicit defaults.

Please add purpose/valid values/defaults for skip_reference_policy_logprobs_calculation and calculate_advantages_on_gpu, and surface explicit defaults in exemplar YAMLs to avoid implicit code defaults. As per coding guidelines, please document new config keys and defaults.


1577-1580: Optional: avoid noisy per‑metric prints.

The new “Skipping aggregation…” prints will fire for every scalar/dict metric each step. Consider gating behind a debug flag or handling scalars explicitly to keep logs clean.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0b1a91 and f7a293f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • 3rdparty/Gym-workspace/is_nemo_gym_installed.py
  • 3rdparty/Gym-workspace/pyproject.toml
  • 3rdparty/Gym-workspace/setup.py
  • docs/design-docs/dependency-management.md
  • examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/data/packing/algorithms.py
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/utils/logger.py
  • nemo_rl/utils/memory_tracker.py
  • pyproject.toml
💤 Files with no reviewable changes (4)
  • 3rdparty/Gym-workspace/pyproject.toml
  • 3rdparty/Gym-workspace/setup.py
  • 3rdparty/Gym-workspace/is_nemo_gym_installed.py
  • examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
🧰 Additional context used
📓 Path-based instructions (6)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
  • docs/design-docs/dependency-management.md
  • pyproject.toml
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/utils/logger.py
  • nemo_rl/data/packing/algorithms.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/utils/logger.py
  • nemo_rl/data/packing/algorithms.py
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section

Files:

  • docs/design-docs/dependency-management.md
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/utils/logger.py
  • nemo_rl/data/packing/algorithms.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/utils/logger.py
  • nemo_rl/data/packing/algorithms.py
🧠 Learnings (5)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).

Applied to files:

  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging

Applied to files:

  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (4)
nemo_rl/utils/memory_tracker.py (2)
  • MemoryTracker (54-80)
  • snapshot_start_of_stage (61-80)
nemo_rl/algorithms/utils.py (1)
  • calculate_baseline_and_std_per_prompt (79-156)
nemo_rl/models/policy/interfaces.py (2)
  • get_logprobs (54-66)
  • get_reference_policy_logprobs (69-83)
nemo_rl/distributed/batched_data_dict.py (1)
  • BatchedDataDict (75-860)
nemo_rl/utils/logger.py (1)
tests/unit/utils/test_logger.py (1)
  • log_metrics (746-750)
nemo_rl/data/packing/algorithms.py (1)
tests/unit/data/packing/test_algorithms.py (1)
  • bin_capacity (63-65)
🪛 Ruff (0.14.11)
nemo_rl/utils/logger.py

148-148: Unused method argument: step_finished

(ARG002)


209-209: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.out"

(S108)


210-210: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.err"

(S108)


422-422: Unused method argument: step_finished

(ARG002)


800-800: Unused method argument: step_finished

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (21)
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (2)

40-46: LGTM on the optional NUM_SLURM_NODES parameter.

The parameter expansion ${NUM_SLURM_NODES:-$NUM_ACTOR_NODES} correctly defaults to NUM_ACTOR_NODES when NUM_SLURM_NODES is unset or empty. This is a clean way to make the parameter optional while maintaining backward compatibility.


25-27: HF_TOKEN integration looks correct.

The token is properly propagated to the environment for the uv run python command. Using uv run follows the coding guidelines.

docs/design-docs/dependency-management.md (1)

164-196: LGTM — naming alignment looks correct.
No concerns with the updated NeMo Gym references.

nemo_rl/environments/nemo_gym.py (1)

151-154: LGTM! Early input validation with informative error message.

The type assertion ensures that malformed or error responses from NeMo Gym are caught early with a clear, actionable error message that includes the actual value received.

examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml (2)

1-11: LGTM! Well-documented configuration for MoE model GRPO training.

The configuration properly leverages new PR features (calculate_advantages_on_gpu: true) and includes helpful comments explaining the rationale for parameter choices, particularly around MoE-specific data volume requirements.


70-99: Megatron configuration is well-structured with appropriate parallelism settings.

The parallelism configuration (TP=4, CP=2, EP=8) is properly documented with explanatory comments. The MoE-specific settings (router dtype, load balancing, permute fusion) follow best practices for stability.

nemo_rl/utils/memory_tracker.py (1)

42-51: LGTM! Clear and informative memory snapshot output.

The formatted output includes CPU memory deltas, new variable tracking, and Ray memory summary in a readable format suitable for debugging.

nemo_rl/utils/logger.py (3)

96-105: Interface signature updated for step completion semantics.

Adding step_finished: bool = False to the abstract interface enables backends that support explicit step completion (like W&B) to commit metrics at appropriate boundaries. The default False maintains backward compatibility.


366-371: LGTM! Explicit commit control for W&B step boundaries.

The step_finished handling with explicit commit=True ensures metrics are flushed at step boundaries, independent of W&B's default behavior. The comment explains the rationale well.


204-210: Hardcoded Ray session path is acceptable for debug-only feature, but consider using os.path.realpath() for robustness.

The hardcoded /tmp/ray/session_latest/logs/ path is Ray's standard session directory location and is flagged by static analysis (S108). Since this code is gated behind RAY_BACKEND_LOG_LEVEL=debug, the security risk is minimal. However, for improved robustness, consider using os.path.realpath("/tmp/ray/session_latest") to resolve the symlink programmatically rather than hardcoding the path.

nemo_rl/algorithms/loss_functions.py (1)

171-173: LGTM: conditional reference‑logprobs extraction.

The guard avoids unnecessary slicing when KL penalty is zero and matches the skip flag behavior.

examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (4)

204-204: Tool parser placeholder is clear.

The inline note makes it obvious where to set the Nemotron Nano v2 plugin path.


230-231: Dataset path updates look consistent.

Paths now point into the Gym workspace layout.


239-248: Trajectory‑collection toggle + multi‑env hints are clear.

Nice to expose the trajectory-only mode and the multi‑env examples.


257-260: code_gen worker scaling looks good.

The per‑node scaling via ${mul:64, ${cluster.num_nodes}} matches cluster sizing.

nemo_rl/algorithms/grpo.py (6)

74-75: MemoryTracker integration and early‑stage snapshots look solid.

Import/initialization plus the early-stage snapshots and the skip‑reference‑logprobs guard make the loop easier to observe.

Also applies to: 1044-1151


993-1015: Centralized mixed‑reward/advantage logging is great.

Wrapping histograms and derived metrics in a helper keeps logging consistent across backends.


1224-1225: Good cleanup after NeMo‑Gym rollout + mean‑gen token capture.

Deleting the rollout result early and caching mean_gen_tokens_per_sample keeps memory pressure down while preserving logging.

Also applies to: 1276-1281


1359-1400: Mixed‑reward logging call + tensor cleanup look good.

Nice to log baseline/advantage stats and then promptly delete large tensors.


1428-1456: Isolated logprob_data + conditional ref‑logprobs is tidy.

Building a minimal logprob_data and skipping reference logprobs when configured keeps memory/time down.


1458-1463: Stage snapshots, step_finished logging, and cleanup are well placed.

The added snapshots around validation/metrics/checkpointing/logging plus the step_finished timing log and final deletions improve observability and memory hygiene.

Also applies to: 1488-1501, 1521-1531, 1598-1601, 1745-1747, 1783-1789, 1795-1816


@coderabbitai

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

Walkthrough

This PR removes Gym-workspace packaging artifacts, updates naming documentation from "Penguin" to "NeMo Gym", introduces GPU-based advantage calculation in GRPO training with memory tracking instrumentation, optimizes the packing algorithm from O(n·m) to O(n·log n), and enhances logging with conditional metrics and step-level tracking.

Changes

Cohort / File(s) — Summary
  • Gym-workspace cleanup (3rdparty/Gym-workspace/is_nemo_gym_installed.py, 3rdparty/Gym-workspace/pyproject.toml, 3rdparty/Gym-workspace/setup.py) — Removed three packaging/setup files: the nemo_gym installation detector, pyproject metadata, and setup script.
  • Workspace configuration (pyproject.toml) — Updated the UV workspace path from 3rdparty/Gym-workspace to 3rdparty/Gym-workspace/Gym.
  • Documentation (docs/design-docs/dependency-management.md) — Replaced the submodule name "Penguin" with "NeMo Gym" in the dependency management design doc.
  • Example configurations (examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml) — Deleted a comprehensive GRPO experiment configuration (278 lines).
  • Example configurations (examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml) — Added a new YAML configuration for GRPO with the Qwen3-30B-Instruct model, including MoE and distributed setup.
  • Example configurations (examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml, examples/nemo_gym/launch_nemo_gym_multinode_training.sh) — Updated data paths, added a trajectory collection flag, enabled auto-tools and the nemotron_json parser; added SLURM node count and HF_TOKEN support to the launch script.
  • GRPO algorithm (nemo_rl/algorithms/grpo.py) — Added a GPU-based advantage calculation option, a reference policy logprobs skip flag, a NeMo Gym response logging toggle, a centralized reward/advantage logging helper, and comprehensive MemoryTracker instrumentation across training phases.
  • Loss functions (nemo_rl/algorithms/loss_functions.py) — Made reference policy logprobs loading conditional on a non-zero KL penalty.
  • Data packing optimization (nemo_rl/data/packing/algorithms.py) — Replaced O(n·m) leftover bin packing with an O(n·log n) bisect-based approach in ModifiedFirstFitDecreasingPacker.
  • Environment (nemo_rl/environments/nemo_gym.py) — Added an assertion guard to enforce a dict-type nemo_gym_result at postprocessing entry.
  • Logging interfaces (nemo_rl/utils/logger.py) — Added a step_finished parameter to all log_metrics signatures across logger backends; new log_string_list_as_jsonl() method; WandB debug mode for raylet output based on an environment variable.
  • Memory tracking utility (nemo_rl/utils/memory_tracker.py) — New module with MemoryTrackerDataPoint and MemoryTracker classes for tracking Ray driver process memory across training stages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

  • Modification of Gym-workspace packaging and integration (3rdparty/Gym-workspace/ setup files and workspace configuration updates).
  • Enhancement of GRPO algorithm logging and metrics collection in nemo_rl/algorithms/grpo.py (advantages logging, new flags).
  • Optimization of packing algorithm performance in nemo_rl/data/packing/algorithms.py (O(n·log n) vs O(n·m) refactor).

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • yfw
  • chtruong814
🚥 Pre-merge checks | ✅ 2 passed | ❌ 2 warnings
❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 63.16%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes ⚠️ Warning — A PR with major algorithm and optimization changes lacks test results, performance benchmarks, or evidence that tests were executed before submission. Resolution: include test execution results, before-and-after performance metrics for the bin-packing optimization, convergence verification, and memory tracking validation in the PR description.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'feat: NeMo Gym refresh 20260113' accurately describes the main change: a refresh of the NeMo Gym integration with optimizations, refactoring, and new features. It is specific enough to convey the primary focus.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/utils/logger.py (1)

926-938: Update mock logger signature in test file.

The log_metrics mock at tests/unit/utils/test_logger.py:746 is missing the step_finished parameter. Update it to match the backend signatures:

def log_metrics(self, metrics, step, prefix="", step_metric=None, step_finished=False):

All backend implementations (TensorBoard, WandB, MLflow, Neptune) and the LoggerHandler already include this parameter.

🤖 Fix all issues with AI agents
In `@examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml`:
- Line 91: In the inline comment on `moe_router_load_balancing_type: none`,
correct the year typo from "Jan 06, 2025" to "Jan 06, 2026" so the comment
matches the PR date; change only the year in the comment text.

In `@examples/nemo_gym/launch_nemo_gym_multinode_training.sh`:
- Around lines 25-27: The script currently embeds HF_TOKEN directly in the
constructed command (see the HF_HOME/HF_TOKEN/WANDB_API_KEY block and the
echoed command at line 36). Instead, pass HF_TOKEN to sbatch via its
environment export (e.g., use --export=HF_TOKEN, or export HF_TOKEN before
calling sbatch), and remove the token from the echoed command string so it
does not appear in logs or process listings. Reference $HF_TOKEN inside the
ray.sub environment setup rather than embedding the literal token in the
command.
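A minimal sketch of the suggested pattern, with placeholder script and config names (the real launch script and ray.sub arguments differ):

```shell
# Assumption: HF_TOKEN is set by the caller; the placeholder below is
# only so this sketch runs standalone.
export HF_TOKEN="${HF_TOKEN:-hf_placeholder_token}"

# Build the command WITHOUT the token so logs and `ps` output stay clean.
CMD="uv run examples/run_grpo_nemo_gym.py --config my_config.yaml"
echo "Launching: ${CMD}"

# Let sbatch forward the variable instead of embedding its value:
#   sbatch --export=ALL,HF_TOKEN ray.sub "${CMD}"
# Inside ray.sub, reference "$HF_TOKEN" during environment setup.
```

The key property is that the token travels through the environment, never through the command string that gets echoed or stored in SLURM job metadata.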

In `@nemo_rl/algorithms/grpo.py`:
- Around lines 1299-1313: The code hardcodes device_id = 0 before calling
calculate_baseline_and_std_per_prompt; replace this with device selection
that respects the current CUDA context or configuration. Use
torch.cuda.current_device() (or
device_id_to_physical_device_id(torch.cuda.current_device()) if
CUDA_VISIBLE_DEVICES remapping is needed), or read a device index from
master_config["grpo"] (e.g., "gpu_device") and map it with
device_id_to_physical_device_id when present. Pass that device to
input_ids.cuda(...), rewards.cuda(...), and torch.ones_like(...). Preserve
the existing branch behavior when calculate_advantages_on_gpu is true, and
fall back to CPU when CUDA is unavailable.
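For illustration, the CUDA_VISIBLE_DEVICES remapping this comment refers to works roughly like this (a standalone sketch; the real helper lives in NeMo RL and may differ):

```python
import os

def device_id_to_physical_device_id(device_id: int) -> int:
    """Map a logical CUDA device index to the physical GPU id.

    When CUDA_VISIBLE_DEVICES is set (e.g. "2,3"), frameworks renumber
    the visible GPUs starting from 0, so logical device 0 is physical
    GPU 2. Without the variable, the mapping is the identity.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible.strip():
        return device_id
    physical_ids = [int(x) for x in visible.split(",")]
    return physical_ids[device_id]
```

This is why a hardcoded `device_id = 0` can silently target the wrong physical GPU when the process was launched with a restricted device list.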

In `@nemo_rl/data/packing/algorithms.py`:
- Around lines 631-649: When `leftovers` is empty, the packing loop leaves
`ffd_bins = [[]]`, so an empty bin is appended to `bins`. Either initialize
`ffd_bins = []` / `ffd_bin_sizes = []`, or skip `bins.extend(ffd_bins)` when
`ffd_bins` contains only an empty list, so no empty bin is added. Also, the
current pop(0)/insert usage makes the routine O(n^2); maintain bins by size
with a proper O(n log n) data structure (e.g., `heapq` or
`sortedcontainers.SortedList`) to restore the claimed complexity. Finally,
either remove the commented-out block near the earlier lines or add a clear
comment explaining why it was disabled.
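As a sketch of one genuinely O(n log n) alternative — worst-fit decreasing with a heap of remaining capacities, not necessarily the variant NeMo RL ends up using:

```python
import heapq

def worst_fit_decreasing(sizes, capacity):
    """Pack item sizes into bins of a fixed capacity.

    Largest items first; each item goes into the bin with the most
    remaining space, or opens a new bin if it does not fit anywhere.
    Every heap operation is O(log n), so the pass is O(n log n) after
    the initial sort.
    """
    heap = []   # entries: (-remaining_capacity, bin_index)
    bins = []   # bins[i] is a list of item indices
    for idx, size in sorted(enumerate(sizes), key=lambda p: -p[1]):
        if heap and -heap[0][0] >= size:
            neg_rem, bin_idx = heapq.heappop(heap)
            bins[bin_idx].append(idx)
            heapq.heappush(heap, (neg_rem + size, bin_idx))
        else:
            bins.append([idx])
            heapq.heappush(heap, (size - capacity, len(bins) - 1))
    return bins
```

With empty input this returns `[]`, sidestepping the empty-bin edge case: bins are only ever created when an item is placed in them.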

In `@nemo_rl/utils/memory_tracker.py`:
- Line 1: Update the copyright year from 2025 to 2026 in the header at the
top of nemo_rl/utils/memory_tracker.py.
- Around lines 30-40: The properties mem_used_diff_gb and new_variables can
raise TypeError when the "after" fields are None. In mem_used_diff_gb, guard
memory_used_after_stage_gb and memory_used_before_stage_gb (e.g., treat
missing values as 0.0 and always return a float); in new_variables, guard
variables_after_stage and variables_before_stage (e.g., treat None as an
empty collection and return an empty list when there are no new variables).
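A sketch of the guarded properties, using the field names from the review comment (the real MemoryTrackerDataPoint is a Pydantic model with more fields):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MemoryTrackerDataPoint:
    memory_used_before_stage_gb: float = 0.0
    # "After" fields are filled in by the NEXT snapshot, so they can be
    # None on the most recent data point.
    memory_used_after_stage_gb: Optional[float] = None
    variables_before_stage: List[str] = field(default_factory=list)
    variables_after_stage: Optional[List[str]] = None

    @property
    def mem_used_diff_gb(self) -> float:
        # Treat a missing "after" snapshot as no change.
        if self.memory_used_after_stage_gb is None:
            return 0.0
        return self.memory_used_after_stage_gb - self.memory_used_before_stage_gb

    @property
    def new_variables(self) -> List[str]:
        # Treat None as "no snapshot yet" and return an empty list.
        after = set(self.variables_after_stage or [])
        before = set(self.variables_before_stage or [])
        return sorted(after - before)
```

Returning typed defaults (0.0, []) instead of raising keeps downstream logging code free of per-field None checks.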
🧹 Nitpick comments (7)
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (1)

1-1: Update copyright year to 2026.

The copyright header shows 2025, but per coding guidelines it should include the current year (2026).

-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml (2)

26-26: Stale date in comment.

The comment references "Mon Oct 13" without a year. Given the PR is from January 2026, consider updating for clarity.

-  # As of Mon Oct 13, token level loss as formulated in the GSPO paper is not yet supported in NeMo RL.
+  # As of January 2026, token level loss as formulated in the GSPO paper is not yet supported in NeMo RL.

103-104: Stale date in comment.

Similar to line 26, this comment references "Mon Oct 13" without a year.

-      # As of Mon Oct 13, we default to 2e-6 here, but it's possible this value may increase/decrease depending on our subsequent observations.
+      # As of January 2026, we default to 2e-6 here, but it's possible this value may increase/decrease depending on our subsequent observations.
nemo_rl/data/packing/algorithms.py (1)

616-628: Consider removing the commented-out original implementation.

Per coding guidelines, commented-out code should include a reason for being kept, or be removed before merging. If retained for reference, add a brief note explaining its purpose (e.g., "Kept for complexity comparison" or "Fallback if issues arise"). Otherwise, remove it to reduce clutter since it's preserved in version control.

nemo_rl/utils/memory_tracker.py (1)

54-59: Declare _process as a Pydantic PrivateAttr for proper handling.

Setting _process in model_post_init without declaring it may cause issues with Pydantic's validation and serialization. Use PrivateAttr for private attributes.

Proposed fix
+from pydantic import BaseModel, Field, PrivateAttr
-from pydantic import BaseModel, Field

 class MemoryTracker(BaseModel):
     data_points: List[MemoryTrackerDataPoint] = Field(default_factory=list)
+    _process: Process = PrivateAttr()

     def model_post_init(self, context):
         self._process = Process(os.getpid())
-        return super().model_post_init(context)
nemo_rl/algorithms/grpo.py (2)

1577-1580: Consider logging level for aggregation skip message.

The print statement for skipped aggregation values is helpful for debugging but may be noisy in production. Consider using a debug-level log or removing after stabilization.


1804-1806: Simplify variable existence check.

Using "val_metrics" in dir() is unconventional. Since val_metrics is always assigned earlier in the loop (line 1130), this check is unnecessary. If you need conditional deletion, use locals() or a simple try/except.

Proposed simplification
-            if "val_metrics" in dir():
-                del val_metrics
+            del val_metrics

Or if the intent is to handle cases where it might not exist:

-            if "val_metrics" in dir():
-                del val_metrics
+            try:
+                del val_metrics
+            except NameError:
+                pass
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0b1a91 and 1a3982f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • 3rdparty/Gym-workspace/is_nemo_gym_installed.py
  • 3rdparty/Gym-workspace/pyproject.toml
  • 3rdparty/Gym-workspace/setup.py
  • docs/design-docs/dependency-management.md
  • examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/data/packing/algorithms.py
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/utils/logger.py
  • nemo_rl/utils/memory_tracker.py
  • pyproject.toml
💤 Files with no reviewable changes (4)
  • examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
  • 3rdparty/Gym-workspace/setup.py
  • 3rdparty/Gym-workspace/is_nemo_gym_installed.py
  • 3rdparty/Gym-workspace/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (6)
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section

Files:

  • docs/design-docs/dependency-management.md
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • docs/design-docs/dependency-management.md
  • nemo_rl/utils/memory_tracker.py
  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
  • nemo_rl/algorithms/loss_functions.py
  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
  • nemo_rl/algorithms/grpo.py
  • pyproject.toml
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/data/packing/algorithms.py
  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
  • nemo_rl/utils/logger.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/data/packing/algorithms.py
  • nemo_rl/utils/logger.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/data/packing/algorithms.py
  • nemo_rl/utils/logger.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • nemo_rl/utils/memory_tracker.py
  • nemo_rl/algorithms/loss_functions.py
  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/environments/nemo_gym.py
  • nemo_rl/data/packing/algorithms.py
  • nemo_rl/utils/logger.py
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
🧠 Learnings (5)
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).

Applied to files:

  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • examples/nemo_gym/launch_nemo_gym_multinode_training.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging

Applied to files:

  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (2)
nemo_rl/utils/memory_tracker.py (2)
  • MemoryTracker (54-80)
  • snapshot_start_of_stage (61-80)
nemo_rl/models/policy/interfaces.py (2)
  • get_logprobs (54-66)
  • get_reference_policy_logprobs (69-83)
nemo_rl/data/packing/algorithms.py (1)
tests/unit/data/packing/test_algorithms.py (1)
  • bin_capacity (63-65)
nemo_rl/utils/logger.py (1)
tests/unit/utils/test_logger.py (1)
  • log_metrics (746-750)
🪛 Ruff (0.14.11)
nemo_rl/utils/logger.py

148-148: Unused method argument: step_finished

(ARG002)


209-209: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.out"

(S108)


210-210: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.err"

(S108)


422-422: Unused method argument: step_finished

(ARG002)


800-800: Unused method argument: step_finished

(ARG002)

🔇 Additional comments (28)
docs/design-docs/dependency-management.md (1)

164-164: LGTM! Terminology updated consistently.

The replacement of "Penguin" with "NeMo Gym" is correctly applied in both locations, maintaining consistency throughout the documentation.

Also applies to: 196-196

examples/nemo_gym/launch_nemo_gym_multinode_training.sh (1)

40-46: LGTM - Good use of default parameter expansion.

The ${NUM_SLURM_NODES:-$NUM_ACTOR_NODES} pattern provides a sensible default when NUM_SLURM_NODES is not set, allowing flexibility while maintaining backward compatibility.

examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml (3)

17-19: Verify symmetric clipping is intentional.

Both ratio_clip_min and ratio_clip_max are set to 3e-4, which produces symmetric clipping. This is atypical - usually these define an asymmetric range. Please confirm this is the intended configuration for GSPO.


70-99: Well-structured MoE parallelism configuration.

The Megatron configuration for this MoE model is comprehensive and internally consistent:

  • tensor_model_parallel_size: 4 × context_parallel_size: 2 = 8 GPUs per node ✓
  • expert_model_parallel_size: 8 matches GPUs per node
  • sequence_parallel: true correctly enabled for expert parallel
  • Performance optimizations (moe_permute_fusion, apply_rope_fusion, bias_activation_fusion) are enabled with documented speedup expectations

147-155: Thorough checkpointing timing rationale.

The checkpoint timing configuration is well-documented with clear reasoning for the 3.5-hour checkpoint_must_save_by value given the 4-hour SLURM timeout, validation overhead, step time, and checkpoint save duration.

nemo_rl/data/packing/algorithms.py (1)

21-21: LGTM!

The bisect import is appropriate for the new sorted-insertion approach.

pyproject.toml (1)

181-181: LGTM!

The workspace member path update to 3rdparty/Gym-workspace/Gym aligns with the broader Gym submodule reorganization described in the PR.

nemo_rl/algorithms/loss_functions.py (1)

171-172: LGTM!

The conditional guard correctly avoids loading reference_policy_logprobs when KL penalty is zero. The variable is only used within the corresponding if self.reference_policy_kl_penalty != 0: block at line 275, so there's no risk of an undefined variable error.

nemo_rl/environments/nemo_gym.py (1)

151-153: LGTM!

Good defensive check to catch non-successful NeMo Gym responses early. This provides a clear error message instead of a cryptic KeyError or TypeError when accessing nested dict keys later in the function.
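The guard pattern looks roughly like this (a sketch; the actual message and surrounding code in nemo_gym.py differ):

```python
def postprocess_nemo_gym_result(nemo_gym_result):
    """Fail fast with a clear message instead of a KeyError deep inside."""
    assert isinstance(nemo_gym_result, dict), (
        "Expected a dict from NeMo Gym but got "
        f"{type(nemo_gym_result).__name__}: {nemo_gym_result!r}"
    )
    # ... nested-key access is now safe to attempt ...
    return nemo_gym_result.get("reward", 0.0)
```

Including the repr of the bad value in the message means a failed rollout surfaces what the environment actually returned, not just where the access failed.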

examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (3)

230-231: LGTM!

Data paths correctly updated to the new 3rdparty/Gym-workspace/Gym/data/ structure, consistent with the workspace reorganization.


239-248: LGTM!

The is_trajectory_collection flag is well-documented with inline guidance. The commented multi-environment config paths include a clear explanatory comment about their usage with ng_prepare_data.


257-260: LGTM!

New code_gen section properly configures num_processes using the ${mul:} resolver for scaling with cluster size.

nemo_rl/utils/logger.py (4)

102-103: LGTM!

The step_finished parameter provides a clean interface extension for signaling step completion to logging backends. The default value of False maintains backward compatibility.


204-210: Raylet log upload conditional on debug mode.

The hardcoded /tmp/ray/session_latest/logs/ path is Ray's standard log location, so this is acceptable. The conditional guard on RAY_BACKEND_LOG_LEVEL=debug ensures this only triggers during explicit debugging sessions.


366-371: LGTM!

The explicit commit=True when step_finished is set ensures W&B flushes metrics at step boundaries. The comment clearly documents the rationale for overriding W&B's default commit behavior.


975-991: LGTM!

Clean utility method for appending string lists to JSONL files. The append mode ('a') is appropriate for incremental logging across steps.
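The append-mode JSONL pattern can be sketched as follows (simplified; the real method also handles the logger's directory layout):

```python
import json
import os

def log_string_list_as_jsonl(strings, path):
    """Append one JSON-encoded string per line to a JSONL file.

    Append mode ('a') lets successive training steps accumulate into
    the same file without rereading or rewriting earlier entries.
    """
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a") as f:
        for s in strings:
            f.write(json.dumps(s) + "\n")
```

Because each line is independently parseable JSON, a partially written file from an interrupted run still yields every completed entry.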

nemo_rl/utils/memory_tracker.py (1)

61-80: LGTM!

The snapshot_start_of_stage method correctly handles memory tracking by capturing RSS memory, updating the previous data point's after-stage metrics, and appending a new data point for the current stage.

nemo_rl/algorithms/grpo.py (11)

128-128: LGTM!

The new skip_reference_policy_logprobs_calculation config option is properly typed as NotRequired[bool] and enables an optimization when KL penalty is zero.


143-144: LGTM!

The calculate_advantages_on_gpu config option is properly documented and typed.


908-915: LGTM!

Simple and clear helper function for checking the NeMo Gym response logging configuration.


993-1015: LGTM!

The centralized logging helper properly uses the logger's log_histogram method (addressing the past review feedback) and calculates useful metrics for baseline rewards and advantages.


1437-1456: LGTM!

The conditional skip of reference policy logprobs calculation is properly guarded by the config check, and the early assertion (lines 1056-1057) ensures this optimization is only used when KL penalty is zero.


1783-1789: LGTM!

The step_finished=True parameter correctly signals the end of the training step for the logger, addressing the step-level tracking mentioned in the PR objectives.


1044-1044: LGTM!

The MemoryTracker is properly instantiated and used throughout the training loop to track memory usage across different phases.


1224-1230: LGTM!

The opt-out logic for NeMo Gym response logging correctly removes large full_result fields when logging is disabled, reducing runtime overhead for large rollouts.


1673-1688: LGTM!

The conditional logging correctly skips detailed train data logging when NeMo Gym response logging is disabled.


1367-1369: LGTM!

Explicit deletions after variables are no longer needed helps reduce CPU memory footprint during the training loop, as noted in the PR objectives.


1056-1060: LGTM!

The assertion ensures the skip_reference_policy_logprobs_calculation optimization is only used when KL penalty is zero, with a clear informative message.


@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Jan 16, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: e9191aa (PR #1773 from bxyu/nemo-gym-refresh-20260113)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@bxyu-nvidia bxyu-nvidia added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 16, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: c460020 (PR #1773 from bxyu/nemo-gym-refresh-20260113)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@bxyu-nvidia bxyu-nvidia added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 16, 2026
@bxyu-nvidia bxyu-nvidia added the CI:L1 Run doctests, unit tests, and functional tests label Jan 16, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: c5e6b33 (PR #1773 from bxyu/nemo-gym-refresh-20260113)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 346dd62 (PR #1773 from bxyu/nemo-gym-refresh-20260113)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: e015daa (PR #1773 from bxyu/nemo-gym-refresh-20260113)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Brian Yu <bxyu@nvidia.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 51b5f98 (PR #1773 from bxyu/nemo-gym-refresh-20260113)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@bxyu-nvidia bxyu-nvidia added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 17, 2026
@yfw yfw merged commit 2311de4 into main Jan 18, 2026
41 of 42 checks passed
@yfw yfw deleted the bxyu/nemo-gym-refresh-20260113 branch January 18, 2026 03:43
yfw pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
xavier-owkin pushed a commit to owkin/Owkin-NeMo-RL that referenced this pull request Feb 10, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 21, 2026
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation
