Signed-off-by: Brian Yu <bxyu@nvidia.com>
📝 Walkthrough

This PR removes the Gym-workspace build infrastructure (setup.py, pyproject.toml, installation check script), refactors GRPO training with memory-tracking instrumentation and improved logging, introduces a MemoryTracker utility for memory profiling, optimizes FFD bin packing from O(n·m) to O(n log n), and updates workspace and example configurations accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Training Loop
    participant MemoryTracker
    participant GRPO Logic
    participant Logger
    participant GPU Memory
    Training Loop->>MemoryTracker: snapshot_start_of_stage("Initial validation")
    MemoryTracker->>GPU Memory: Query current RSS memory
    MemoryTracker->>Logger: Print memory snapshot
    Training Loop->>GRPO Logic: Run validation
    Training Loop->>MemoryTracker: snapshot_start_of_stage("Preparing batch")
    MemoryTracker->>GPU Memory: Capture memory after validation
    MemoryTracker->>Logger: Log stage delta
    Training Loop->>GRPO Logic: Generate sequences
    Training Loop->>MemoryTracker: snapshot_start_of_stage("Generation")
    alt calculate_advantages_on_gpu enabled
        Training Loop->>GPU Memory: Calculate advantages on GPU
    else
        Training Loop->>GRPO Logic: Calculate advantages on CPU
    end
    Training Loop->>GRPO Logic: Compute logprobs
    alt skip_reference_policy_logprobs_calculation disabled
        GRPO Logic->>GRPO Logic: Extract reference policy logprobs
    end
    Training Loop->>GRPO Logic: Policy training
    Training Loop->>Logger: Log metrics with step_finished flag
    Logger->>Logger: Commit step if needed
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pyproject.toml (1)
176-182: The workspace member path is incomplete: `pyproject.toml` is missing. Line 181 references `3rdparty/Gym-workspace/Gym` as a workspace member, but this directory does not contain a `pyproject.toml` file. uv workspace members require a `pyproject.toml` at their root to be resolved. This will cause workspace resolution to fail.

nemo_rl/utils/logger.py (1)
926-938: Update the test mock `log_metrics` signature to include the `step_finished` parameter. The test mock at `tests/unit/utils/test_logger.py:746` has signature `def log_metrics(self, metrics, step, prefix="", step_metric=None)`, but the implementation at line 938 calls `logger.log_metrics(metrics, step, prefix, step_metric, step_finished)` with five positional arguments. This mismatch will cause a `TypeError` at runtime when tests execute.
🤖 Fix all issues with AI agents
In `@examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml`:
- Around line 26-27: The comment above the token_level_loss setting contains a
hardcoded date ("Mon Oct 13") that will become stale; update the comment near
token_level_loss to use a general phrasing (e.g., "token-level loss not yet
supported in NeMo RL as of this writing" or remove the date entirely) so it
doesn't reference a specific calendar date and remains accurate as the code
evolves.
In `@nemo_rl/algorithms/grpo.py`:
- Around line 908-915: The helper _should_log_nemo_gym_responses currently
treats a missing env flag as False; change it to opt out only when the flag is
explicitly False and also require NeMo-Gym is in use: compute env_val =
env_config.get("should_log_nemo_gym_responses") and return
_should_use_nemo_gym(master_config) and (env_val is not False) so missing key
yields logging enabled; apply the same explicit False-check +
_should_use_nemo_gym(...) gating to the other occurrences of this pattern (the
similar blocks found elsewhere in this file).
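The gating described in this prompt can be reduced to a small helper. The sketch below is a hypothetical illustration, not the repo's actual code: the `uses_nemo_gym` argument stands in for the result of the repo's `_should_use_nemo_gym(master_config)` call.

```python
def should_log_nemo_gym_responses(master_config: dict, uses_nemo_gym: bool) -> bool:
    """Sketch of the suggested opt-out semantics (hypothetical helper).

    A missing flag means "enabled"; only an explicit False opts out,
    and NeMo-Gym must actually be in use for logging to apply.
    """
    env_config = master_config.get("env", {})
    env_val = env_config.get("should_log_nemo_gym_responses")
    # `env_val is not False` is True for both a missing key (None) and True
    return uses_nemo_gym and (env_val is not False)
```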
- Around line 1292-1322: The GPU path hard-codes device_id = 0 which will fail
on CPU-only or different-GPU setups; update the block that calls
calculate_baseline_and_std_per_prompt (the branch gated by
master_config["grpo"].get("calculate_advantages_on_gpu")) to first check
torch.cuda.is_available() and otherwise fall back to CPU, and when CUDA is
available infer the target device from existing tensors (e.g., use
input_ids.device or rewards.device or set device = torch.device("cuda") if
available) and move tensors with .to(device) instead of using .cuda(0); keep the
leave_one_out_baseline logic and ensure baseline/std are moved back to CPU with
.cpu() as before if downstream expects CPU tensors.
In `@nemo_rl/data/packing/algorithms.py`:
- Around line 631-648: The current initialization of ffd_bins and ffd_bin_sizes
as [[]] and [0] causes an empty bin to be emitted when leftovers is empty;
change initialization to ffd_bins = [] and ffd_bin_sizes = [] and ensure the
loop that processes leftovers (iterating over sorted(leftovers, ...)) correctly
creates the first bin when needed (using the same new_bin/new_bin_size logic and
bisect insertion into ffd_bin_sizes) so no pre-existing empty bin is appended to
bins when leftovers is empty; update any code that assumed a starting bin to
handle empty lists accordingly (references: leftovers, ffd_bins, ffd_bin_sizes,
bisect insertion).
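With the empty-list initialization described above, the leftover-packing routine might look like the following sketch. Names mirror the review comment, but the function boundary and signature are assumptions; note that `list.pop`/`list.insert` are O(n), so a `SortedList` would be needed for a strict O(n log n) bound.

```python
import bisect

def pack_leftovers(leftovers, bin_capacity):
    """Best-fit-decreasing sketch of the suggested fix (hypothetical names).

    leftovers: list of (idx, size) pairs; returns a list of bins, each a
    list of idx values. Bins start empty, so no empty bin is emitted when
    leftovers is empty.
    """
    ffd_bins = []       # bins, kept aligned with ffd_bin_sizes
    ffd_bin_sizes = []  # current bin sizes, sorted ascending
    for idx, size in sorted(leftovers, key=lambda x: x[1], reverse=True):
        # fullest existing bin that still fits this item
        pos = bisect.bisect_right(ffd_bin_sizes, bin_capacity - size) - 1
        if pos >= 0:
            items = ffd_bins.pop(pos)
            new_size = ffd_bin_sizes.pop(pos) + size
            items.append(idx)
        else:
            items, new_size = [idx], size  # open a new bin on demand
        ins = bisect.bisect_left(ffd_bin_sizes, new_size)
        ffd_bin_sizes.insert(ins, new_size)  # O(n) insert; SortedList for true O(log n)
        ffd_bins.insert(ins, items)
    return ffd_bins
```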
In `@nemo_rl/utils/memory_tracker.py`:
- Around line 30-40: The mem_used_diff_gb and new_variables properties can raise
TypeError when accessed while memory_used_after_stage_gb or
variables_after_stage are None; in the mem_used_diff_gb property (and in
new_variables) add defensive checks for None (e.g., if
self.memory_used_after_stage_gb is None or self.memory_used_before_stage_gb is
None) and either return a safe default (0.0 for mem diff, [] for new_variables)
or raise a clear ValueError explaining the field is unset; update the property
implementations in nemo_rl.utils.memory_tracker (mem_used_diff_gb and
new_variables) so callers of get_snapshot_str or direct property access get a
predictable error/return instead of a TypeError.
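A minimal sketch of the defensive properties, using a plain dataclass as a stand-in for the Pydantic model (field names follow the review; the class itself is hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MemorySnapshot:
    # Hypothetical mirror of MemoryTrackerDataPoint's fields
    memory_used_before_stage_gb: float = 0.0
    memory_used_after_stage_gb: Optional[float] = None
    variables_before_stage: Optional[List[str]] = None
    variables_after_stage: Optional[List[str]] = None

    @property
    def mem_used_diff_gb(self) -> float:
        # Safe default instead of a TypeError when the "after" field is unset
        if self.memory_used_after_stage_gb is None:
            return 0.0
        return self.memory_used_after_stage_gb - self.memory_used_before_stage_gb

    @property
    def new_variables(self) -> List[str]:
        # Treat None as "no variables recorded" rather than raising
        before = set(self.variables_before_stage or [])
        return [v for v in (self.variables_after_stage or []) if v not in before]
```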
🧹 Nitpick comments (8)
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (3)
1-1: Missing shebang line. Per the Google Shell Style Guide, shell scripts should start with a shebang specifying the interpreter. This ensures consistent execution regardless of how the script is invoked.
Proposed fix
```diff
+#!/bin/bash
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
1-1: Update copyright year to 2026. As per coding guidelines, the NVIDIA copyright header should include the current year.
Proposed fix
```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
19-19: Quote the variable to prevent word splitting. If `REPO_LOCATION` contains spaces or special characters, this `cd` command will fail. Per the Google Shell Style Guide, always quote variables unless you specifically need word splitting.

Proposed fix

```diff
-cd $REPO_LOCATION
+cd "$REPO_LOCATION"
```

nemo_rl/data/packing/algorithms.py (1)
616-628: Consider removing or properly documenting the old implementation. The old O(n·m) implementation is preserved as a triple-quoted string, which is unconventional. Per coding guidelines, commented-out code should include a comment explaining why it's kept.
Consider either:
- Removing the old code entirely (it's in git history if needed)
- Converting to a proper comment with explanation of why it's preserved
nemo_rl/utils/memory_tracker.py (1)
54-59: Consider declaring `_process` as a Pydantic private attribute. The `_process` attribute is assigned in `model_post_init` but not declared as a Pydantic `PrivateAttr`. While this works because Pydantic ignores underscore-prefixed attributes by default, explicitly declaring it improves clarity and ensures proper handling.

💡 Suggested improvement

```diff
+from pydantic import BaseModel, Field, PrivateAttr
+
 class MemoryTracker(BaseModel):
     data_points: List[MemoryTrackerDataPoint] = Field(default_factory=list)
+    _process: Process = PrivateAttr()

     def model_post_init(self, context):
         self._process = Process(os.getpid())
         return super().model_post_init(context)
```

nemo_rl/utils/logger.py (1)
975-991: Method name may be misleading: no JSON validation is performed. The method `log_string_list_as_jsonl` writes raw strings to a file without validating they are valid JSON. The caller is responsible for providing properly formatted JSON strings. Consider either:
- Adding JSON validation/serialization
- Renaming to `log_string_list_to_file`, or adding a docstring note that strings must be pre-formatted as JSON
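To illustrate the first option, a hypothetical validating variant could look like this; the signature is simplified from the method under review (a bare path instead of a log-directory-relative filename):

```python
import json

def log_string_list_as_jsonl(to_log, path):
    """Sketch of the hardened variant (hypothetical): each string is parsed
    with json.loads before writing, so malformed input fails fast instead of
    silently producing an invalid .jsonl file."""
    with open(path, "w") as f:
        for line in to_log:
            json.loads(line)  # raises json.JSONDecodeError on bad input
            f.write(line + "\n")
```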
💡 Suggested docstring clarification

```diff
 def log_string_list_as_jsonl(self, to_log: list[str], filename: str) -> None:
     """Log a list of strings to a JSONL file.

     Args:
         to_log: list of strings to log
+            Note: Each string should be a valid JSON object.
+            No validation is performed.
         filename: Filename to log to (within the log directory)
     """
```

nemo_rl/algorithms/grpo.py (2)
128-145: Document new GRPOConfig fields + explicit defaults. Please add purpose/valid values/defaults for `skip_reference_policy_logprobs_calculation` and `calculate_advantages_on_gpu`, and surface explicit defaults in exemplar YAMLs to avoid implicit code defaults. As per coding guidelines, please document new config keys and defaults.
1577-1580: Optional: avoid noisy per-metric prints. The new "Skipping aggregation…" prints will fire for every scalar/dict metric each step. Consider gating behind a debug flag or handling scalars explicitly to keep logs clean.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- 3rdparty/Gym-workspace/is_nemo_gym_installed.py
- 3rdparty/Gym-workspace/pyproject.toml
- 3rdparty/Gym-workspace/setup.py
- docs/design-docs/dependency-management.md
- examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
- examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
- examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
- examples/nemo_gym/launch_nemo_gym_multinode_training.sh
- nemo_rl/algorithms/grpo.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/data/packing/algorithms.py
- nemo_rl/environments/nemo_gym.py
- nemo_rl/utils/logger.py
- nemo_rl/utils/memory_tracker.py
- pyproject.toml
💤 Files with no reviewable changes (4)
- 3rdparty/Gym-workspace/pyproject.toml
- 3rdparty/Gym-workspace/setup.py
- 3rdparty/Gym-workspace/is_nemo_gym_installed.py
- examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
🧰 Additional context used
📓 Path-based instructions (6)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
examples/nemo_gym/launch_nemo_gym_multinode_training.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- examples/nemo_gym/launch_nemo_gym_multinode_training.sh
- examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
- examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
- docs/design-docs/dependency-management.md
- pyproject.toml
- nemo_rl/environments/nemo_gym.py
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/utils/logger.py
- nemo_rl/data/packing/algorithms.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
- examples/nemo_gym/launch_nemo_gym_multinode_training.sh
- nemo_rl/environments/nemo_gym.py
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/utils/logger.py
- nemo_rl/data/packing/algorithms.py
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/design-docs/dependency-management.md
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
- nemo_rl/environments/nemo_gym.py
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/utils/logger.py
- nemo_rl/data/packing/algorithms.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with
@ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
- nemo_rl/environments/nemo_gym.py
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/utils/logger.py
- nemo_rl/data/packing/algorithms.py
🧠 Learnings (5)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
examples/nemo_gym/launch_nemo_gym_multinode_training.sh
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).
Applied to files:
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
Applied to files:
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Applied to files:
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (4)
nemo_rl/utils/memory_tracker.py (2)
- `MemoryTracker` (54-80), `snapshot_start_of_stage` (61-80)

nemo_rl/algorithms/utils.py (1)
- `calculate_baseline_and_std_per_prompt` (79-156)

nemo_rl/models/policy/interfaces.py (2)
- `get_logprobs` (54-66), `get_reference_policy_logprobs` (69-83)

nemo_rl/distributed/batched_data_dict.py (1)
- `BatchedDataDict` (75-860)
nemo_rl/utils/logger.py (1)
tests/unit/utils/test_logger.py (1)
log_metrics(746-750)
nemo_rl/data/packing/algorithms.py (1)
tests/unit/data/packing/test_algorithms.py (1)
bin_capacity(63-65)
🪛 Ruff (0.14.11)
nemo_rl/utils/logger.py
148-148: Unused method argument: step_finished
(ARG002)
209-209: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.out"
(S108)
210-210: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.err"
(S108)
422-422: Unused method argument: step_finished
(ARG002)
800-800: Unused method argument: step_finished
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (21)
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (2)
40-46: LGTM on the optional NUM_SLURM_NODES parameter. The parameter expansion `${NUM_SLURM_NODES:-$NUM_ACTOR_NODES}` correctly defaults to `NUM_ACTOR_NODES` when `NUM_SLURM_NODES` is unset or empty. This is a clean way to make the parameter optional while maintaining backward compatibility.
25-27: HF_TOKEN integration looks correct. The token is properly propagated to the environment for the `uv run python` command. Using `uv run` follows the coding guidelines.

docs/design-docs/dependency-management.md (1)
164-196: LGTM — naming alignment looks correct.
No concerns with the updated NeMo Gym references.nemo_rl/environments/nemo_gym.py (1)
151-154: LGTM! Early input validation with informative error message. The type assertion ensures that malformed or error responses from NeMo Gym are caught early with a clear, actionable error message that includes the actual value received.
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml (2)
1-11: LGTM! Well-documented configuration for MoE model GRPO training. The configuration properly leverages new PR features (`calculate_advantages_on_gpu: true`) and includes helpful comments explaining the rationale for parameter choices, particularly around MoE-specific data volume requirements.

70-99: Megatron configuration is well-structured with appropriate parallelism settings. The parallelism configuration (TP=4, CP=2, EP=8) is properly documented with explanatory comments. The MoE-specific settings (router dtype, load balancing, permute fusion) follow best practices for stability.
nemo_rl/utils/memory_tracker.py (1)
42-51: LGTM! Clear and informative memory snapshot output. The formatted output includes CPU memory deltas, new variable tracking, and Ray memory summary in a readable format suitable for debugging.
nemo_rl/utils/logger.py (3)
96-105: Interface signature updated for step completion semantics. Adding `step_finished: bool = False` to the abstract interface enables backends that support explicit step completion (like W&B) to commit metrics at appropriate boundaries. The default `False` maintains backward compatibility.
366-371: LGTM! Explicit commit control for W&B step boundaries. The `step_finished` handling with explicit `commit=True` ensures metrics are flushed at step boundaries, independent of W&B's default behavior. The comment explains the rationale well.
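The commit semantics can be illustrated with a toy backend. This is not the NeMo RL `Logger` API, just a self-contained sketch of the buffered-until-finished behavior described above:

```python
class BufferingBackend:
    """Toy illustration (hypothetical) of step_finished commit semantics.

    Metrics logged with step_finished=False accumulate within a step; passing
    step_finished=True flushes the step, analogous to wandb.log(..., commit=True).
    """

    def __init__(self):
        self._pending = {}    # step -> accumulated metrics
        self.committed = []   # list of (step, metrics) flushed so far

    def log_metrics(self, metrics, step, prefix="", step_metric=None,
                    step_finished=False):
        self._pending.setdefault(step, {}).update(metrics)
        if step_finished:
            # Commit: flush everything buffered for this step
            self.committed.append((step, self._pending.pop(step)))
```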
204-210: Hardcoded Ray session path is acceptable for a debug-only feature, but consider using `os.path.realpath()` for robustness. The hardcoded `/tmp/ray/session_latest/logs/` path is Ray's standard session directory location and is flagged by static analysis (S108). Since this code is gated behind `RAY_BACKEND_LOG_LEVEL=debug`, the security risk is minimal. However, for improved robustness, consider using `os.path.realpath("/tmp/ray/session_latest")` to resolve the symlink programmatically rather than hardcoding the path.

nemo_rl/algorithms/loss_functions.py (1)
171-173: LGTM: conditional reference-logprobs extraction. The guard avoids unnecessary slicing when KL penalty is zero and matches the skip flag behavior.
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (4)
204-204: Tool parser placeholder is clear. The inline note makes it obvious where to set the Nemotron Nano v2 plugin path.

230-231: Dataset path updates look consistent. Paths now point into the Gym workspace layout.

239-248: Trajectory-collection toggle + multi-env hints are clear. Nice to expose the trajectory-only mode and the multi-env examples.

257-260: code_gen worker scaling looks good. The per-node scaling via `${mul:64, ${cluster.num_nodes}}` matches cluster sizing.

nemo_rl/algorithms/grpo.py (6)
74-75: MemoryTracker integration and early-stage snapshots look solid. Import/initialization plus the early-stage snapshots and the skip-reference-logprobs guard make the loop easier to observe.
Also applies to: 1044-1151
993-1015: Centralized mixed-reward/advantage logging is great. Wrapping histograms and derived metrics in a helper keeps logging consistent across backends.
1224-1225: Good cleanup after NeMo-Gym rollout + mean-gen token capture. Deleting the rollout result early and caching `mean_gen_tokens_per_sample` keeps memory pressure down while preserving logging.

Also applies to: 1276-1281
1359-1400: Mixed-reward logging call + tensor cleanup look good. Nice to log baseline/advantage stats and then promptly delete large tensors.

1428-1456: Isolated logprob_data + conditional ref-logprobs is tidy. Building a minimal logprob_data and skipping reference logprobs when configured keeps memory/time down.

1458-1463: Stage snapshots, step_finished logging, and cleanup are well placed. The added snapshots around validation/metrics/checkpointing/logging plus the step_finished timing log and final deletions improve observability and memory hygiene.
Also applies to: 1488-1501, 1521-1531, 1598-1601, 1745-1747, 1783-1789, 1795-1816
📝 Walkthrough

This PR removes Gym-workspace packaging artifacts, updates naming documentation from "Penguin" to "NeMo Gym", introduces GPU-based advantage calculation in GRPO training with memory tracking instrumentation, optimizes the packing algorithm from O(n·m) to O(n·log n), and enhances logging with conditional metrics and step-level tracking.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/utils/logger.py (1)
926-938: Update mock logger signature in test file. The `log_metrics` mock at `tests/unit/utils/test_logger.py:746` is missing the `step_finished` parameter. Update it to match the backend signatures:

```python
def log_metrics(self, metrics, step, prefix="", step_metric=None, step_finished=False):
```

All backend implementations (TensorBoard, WandB, MLflow, Neptune) and the LoggerHandler already include this parameter.
🤖 Fix all issues with AI agents
In `@examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml`:
- Line 91: Update the inline comment on the moe_router_load_balancing_type
setting to correct the year typo from "Jan 06, 2025" to "Jan 06, 2026" so the
comment matches the PR date; locate the comment attached to
moe_router_load_balancing_type: none and change only the year in the comment
text.
In `@examples/nemo_gym/launch_nemo_gym_multinode_training.sh`:
- Around line 25-27: The script is currently embedding HF_TOKEN directly into
the constructed command (see HF_TOKEN in the HF_HOME/HF_TOKEN/WANDB_API_KEY
block and the echoed command at line 36); instead export or pass HF_TOKEN to
sbatch via sbatch's environment export (e.g., use --export=HF_TOKEN or export
HF_TOKEN before calling sbatch) and remove the token from the echoed command
string so it is not in logs or process listings, then reference $HF_TOKEN inside
the ray.sub environment setup rather than embedding the literal token in the
command.
In `@nemo_rl/algorithms/grpo.py`:
- Around line 1299-1313: The code currently hardcodes device_id = 0 before
calling calculate_baseline_and_std_per_prompt; replace that with a proper device
selection that respects the current CUDA context or configuration. Use
torch.cuda.current_device() (or, if CUDA_VISIBLE_DEVICES remapping is needed,
call device_id_to_physical_device_id(torch.cuda.current_device())) to obtain the
correct device, or read a device index from master_config["grpo"] (e.g.,
"gpu_device") and map it with device_id_to_physical_device_id when present; then
pass that device to input_ids.cuda(...), rewards.cuda(...), and
torch.ones_like(...). Ensure the new logic preserves the existing branch
behavior when calculate_advantages_on_gpu is true and falls back to CPU when
CUDA is unavailable.
In `@nemo_rl/data/packing/algorithms.py`:
- Around line 631-649: The loop that packs `leftovers` leaves `ffd_bins = [[]]`
when `leftovers` is empty, causing an empty bin to be appended to `bins`; change
the logic around `ffd_bins`/`ffd_bin_sizes` (used in the loop with `for idx,
size in sorted(leftovers, ...)`, `bisect`, and `self.bin_capacity`) so you
either initialize `ffd_bins = []`/`ffd_bin_sizes = []` or skip
`bins.extend(ffd_bins)` when `ffd_bins` contains only an empty list, ensuring no
empty bin is added when `leftovers` is empty; also replace the current
pop(0)/insert usage (which makes the routine O(n^2)) with a proper O(n log n)
data structure (e.g., `heapq` or `sortedcontainers.SortedList`) for maintaining
bins by size to restore the claimed complexity; finally, either remove or add a
clear explanatory comment for the previously commented-out block around the
earlier lines so reviewers know why it was disabled.
In `@nemo_rl/utils/memory_tracker.py`:
- Line 1: Update the file header to reflect the current year by changing the
copyright year from 2025 to 2026 in the copyright header at the top of
nemo_rl/utils/memory_tracker.py; locate the top-of-file comment line that
currently reads the 2025 copyright and replace just the year with 2026 so the
header is up-to-date.
- Around line 30-40: The properties mem_used_diff_gb and new_variables can raise
TypeError when the "after" fields are None; update mem_used_diff_gb to guard
memory_used_after_stage_gb and memory_used_before_stage_gb (e.g., treat missing
values as 0.0 and return a float) and update new_variables to guard
variables_after_stage and variables_before_stage (e.g., treat None as empty
list/set and return an empty list when there are no new vars). Modify the
property implementations for mem_used_diff_gb, new_variables and reference the
fields memory_used_after_stage_gb, memory_used_before_stage_gb,
variables_after_stage, and variables_before_stage to perform these null checks
and return safe defaults.
🧹 Nitpick comments (7)
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (1)
1-1: Update copyright year to 2026. The copyright header shows 2025, but per coding guidelines it should include the current year (2026).

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION. All rights reserved.
```

examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml (2)
26-26: Stale date in comment. The comment references "Mon Oct 13" without a year. Given the PR is from January 2026, consider updating for clarity.

```diff
-  # As of Mon Oct 13, token level loss as formulated in the GSPO paper is not yet supported in NeMo RL.
+  # As of January 2026, token level loss as formulated in the GSPO paper is not yet supported in NeMo RL.
```
103-104: Stale date in comment. Similar to line 26, this comment references "Mon Oct 13" without a year.

```diff
-  # As of Mon Oct 13, we default to 2e-6 here, but it's possible this value may increase/decrease depending on our subsequent observations.
+  # As of January 2026, we default to 2e-6 here, but it's possible this value may increase/decrease depending on our subsequent observations.
```

nemo_rl/data/packing/algorithms.py (1)
616-628: Consider removing the commented-out original implementation. Per coding guidelines, commented-out code should include a reason for being kept, or be removed before merging. If retained for reference, add a brief note explaining its purpose (e.g., "Kept for complexity comparison" or "Fallback if issues arise"). Otherwise, remove it to reduce clutter since it's preserved in version control.
nemo_rl/utils/memory_tracker.py (1)
54-59: Declare `_process` as a Pydantic `PrivateAttr` for proper handling. Setting `_process` in `model_post_init` without declaring it may cause issues with Pydantic's validation and serialization. Use `PrivateAttr` for private attributes.

Proposed fix

```diff
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, PrivateAttr

 class MemoryTracker(BaseModel):
     data_points: List[MemoryTrackerDataPoint] = Field(default_factory=list)
+    _process: Process = PrivateAttr()

     def model_post_init(self, context):
         self._process = Process(os.getpid())
         return super().model_post_init(context)
```

nemo_rl/algorithms/grpo.py (2)
1577-1580: Consider logging level for the aggregation-skip message. The print statement for skipped aggregation values is helpful for debugging but may be noisy in production. Consider using a debug-level log or removing after stabilization.
1804-1806: Simplify variable existence check. Using `"val_metrics" in dir()` is unconventional. Since `val_metrics` is always assigned earlier in the loop (line 1130), this check is unnecessary. If you need conditional deletion, use `locals()` or a simple try/except.

Proposed simplification

```diff
-    if "val_metrics" in dir():
-        del val_metrics
+    del val_metrics
```

Or if the intent is to handle cases where it might not exist:

```diff
-    if "val_metrics" in dir():
-        del val_metrics
+    try:
+        del val_metrics
+    except NameError:
+        pass
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- 3rdparty/Gym-workspace/is_nemo_gym_installed.py
- 3rdparty/Gym-workspace/pyproject.toml
- 3rdparty/Gym-workspace/setup.py
- docs/design-docs/dependency-management.md
- examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
- examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
- examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
- examples/nemo_gym/launch_nemo_gym_multinode_training.sh
- nemo_rl/algorithms/grpo.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/data/packing/algorithms.py
- nemo_rl/environments/nemo_gym.py
- nemo_rl/utils/logger.py
- nemo_rl/utils/memory_tracker.py
- pyproject.toml
💤 Files with no reviewable changes (4)
- examples/nemo_gym/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
- 3rdparty/Gym-workspace/setup.py
- 3rdparty/Gym-workspace/is_nemo_gym_installed.py
- 3rdparty/Gym-workspace/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (6)
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/design-docs/dependency-management.md
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- docs/design-docs/dependency-management.md
- nemo_rl/utils/memory_tracker.py
- examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
- nemo_rl/algorithms/loss_functions.py
- examples/nemo_gym/launch_nemo_gym_multinode_training.sh
- nemo_rl/algorithms/grpo.py
- pyproject.toml
- nemo_rl/environments/nemo_gym.py
- nemo_rl/data/packing/algorithms.py
- examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
- nemo_rl/utils/logger.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/environments/nemo_gym.py
- nemo_rl/data/packing/algorithms.py
- nemo_rl/utils/logger.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with
@ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/environments/nemo_gym.py
- nemo_rl/data/packing/algorithms.py
- nemo_rl/utils/logger.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
- nemo_rl/utils/memory_tracker.py
- nemo_rl/algorithms/loss_functions.py
- examples/nemo_gym/launch_nemo_gym_multinode_training.sh
- nemo_rl/algorithms/grpo.py
- nemo_rl/environments/nemo_gym.py
- nemo_rl/data/packing/algorithms.py
- nemo_rl/utils/logger.py
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
examples/nemo_gym/launch_nemo_gym_multinode_training.sh
🧠 Learnings (5)
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).
Applied to files:
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
Applied to files:
examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
examples/nemo_gym/launch_nemo_gym_multinode_training.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Applied to files:
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (2)
nemo_rl/utils/memory_tracker.py (2)
- `MemoryTracker` (54-80)
- `snapshot_start_of_stage` (61-80)

nemo_rl/models/policy/interfaces.py (2)

- `get_logprobs` (54-66)
- `get_reference_policy_logprobs` (69-83)
nemo_rl/data/packing/algorithms.py (1)
tests/unit/data/packing/test_algorithms.py (1)
- `bin_capacity` (63-65)
nemo_rl/utils/logger.py (1)
tests/unit/utils/test_logger.py (1)
- `log_metrics` (746-750)
🪛 Ruff (0.14.11)
nemo_rl/utils/logger.py
148-148: Unused method argument: step_finished
(ARG002)
209-209: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.out"
(S108)
210-210: Probable insecure usage of temporary file or directory: "/tmp/ray/session_latest/logs/raylet.err"
(S108)
422-422: Unused method argument: step_finished
(ARG002)
800-800: Unused method argument: step_finished
(ARG002)
🔇 Additional comments (28)
docs/design-docs/dependency-management.md (1)
164-164: LGTM! Terminology updated consistently. The replacement of "Penguin" with "NeMo Gym" is correctly applied in both locations, maintaining consistency throughout the documentation.
Also applies to: 196-196
examples/nemo_gym/launch_nemo_gym_multinode_training.sh (1)
40-46: LGTM - Good use of default parameter expansion. The `${NUM_SLURM_NODES:-$NUM_ACTOR_NODES}` pattern provides a sensible default when `NUM_SLURM_NODES` is not set, allowing flexibility while maintaining backward compatibility.

examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml (3)
17-19: Verify symmetric clipping is intentional. Both `ratio_clip_min` and `ratio_clip_max` are set to `3e-4`, which produces symmetric clipping. This is atypical - usually these define an asymmetric range. Please confirm this is the intended configuration for GSPO.
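To make the symmetry point concrete: PPO/GSPO-style clipping keeps the importance ratio inside `[1 - ratio_clip_min, 1 + ratio_clip_max]`, so equal values give a band centered on 1. A minimal sketch (hypothetical helper, not the NeMo RL implementation):

```python
def clipped_ratio(ratio: float, eps_min: float, eps_max: float) -> float:
    """Clamp an importance ratio to the band [1 - eps_min, 1 + eps_max].

    With eps_min == eps_max (e.g. both 3e-4, as in this config) the band is
    symmetric around 1; unequal values would widen only one side.
    """
    return max(1.0 - eps_min, min(1.0 + eps_max, ratio))
```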
70-99: Well-structured MoE parallelism configuration. The Megatron configuration for this MoE model is comprehensive and internally consistent:

- `tensor_model_parallel_size: 4` × `context_parallel_size: 2` = 8 GPUs per node ✓
- `expert_model_parallel_size: 8` matches GPUs per node
- `sequence_parallel: true` correctly enabled for expert parallel
- Performance optimizations (`moe_permute_fusion`, `apply_rope_fusion`, `bias_activation_fusion`) are enabled with documented speedup expectations
147-155: Thorough checkpointing timing rationale. The checkpoint timing configuration is well-documented with clear reasoning for the 3.5-hour `checkpoint_must_save_by` value given the 4-hour SLURM timeout, validation overhead, step time, and checkpoint save duration.

nemo_rl/data/packing/algorithms.py (1)
21-21: LGTM! The `bisect` import is appropriate for the new sorted-insertion approach.

pyproject.toml (1)
181-181: LGTM! The workspace member path update to `3rdparty/Gym-workspace/Gym` aligns with the broader Gym submodule reorganization described in the PR.

nemo_rl/algorithms/loss_functions.py (1)
171-172: LGTM! The conditional guard correctly avoids loading `reference_policy_logprobs` when KL penalty is zero. The variable is only used within the corresponding `if self.reference_policy_kl_penalty != 0:` block at line 275, so there's no risk of an undefined variable error.

nemo_rl/environments/nemo_gym.py (1)
151-153: LGTM! Good defensive check to catch non-successful NeMo Gym responses early. This provides a clear error message instead of a cryptic `KeyError` or `TypeError` when accessing nested dict keys later in the function.

examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (3)
230-231: LGTM! Data paths correctly updated to the new `3rdparty/Gym-workspace/Gym/data/` structure, consistent with the workspace reorganization.
239-248: LGTM! The `is_trajectory_collection` flag is well-documented with inline guidance. The commented multi-environment config paths include a clear explanatory comment about their usage with `ng_prepare_data`.
257-260: LGTM! New `code_gen` section properly configures `num_processes` using the `${mul:}` resolver for scaling with cluster size.

nemo_rl/utils/logger.py (4)
102-103: LGTM! The `step_finished` parameter provides a clean interface extension for signaling step completion to logging backends. The default value of `False` maintains backward compatibility.
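The buffering semantics behind `step_finished` can be sketched without any logging backend; this hypothetical `StepBufferedLogger` (names and structure are assumptions, not the NeMo RL class) accumulates metrics for the current step and flushes them only when the step ends:

```python
class StepBufferedLogger:
    """Buffer metrics for the current step; flush when step_finished=True."""

    def __init__(self):
        self._pending = {}   # metrics accumulated for the in-flight step
        self.committed = []  # one dict per completed step

    def log_metrics(self, metrics, step_finished=False):
        self._pending.update(metrics)
        if step_finished:
            # Step boundary: commit everything accumulated so far.
            self.committed.append(self._pending)
            self._pending = {}
```

A backend like W&B would map the flush to `commit=True`; here it just appends to a list so the behavior is easy to inspect.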
204-210: Raylet log upload conditional on debug mode. The hardcoded `/tmp/ray/session_latest/logs/` path is Ray's standard log location, so this is acceptable. The conditional guard on `RAY_BACKEND_LOG_LEVEL=debug` ensures this only triggers during explicit debugging sessions.
366-371: LGTM! The explicit `commit=True` when `step_finished` is set ensures W&B flushes metrics at step boundaries. The comment clearly documents the rationale for overriding W&B's default commit behavior.
975-991: LGTM! Clean utility method for appending string lists to JSONL files. The append mode (`'a'`) is appropriate for incremental logging across steps.

nemo_rl/utils/memory_tracker.py (1)
61-80: LGTM! The `snapshot_start_of_stage` method correctly handles memory tracking by capturing RSS memory, updating the previous data point's after-stage metrics, and appending a new data point for the current stage.

nemo_rl/algorithms/grpo.py (2)
128-128: LGTM! The new `skip_reference_policy_logprobs_calculation` config option is properly typed as `NotRequired[bool]` and enables an optimization when KL penalty is zero.

143-144: LGTM! The `calculate_advantages_on_gpu` config option is properly documented and typed.
908-915: LGTM! Simple and clear helper function for checking the NeMo Gym response logging configuration.

993-1015: LGTM! The centralized logging helper properly uses the logger's `log_histogram` method (addressing the past review feedback) and calculates useful metrics for baseline rewards and advantages.
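As a sketch of the kind of baseline metric mentioned (e.g. `baseline_reward/pct_0`), assuming rewards are grouped per prompt; the grouping and names here are assumptions for illustration, not the NeMo RL code:

```python
def baseline_reward_metrics(rewards_per_prompt):
    """Group-mean (baseline) reward per prompt, plus the fraction of
    prompts whose baseline is exactly zero ("pct_0")."""
    baselines = [sum(group) / len(group) for group in rewards_per_prompt]
    pct_0 = sum(1 for b in baselines if b == 0.0) / len(baselines)
    return baselines, pct_0
```

A high `pct_0` is a useful training-health signal: prompts whose whole rollout group earned zero reward contribute zero-advantage samples under a group-mean baseline.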
1437-1456: LGTM! The conditional skip of reference policy logprobs calculation is properly guarded by the config check, and the early assertion (lines 1056-1057) ensures this optimization is only used when KL penalty is zero.

1783-1789: LGTM! The `step_finished=True` parameter correctly signals the end of the training step for the logger, addressing the step-level tracking mentioned in the PR objectives.
1044-1044: LGTM! The MemoryTracker is properly instantiated and used throughout the training loop to track memory usage across different phases.
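The tracker's stage-snapshot pattern can be sketched with an injected memory reader (the real class reads process RSS, e.g. via psutil); class and method names here mirror the review but the implementation is illustrative:

```python
class StageMemoryTracker:
    """Record memory at the start of each stage and attribute the delta
    to the stage that just finished."""

    def __init__(self, read_memory):
        self._read_memory = read_memory  # callable returning current RSS bytes
        self.data_points = []  # (stage_name, rss_at_start, rss_after_stage)

    def snapshot_start_of_stage(self, stage_name):
        rss = self._read_memory()
        if self.data_points:
            # Close out the previous stage with the memory observed now.
            name, start, _ = self.data_points[-1]
            self.data_points[-1] = (name, start, rss)
        self.data_points.append((stage_name, rss, None))
```

Injecting the reader keeps the sketch dependency-free and testable; in production it would be something like `psutil.Process(os.getpid()).memory_info().rss`.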
1224-1230: LGTM! The opt-out logic for NeMo Gym response logging correctly removes large `full_result` fields when logging is disabled, reducing runtime overhead for large rollouts.
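A minimal sketch of that opt-out (the `full_result` field name comes from the review; the function shape is hypothetical):

```python
def strip_large_fields(rollouts, log_responses: bool):
    """Drop heavyweight per-sample payloads when response logging is off."""
    if not log_responses:
        for r in rollouts:
            # pop with a default so samples without the field are untouched
            r.pop("full_result", None)
    return rollouts
```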
1673-1688: LGTM! The conditional logging correctly skips detailed train data logging when NeMo Gym response logging is disabled.

1367-1369: LGTM! Explicit deletions after variables are no longer needed helps reduce CPU memory footprint during the training loop, as noted in the PR objectives.

1056-1060: LGTM! The assertion ensures the `skip_reference_policy_logprobs_calculation` optimization is only used when KL penalty is zero, with a clear informative message.
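The guard amounts to a simple invariant check; a hedged sketch (hypothetical signature, mirroring the config names from the review):

```python
def check_skip_reference_logprobs(skip: bool, kl_penalty: float) -> None:
    """Skipping reference-policy logprobs is only sound when the KL penalty
    is zero, because the KL term is their only consumer."""
    if skip and kl_penalty != 0:
        raise AssertionError(
            "skip_reference_policy_logprobs_calculation=True requires "
            "reference_policy_kl_penalty == 0"
        )
```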
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
- `step_finished` logic in nemo_rl/utils/logger.py and nemo_rl/algorithms/grpo.py
- `log_string_list_as_jsonl` logic in nemo_rl/utils/logger.py
- `should_log_nemo_gym_responses` logic in nemo_rl/algorithms/grpo.py
- `skip_reference_policy_logprobs_calculation` logic in nemo_rl/algorithms/grpo.py
- `calculate_advantages_on_gpu` in nemo_rl/algorithms/grpo.py
- `del *` variable deletions in nemo_rl/algorithms/grpo.py
- `baseline_reward/pct_0` related logic in nemo_rl/algorithms/grpo.py

Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.