Conversation
📝 Walkthrough

This PR introduces GRPO training support for the Penguin environment in NeMo RL. Changes include a YAML configuration file, a main orchestration script for GRPO training, sanity test automation, tokenizer and timing integration into the Penguin rollout APIs, timing metric collection in rollout processing, and corresponding test updates.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant MainScript as run_grpo_penguin.py
    participant Config as Hydra Config
    participant Dataset as Datasets
    participant Ray
    participant PenguinActor
    participant TrainingLoop as Training/Collection
    User->>MainScript: python run_grpo_penguin.py --config=...
    MainScript->>Config: Load YAML + parse overrides
    Config-->>MainScript: MasterConfig
    MainScript->>Dataset: setup_single_penguin_dataset()
    Dataset-->>MainScript: train_dataset, val_dataset
    MainScript->>Ray: init_ray()
    Ray-->>MainScript: Ray initialized
    MainScript->>MainScript: setup(config, datasets...)
    MainScript-->>MainScript: policy, dataloader, loss_fn, etc.
    alt is_trajectory_collection == True
        MainScript->>PenguinActor: Create with runtime_env
        PenguinActor-->>MainScript: health check passed
        MainScript->>MainScript: collect_trajectories()
        MainScript->>PenguinActor: run_async_penguin_rollout(tokenizer, timer_prefix)
        PenguinActor-->>MainScript: rollout results + timing_metrics
        MainScript->>Dataset: Log to trajectory_collection.jsonl
    else
        MainScript->>TrainingLoop: grpo_train(policy, dataloader, ...)
        TrainingLoop-->>MainScript: training complete
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🧹 Nitpick comments (2)
examples/penguin/run_grpo_penguin.py (1)
216-222: Consider extracting the long error message to an exception class or docstring.

While the detailed error message is helpful, consider moving it to a custom exception class or a module-level constant for better maintainability.
Example:
```python
class UnsupportedPenguinConfigError(ValueError):
    """A non-null `grpo.max_val_samples` parameter is not supported.

    Gym principle is that there is no hidden data pre or post processing from you.
    What you see is what you get. The validation set you pass in will directly be
    used for validation with no additional preprocessing. If you want to have some
    number of repetitions, please include that in your dataset, via `num_repeats`,
    in your dataset config, and `ng_prepare_data` will prepare it accordingly.
    """

# Then use:
raise UnsupportedPenguinConfigError()
```

nemo_rl/experience/rollouts.py (1)
1036-1102: LGTM! Metric aggregation and per-agent logging properly implemented.

The timing instrumentation around metric aggregation and per-agent metric calculation provides good observability. The full result logging via `wandb.Table` enables detailed debugging.
Consider adding `strict=True` to the zip on line 1076 for defensive programming, though the paired nature of `penguin_rows` and `results` makes this low risk:

```diff
- for penguin_row, result in zip(penguin_rows, results):
+ for penguin_row, result in zip(penguin_rows, results, strict=True):
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- `examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml` (1 hunks)
- `examples/penguin/run_grpo_penguin.py` (1 hunks)
- `examples/penguin/run_penguin_single_node_sanity_tests.sh` (1 hunks)
- `nemo_rl/environments/penguin.py` (4 hunks)
- `nemo_rl/experience/rollouts.py` (3 hunks)
- `tests/unit/environments/test_penguin.py` (3 hunks)
- `tests/unit/experience/test_rollouts.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- `nemo_rl/experience/rollouts.py`
- `nemo_rl/environments/penguin.py`
- `tests/unit/experience/test_rollouts.py`
- `tests/unit/environments/test_penguin.py`
- `examples/penguin/run_grpo_penguin.py`
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
- `nemo_rl/experience/rollouts.py`
- `nemo_rl/environments/penguin.py`
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use `uv run` to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling `python` directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts
Files:
examples/penguin/run_penguin_single_node_sanity_tests.sh
🧠 Learnings (5)
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
Applied to files:
- `nemo_rl/experience/rollouts.py`
- `examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml`
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Applied to files:
nemo_rl/environments/penguin.py
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.
Applied to files:
examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
examples/penguin/run_penguin_single_node_sanity_tests.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
examples/penguin/run_penguin_single_node_sanity_tests.sh
🧬 Code graph analysis (5)
nemo_rl/experience/rollouts.py (2)
nemo_rl/utils/timer.py (5)
- `Timer` (22-248)
- `start` (79-83)
- `time` (110-123)
- `stop` (85-107)
- `get_timing_metrics` (196-233)

nemo_rl/environments/penguin.py (1)
- `run_rollouts` (106-138)
nemo_rl/environments/penguin.py (1)
nemo_rl/utils/timer.py (3)
- `Timer` (22-248)
- `time` (110-123)
- `get_timing_metrics` (196-233)
examples/penguin/run_penguin_single_node_sanity_tests.sh (1)
tests/unit/environments/test_penguin.py (1)
- `penguin` (80-130)
tests/unit/environments/test_penguin.py (1)
nemo_rl/environments/penguin.py (1)
- `run_rollouts` (106-138)
examples/penguin/run_grpo_penguin.py (13)
nemo_rl/models/policy/interfaces.py (1)
- `ColocatablePolicyInterface` (141-168)

nemo_rl/models/generation/interfaces.py (1)
- `GenerationInterface` (215-254)

nemo_rl/utils/logger.py (2)
- `Logger` (804-1039)
- `get_next_experiment_dir` (1328-1362)

nemo_rl/algorithms/grpo.py (2)
- `_should_use_penguin` (728-748)
- `refit_policy_generation` (751-822)

nemo_rl/algorithms/utils.py (1)
- `get_tokenizer` (184-315)

nemo_rl/data/datasets/processed_dataset.py (1)
- `AllTaskProcessedDataset` (31-126)

nemo_rl/data/interfaces.py (1)
- `DatumSpec` (32-40)

nemo_rl/distributed/ray_actor_environment_registry.py (1)
- `get_actor_python_env` (49-64)

nemo_rl/distributed/virtual_cluster.py (1)
- `init_ray` (85-171)

nemo_rl/environments/penguin.py (5)
- `Penguin` (34-209)
- `PenguinConfig` (27-30)
- `penguin_example_to_nemo_rl_datum_spec` (235-248)
- `setup_penguin_config` (217-226)
- `health_check` (103-104)

nemo_rl/experience/rollouts.py (1)
- `run_async_penguin_rollout` (958-1145)

nemo_rl/models/generation/__init__.py (1)
- `configure_generation_config` (25-54)

nemo_rl/utils/config.py (1)
- `parse_hydra_overrides` (146-166)
🪛 Ruff (0.14.3)
nemo_rl/experience/rollouts.py
1076-1076: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
examples/penguin/run_grpo_penguin.py
103-103: Do not assign a lambda expression, use a def
Rewrite passthrough_task_processor as a def
(E731)
103-103: Unused lambda argument: args
(ARG005)
103-103: Unused lambda argument: kwargs
(ARG005)
216-222: Avoid specifying long messages outside the exception class
(TRY003)
239-239: Unpacked variable cluster is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
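A hedged sketch of the E731/ARG005 fix flagged above. The `passthrough_task_processor` name comes from the Ruff output, but its body is not shown in this review, so the no-op body here is an assumption:

```python
# Flagged pattern (E731: lambda assignment; ARG005: unused lambda arguments):
# passthrough_task_processor = lambda *args, **kwargs: None

# Preferred form: a def, with leading underscores marking intentionally
# unused arguments (placeholder body, assumed for illustration).
def passthrough_task_processor(*_args, **_kwargs):
    """No-op task processor that accepts and ignores any arguments."""
    return None


print(passthrough_task_processor("datum", key="value"))  # None
```

The underscore prefix also silences the unused-argument warning without changing the call sites.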
🪛 Shellcheck (0.11.0)
examples/penguin/run_penguin_single_node_sanity_tests.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (12)
tests/unit/experience/test_rollouts.py (1)
785-792: LGTM! Timing metrics properly integrated.

The new timing metric keys align with the Timer instrumentation added to the rollout flow. The test correctly normalizes these values to None during comparison, avoiding flakiness from non-deterministic timing.
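The normalization the review describes can be sketched generically (hypothetical metric keys; this is not the actual test code):

```python
def normalize_timing(metrics: dict, prefix: str = "timing/") -> dict:
    """Replace non-deterministic timing values with None so comparisons are stable."""
    return {k: (None if k.startswith(prefix) else v) for k, v in metrics.items()}


# Timing values vary run to run; rewards and counts do not.
actual = {"reward": 1.0, "timing/rollout/total_time": 0.0137}
expected = {"reward": 1.0, "timing/rollout/total_time": None}
assert normalize_timing(actual) == expected
```

Nulling only the timing namespace keeps the rest of the metrics dict under exact comparison.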
tests/unit/environments/test_penguin.py (2)
145-166: LGTM! Test properly updated for new signature.

The test correctly passes the `penguin_tokenizer` fixture to `run_rollouts` and handles the updated return type (a tuple of results and timing metrics). The empty `timer_prefix` is appropriate in a test context.
169-191: LGTM! Proper normalization for reproducible tests.

The `token_ids` conversion and dummy value substitution for the `prompt_str`/`generation_str` fields ensure deterministic test comparisons while accommodating the new decoded string fields added to penguin results.

nemo_rl/experience/rollouts.py (4)
51-51: LGTM! Timer instrumentation properly initialized.

The Timer import and initialization follow the correct pattern. The `timer_prefix` convention ("timing/rollout") provides clear namespacing for timing metrics.

Also applies to: 991-993
1007-1013: LGTM! Rollout timing properly captured.

The timer context around `penguin_environment.run_rollouts` correctly measures end-to-end rollout time, and the tuple unpacking matches the updated Penguin environment signature.
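The timing pattern praised here can be sketched with a minimal stand-in timer (this `SketchTimer` is an illustration, not the actual `nemo_rl.utils.timer.Timer`):

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SketchTimer:
    """Minimal stand-in for a rollout timer: named spans, cumulative metrics."""

    def __init__(self):
        self._durations = defaultdict(list)

    @contextmanager
    def time(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self._durations[name].append(time.perf_counter() - start)

    def get_timing_metrics(self, prefix: str = "") -> dict:
        return {f"{prefix}/{k}" if prefix else k: sum(v) for k, v in self._durations.items()}


timer = SketchTimer()
with timer.time("env_rollout"):
    time.sleep(0.01)  # stands in for the penguin_environment.run_rollouts(...) call
metrics = timer.get_timing_metrics(prefix="timing/rollout")
print(metrics)
```

Wrapping the environment call in a named span is what lets the caller later merge the span's total into the rollout metrics dict.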
1016-1033: LGTM! Metric preparation properly timed.

The timer context around metric preparation provides visibility into preprocessing overhead. The metrics calculation correctly extracts token counts, rewards, and turns from Penguin results.
1108-1109: LGTM! Timer properly finalized and merged.

The timer is correctly stopped and timing metrics are merged into `rollout_metrics` using sum reduction, which is appropriate for cumulative timing measurements across the batch.
1-267: LGTM! Comprehensive GRPO configuration for Penguin.

The configuration properly sets up GRPO training with the Qwen3-4B-Instruct model for Penguin integration. Key settings are correctly configured:

- `async_engine: true` and `expose_http_server: true` (required for Penguin)
- `max_new_tokens` set to the full context length (per codebase pattern)
- Tool parser configured for Qwen 3 4B Instruct
- Data paths point to the Penguin workspace

The configuration aligns with the new Penguin integration in the codebase.
Based on learnings
nemo_rl/environments/penguin.py (4)
19-19: LGTM! Required imports for new functionality.

The `PreTrainedTokenizerBase` import enables tokenizer-aware postprocessing, and the `Timer` import supports the new timing instrumentation in rollout execution.

Also applies to: 24-24
51-68: LGTM! More defensive config handling with helpful diagnostics.

The use of `get()` with a fallback and `setdefault()` for aiohttp limits makes the initialization more robust. The diagnostic print provides visibility into connection limit settings, which is helpful for debugging concurrency issues.
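The `get()`/`setdefault()` pattern being praised can be illustrated generically (the key names below are stand-ins, not the actual Penguin config keys):

```python
config = {"penguin": {}}

# get() with a fallback avoids a KeyError when the sub-config is absent.
penguin_cfg = config.get("penguin", {})

# setdefault() fills a key only when the caller did not already provide one,
# shown here with aiohttp-style connection limits (hypothetical keys).
penguin_cfg.setdefault("aiohttp_connection_limit", 100)
penguin_cfg.setdefault("aiohttp_connection_limit_per_host", 10)

# A caller-supplied value would survive untouched:
penguin_cfg.setdefault("aiohttp_connection_limit", 999)
print(penguin_cfg["aiohttp_connection_limit"])  # 100
```

Caller-provided values always win, which is what makes `setdefault()` safer than unconditional assignment for filling in limits.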
106-138: LGTM! Comprehensive timing instrumentation added.

The updated signature properly accepts a `tokenizer` and `timer_prefix`, enabling downstream string decoding and timing metrics. The timing instrumentation provides excellent observability:
- Per-task await and postprocessing times
- Total time tracking
- Postprocess percentage calculation
The tuple return `(nemo_rl_results, timing_metrics)` is properly handled by callers.
140-198: LGTM! Tokenizer integration with memory optimization.

The postprocessing now accepts a tokenizer to decode token IDs into human-readable strings. The approach of:
- Converting token IDs to torch tensors for NeMo RL compatibility
- Decoding to strings for logging
- Popping large tensor fields from the result
...provides a good balance between functionality and memory efficiency. The `prompt_str` and `generation_str` fields improve debugging without bloating logs with full token ID tensors.
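A rough sketch of the decode-then-pop flow described above, using a stub in place of the real `PreTrainedTokenizerBase` (the field names are illustrative, and the actual code also converts IDs to torch tensors):

```python
class StubTokenizer:
    """Stand-in for a HF tokenizer's decode(); maps ids to placeholder tokens."""

    def decode(self, token_ids, skip_special_tokens=True):
        return " ".join(f"tok{i}" for i in token_ids)


def postprocess_result(result: dict, tokenizer) -> dict:
    """Decode token ids to strings for logging, then drop the bulky id fields."""
    result["prompt_str"] = tokenizer.decode(result["prompt_token_ids"])
    result["generation_str"] = tokenizer.decode(result["generation_token_ids"])
    # Pop the large fields so logged results stay small.
    result.pop("prompt_token_ids")
    result.pop("generation_token_ids")
    return result


out = postprocess_result(
    {"prompt_token_ids": [1, 2], "generation_token_ids": [3], "reward": 1.0},
    StubTokenizer(),
)
print(out)
```

Decoding once at postprocess time means logs carry readable text while the tensors needed for training are kept separately.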
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
This PR:
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit