
feat: Add Penguin run #1481

Merged
terrykong merged 5 commits into main from bxyu/run-penguin
Nov 7, 2025

Conversation


@bxyu-nvidia bxyu-nvidia commented Nov 6, 2025

What does this PR do?

This PR:

  1. Contributes the run_grpo_penguin.py example file along with one example config for Qwen 3 4B Instruct
  2. Adds the Penguin sanity test util file
  3. Updates the Penguin code and corresponding tests

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added GRPO training support with configuration and training script for the Qwen-3-4B-Instruct model
    • Integrated tokenizer handling for improved generation and result processing
    • Added performance timing instrumentation to rollout operations
    • Added sanity test suite for Penguin environment validation

Signed-off-by: Brian Yu <bxyu@nvidia.com>
@bxyu-nvidia bxyu-nvidia requested review from a team as code owners November 6, 2025 18:43

coderabbitai bot commented Nov 6, 2025

📝 Walkthrough

This PR introduces GRPO training support for the Penguin environment in NeMo RL. Changes include a YAML configuration file, a main orchestration script for GRPO training, sanity test automation, tokenizer and timing integration into Penguin rollout APIs, timing metric collection in rollout processing, and corresponding test updates.

Changes

  • Configuration and Examples (examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml): New YAML configuration file defining GRPO training hyperparameters, model and tokenizer settings, loss functions, optimizer/scheduler configuration, generation backend (vLLM), dataset paths, and cluster hardware specifications.
  • Training and Orchestration (examples/penguin/run_grpo_penguin.py): New Python script orchestrating GRPO training with Penguin; includes CLI arg parsing, config loading, tokenizer/generation setup, Penguin-specific config application, dataset preparation (train/validation from JSONL), Ray initialization, Penguin actor creation with health checks, and branching between trajectory collection and training execution paths.
  • Sanity Testing Automation (examples/penguin/run_penguin_single_node_sanity_tests.sh): New shell script automating sanity test execution; runs environment sync, Ray cleanup, and comprehensive unit tests covering vLLM generation, tokenization, math environments, NeMo Gym, and end-to-end Penguin rollout validation.
  • Penguin Environment API Updates (nemo_rl/environments/penguin.py): Updated the run_rollouts method signature to accept tokenizer and timer_prefix; it now returns a tuple of (results, timing_metrics); updated _postprocess_penguin_to_nemo_rl_result to accept a tokenizer for token-id decoding; hardened global config defaults and added diagnostic logging.
  • Rollout Timing Instrumentation (nemo_rl/experience/rollouts.py): Added Timer-based instrumentation to run_async_penguin_rollout; wraps the metric preparation, aggregation, and per-agent metrics sections with timing contexts; timing metrics are merged into rollout_metrics; removed the _tensorize_by_key utility; enhanced per-agent result logging.
  • Test Updates (tests/unit/environments/test_penguin.py, tests/unit/experience/test_rollouts.py): Updated test_penguin_sanity to accept penguin_tokenizer and handle the tuple return from run_rollouts; converts token_ids to lists; standardizes prompt_str/generation_str fields; updated test expectations to include timing metrics under rollout_metrics.
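The run_rollouts API change summarized above can be sketched as follows. The (results, timing_metrics) tuple return and the tokenizer/timer_prefix parameters come from this PR's summary; the body is a hypothetical stand-in for the real Penguin rollout logic, and StubTokenizer only mimics a tokenizer's decode method.

```python
import time
from typing import Any


def run_rollouts(
    tasks: list[dict[str, Any]],
    tokenizer: Any,  # anything with a .decode(token_ids) method
    timer_prefix: str,
) -> tuple[list[dict[str, Any]], dict[str, float]]:
    """Sketch of the new signature: returns (results, timing_metrics)."""
    results: list[dict[str, Any]] = []
    start = time.perf_counter()
    for task in tasks:
        token_ids = task["token_ids"]
        results.append(
            {
                "token_ids": token_ids,
                # Tokenizer-aware postprocessing: decode ids for logging.
                "generation_str": tokenizer.decode(token_ids),
            }
        )
    total = time.perf_counter() - start
    timing_metrics = {f"{timer_prefix}/total_time": total}
    return results, timing_metrics


class StubTokenizer:
    def decode(self, ids: list[int]) -> str:
        return " ".join(str(i) for i in ids)


results, timing = run_rollouts(
    [{"token_ids": [1, 2, 3]}], StubTokenizer(), "timing/rollout"
)
```

Callers that previously did `results = env.run_rollouts(...)` must now unpack the tuple, which is why the review emphasizes updating call sites consistently.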

Sequence Diagram

sequenceDiagram
    participant User
    participant MainScript as run_grpo_penguin.py
    participant Config as Hydra Config
    participant Dataset as Datasets
    participant Ray
    participant PenguinActor
    participant TrainingLoop as Training/Collection

    User->>MainScript: python run_grpo_penguin.py --config=...
    MainScript->>Config: Load YAML + parse overrides
    Config-->>MainScript: MasterConfig
    MainScript->>Dataset: setup_single_penguin_dataset()
    Dataset-->>MainScript: train_dataset, val_dataset
    MainScript->>Ray: init_ray()
    Ray-->>MainScript: Ray initialized
    MainScript->>MainScript: setup(config, datasets...)
    MainScript-->>MainScript: policy, dataloader, loss_fn, etc.
    
    alt is_trajectory_collection == True
        MainScript->>PenguinActor: Create with runtime_env
        PenguinActor-->>MainScript: health check passed
        MainScript->>MainScript: collect_trajectories()
        MainScript->>PenguinActor: run_async_penguin_rollout(tokenizer, timer_prefix)
        PenguinActor-->>MainScript: rollout results + timing_metrics
        MainScript->>Dataset: Log to trajectory_collection.jsonl
    else
        MainScript->>TrainingLoop: grpo_train(policy, dataloader, ...)
        TrainingLoop-->>MainScript: training complete
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

  • API signature changes in penguin.py: run_rollouts now requires tokenizer and timer_prefix parameters; return type changed from list[dict] to tuple[list[dict], dict]; _postprocess_penguin_to_nemo_rl_result updated to accept tokenizer—ensure callers are updated consistently
  • Orchestration complexity in run_grpo_penguin.py: orchestrates multiple components (Ray, Penguin actor, config, datasets); branching logic between trajectory collection and training paths requires careful validation
  • Trajectory collection logic: dataset repeat patterns, JSONL loading and logging, async rollout handling with tokenizer integration
  • Timing instrumentation: ensure timer context blocks capture all relevant phases and don't introduce performance regressions
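The timer-context pattern flagged in the last bullet can be illustrated with a minimal stand-in. The real nemo_rl.utils.timer.Timer exposes start/stop/time/get_timing_metrics (per the code-graph section below); this sketch only mirrors the context-manager usage and is not the actual implementation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Timer:
    """Minimal stand-in for a timing-metrics collector."""

    def __init__(self) -> None:
        self._totals = defaultdict(float)

    @contextmanager
    def time(self, label: str):
        # Accumulate wall-clock time for this label across entries.
        start = time.perf_counter()
        try:
            yield
        finally:
            self._totals[label] += time.perf_counter() - start

    def get_timing_metrics(self) -> dict[str, float]:
        return dict(self._totals)


timer = Timer()
with timer.time("rollout"):
    time.sleep(0.01)  # stands in for the wrapped phase, e.g. run_rollouts
with timer.time("metric_prep"):
    pass  # stands in for the metric-preparation section
metrics = timer.get_timing_metrics()
```

Because each context only adds a perf_counter subtraction on exit, the regression risk the reviewer mentions is limited to phases entered very frequently.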

Possibly related PRs

  • PR #1327: Introduces Penguin environment and registration—foundational work that this PR builds upon by extending Penguin APIs
  • PR #1450: Modifies Penguin integration paths in nemo_rl/environments/penguin.py and rollout handling; overlapping changes to the same core modules

Suggested labels

CI:L0

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 46.15%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes (⚠️ Warning): The PR implements major feature additions with public API changes, but the PR description is an empty template with no documented test results, performance data, or regression-testing confirmation. Resolution: update the PR description to document test execution results, confirm existing unit tests pass with no regressions, validate the performance/numeric impact of the timing changes, and provide CI/CD pipeline results demonstrating successful test execution.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'feat: Add Penguin run' directly relates to the changeset's main objective of adding Penguin training and evaluation functionality, including new configuration files, scripts, and environment setup code.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
examples/penguin/run_grpo_penguin.py (1)

216-222: Consider extracting long error message to exception class or docstring.

While the detailed error message is helpful, consider moving it to a custom exception class or module-level constant for better maintainability.

Example:

class UnsupportedPenguinConfigError(ValueError):
    """Raised when a non-null `grpo.max_val_samples` is configured."""

    _MESSAGE = (
        "A non-null `grpo.max_val_samples` parameter is not supported. The Gym "
        "principle is that there is no hidden data pre- or post-processing: what "
        "you see is what you get. The validation set you pass in is used directly "
        "for validation with no additional preprocessing. If you want some number "
        "of repetitions, include that in your dataset via `num_repeats` in your "
        "dataset config, and `ng_prepare_data` will prepare it accordingly."
    )

    def __init__(self) -> None:
        # Pass the message to ValueError so it actually appears when raised;
        # a docstring alone would leave the exception message empty.
        super().__init__(self._MESSAGE)

# Then use:
raise UnsupportedPenguinConfigError()
nemo_rl/experience/rollouts.py (1)

1036-1102: LGTM! Metric aggregation and per-agent logging properly implemented.

The timing instrumentation around metric aggregation and per-agent metric calculation provides good observability. The full result logging via wandb.Table enables detailed debugging.

Consider adding strict=True to the zip on line 1076 for defensive programming, though the paired nature of penguin_rows and results makes this low risk:

-        for penguin_row, result in zip(penguin_rows, results):
+        for penguin_row, result in zip(penguin_rows, results, strict=True):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6984ba7 and bd84c00.

📒 Files selected for processing (7)
  • examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml (1 hunks)
  • examples/penguin/run_grpo_penguin.py (1 hunks)
  • examples/penguin/run_penguin_single_node_sanity_tests.sh (1 hunks)
  • nemo_rl/environments/penguin.py (4 hunks)
  • nemo_rl/experience/rollouts.py (3 hunks)
  • tests/unit/environments/test_penguin.py (3 hunks)
  • tests/unit/experience/test_rollouts.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/experience/rollouts.py
  • nemo_rl/environments/penguin.py
  • tests/unit/experience/test_rollouts.py
  • tests/unit/environments/test_penguin.py
  • examples/penguin/run_grpo_penguin.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/experience/rollouts.py
  • nemo_rl/environments/penguin.py
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • examples/penguin/run_penguin_single_node_sanity_tests.sh
🧠 Learnings (5)
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • nemo_rl/experience/rollouts.py
  • examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults

Applied to files:

  • nemo_rl/environments/penguin.py
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • examples/penguin/run_penguin_single_node_sanity_tests.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • examples/penguin/run_penguin_single_node_sanity_tests.sh
🧬 Code graph analysis (5)
nemo_rl/experience/rollouts.py (2)
nemo_rl/utils/timer.py (5)
  • Timer (22-248)
  • start (79-83)
  • time (110-123)
  • stop (85-107)
  • get_timing_metrics (196-233)
nemo_rl/environments/penguin.py (1)
  • run_rollouts (106-138)
nemo_rl/environments/penguin.py (1)
nemo_rl/utils/timer.py (3)
  • Timer (22-248)
  • time (110-123)
  • get_timing_metrics (196-233)
examples/penguin/run_penguin_single_node_sanity_tests.sh (1)
tests/unit/environments/test_penguin.py (1)
  • penguin (80-130)
tests/unit/environments/test_penguin.py (1)
nemo_rl/environments/penguin.py (1)
  • run_rollouts (106-138)
examples/penguin/run_grpo_penguin.py (13)
nemo_rl/models/policy/interfaces.py (1)
  • ColocatablePolicyInterface (141-168)
nemo_rl/models/generation/interfaces.py (1)
  • GenerationInterface (215-254)
nemo_rl/utils/logger.py (2)
  • Logger (804-1039)
  • get_next_experiment_dir (1328-1362)
nemo_rl/algorithms/grpo.py (2)
  • _should_use_penguin (728-748)
  • refit_policy_generation (751-822)
nemo_rl/algorithms/utils.py (1)
  • get_tokenizer (184-315)
nemo_rl/data/datasets/processed_dataset.py (1)
  • AllTaskProcessedDataset (31-126)
nemo_rl/data/interfaces.py (1)
  • DatumSpec (32-40)
nemo_rl/distributed/ray_actor_environment_registry.py (1)
  • get_actor_python_env (49-64)
nemo_rl/distributed/virtual_cluster.py (1)
  • init_ray (85-171)
nemo_rl/environments/penguin.py (5)
  • Penguin (34-209)
  • PenguinConfig (27-30)
  • penguin_example_to_nemo_rl_datum_spec (235-248)
  • setup_penguin_config (217-226)
  • health_check (103-104)
nemo_rl/experience/rollouts.py (1)
  • run_async_penguin_rollout (958-1145)
nemo_rl/models/generation/__init__.py (1)
  • configure_generation_config (25-54)
nemo_rl/utils/config.py (1)
  • parse_hydra_overrides (146-166)
🪛 Ruff (0.14.3)
nemo_rl/experience/rollouts.py

1076-1076: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

examples/penguin/run_grpo_penguin.py

103-103: Do not assign a lambda expression, use a def

Rewrite passthrough_task_processor as a def

(E731)


103-103: Unused lambda argument: args

(ARG005)


103-103: Unused lambda argument: kwargs

(ARG005)


216-222: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Unpacked variable cluster is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🪛 Shellcheck (0.11.0)
examples/penguin/run_penguin_single_node_sanity_tests.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (12)
tests/unit/experience/test_rollouts.py (1)

785-792: LGTM! Timing metrics properly integrated.

The new timing metric keys align with the Timer instrumentation added to the rollout flow. The test correctly normalizes these values to None during comparison, avoiding flakiness from non-deterministic timing.

tests/unit/environments/test_penguin.py (2)

145-166: LGTM! Test properly updated for new signature.

The test correctly passes the penguin_tokenizer fixture to run_rollouts and handles the updated return type (tuple of results and timing metrics). The empty timer_prefix is appropriate for test context.


169-191: LGTM! Proper normalization for reproducible tests.

The token_ids conversion and dummy value substitution for prompt_str/generation_str fields ensure deterministic test comparisons while accommodating the new decoded string fields added to penguin results.

nemo_rl/experience/rollouts.py (4)

51-51: LGTM! Timer instrumentation properly initialized.

The Timer import and initialization follow the correct pattern. The timer_prefix convention ("timing/rollout") provides clear namespacing for timing metrics.

Also applies to: 991-993


1007-1013: LGTM! Rollout timing properly captured.

The timer context around penguin_environment.run_rollouts correctly measures end-to-end rollout time, and the tuple unpacking matches the updated Penguin environment signature.


1016-1033: LGTM! Metric preparation properly timed.

The timer context around metric preparation provides visibility into preprocessing overhead. The metrics calculation correctly extracts token counts, rewards, and turns from Penguin results.


1108-1109: LGTM! Timer properly finalized and merged.

The timer is correctly stopped and timing metrics are merged into rollout_metrics using sum reduction, which is appropriate for cumulative timing measurements across the batch.
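A sum-reduction merge like the one described can be sketched with a hypothetical helper; the real code merges through the Timer/rollout_metrics machinery, not this function:

```python
def merge_timing_metrics(
    rollout_metrics: dict[str, float],
    timing_metrics: dict[str, float],
) -> dict[str, float]:
    """Merge timing metrics into rollout metrics, summing shared keys.

    Hypothetical helper illustrating sum reduction for cumulative timings.
    """
    merged = dict(rollout_metrics)
    for key, value in timing_metrics.items():
        # Sum reduction: accumulate rather than overwrite existing values.
        merged[key] = merged.get(key, 0.0) + value
    return merged


merged = merge_timing_metrics(
    {"timing/rollout/total_time": 1.5, "reward_mean": 0.4},
    {"timing/rollout/total_time": 0.5, "timing/rollout/postprocess": 0.2},
)
```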

examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml (1)

1-267: LGTM! Comprehensive GRPO configuration for Penguin.

The configuration properly sets up GRPO training with the Qwen3-4B-Instruct model for Penguin integration. Key settings are correctly configured:

  • async_engine: true and expose_http_server: true (required for Penguin)
  • max_new_tokens set to full context length (per codebase pattern)
  • Tool parser configured for Qwen 3 4B Instruct
  • Data paths point to Penguin workspace

The configuration aligns with the new Penguin integration in the codebase.

Based on learnings

nemo_rl/environments/penguin.py (4)

19-19: LGTM! Required imports for new functionality.

The PreTrainedTokenizerBase import enables tokenizer-aware postprocessing, and the Timer import supports the new timing instrumentation in rollout execution.

Also applies to: 24-24


51-68: LGTM! More defensive config handling with helpful diagnostics.

The use of get() with a fallback and setdefault() for aiohttp limits makes the initialization more robust. The diagnostic print provides visibility into connection limit settings, which is helpful for debugging concurrency issues.


106-138: LGTM! Comprehensive timing instrumentation added.

The updated signature properly accepts a tokenizer and timer_prefix, enabling downstream string decoding and timing metrics. The timing instrumentation provides excellent observability:

  • Per-task await and postprocessing times
  • Total time tracking
  • Postprocess percentage calculation

The tuple return (nemo_rl_results, timing_metrics) is properly handled by callers.
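The postprocess-percentage metric mentioned above amounts to a simple ratio; this helper is hypothetical, since the PR only reports that such a percentage is computed:

```python
def postprocess_percentage(postprocess_time: float, total_time: float) -> float:
    """Share of total rollout time spent in postprocessing, in percent.

    Guards against a zero total so an empty rollout does not divide by zero.
    """
    return 100.0 * postprocess_time / total_time if total_time else 0.0


pct = postprocess_percentage(0.5, 2.0)
```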


140-198: LGTM! Tokenizer integration with memory optimization.

The postprocessing now accepts a tokenizer to decode token IDs into human-readable strings. The approach of:

  1. Converting token IDs to torch tensors for NeMo RL compatibility
  2. Decoding to strings for logging
  3. Popping large tensor fields from the result

...provides a good balance between functionality and memory efficiency. The prompt_str and generation_str fields improve debugging without bloating logs with full token ID tensors.
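The three-step flow above can be sketched dependency-free: plain lists stand in for the torch tensors used in the real code, StubTokenizer for the real tokenizer, and the key names other than prompt_str/generation_str (e.g. raw_response) are illustrative rather than the actual fields.

```python
from typing import Any


class StubTokenizer:
    """Stand-in for a tokenizer's decode()."""

    def decode(self, ids: list[int]) -> str:
        return " ".join(f"tok{i}" for i in ids)


def postprocess_result(
    result: dict[str, Any], tokenizer: StubTokenizer
) -> dict[str, Any]:
    # 1. The real code converts token-id fields to torch tensors for
    #    NeMo RL compatibility; lists keep this sketch dependency-free.
    prompt_ids = result["prompt_token_ids"]
    gen_ids = result["generation_token_ids"]
    # 2. Decode to human-readable strings for logging/debugging.
    result["prompt_str"] = tokenizer.decode(prompt_ids)
    result["generation_str"] = tokenizer.decode(gen_ids)
    # 3. Pop large raw fields so logs stay small.
    result.pop("raw_response", None)
    return result


out = postprocess_result(
    {"prompt_token_ids": [1, 2], "generation_token_ids": [3], "raw_response": "x"},
    StubTokenizer(),
)
```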

Signed-off-by: Brian Yu <bxyu@nvidia.com>
parthchadha previously approved these changes Nov 6, 2025
Signed-off-by: Brian Yu <bxyu@nvidia.com>
@parthchadha parthchadha added the CI:L0 Run doctests and unit tests label Nov 7, 2025
@terrykong terrykong enabled auto-merge (squash) November 7, 2025 07:30
@terrykong terrykong merged commit 40de222 into main Nov 7, 2025
40 of 41 checks passed
@terrykong terrykong deleted the bxyu/run-penguin branch November 7, 2025 08:47
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 20, 2026
4 tasks
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L0 Run doctests and unit tests

3 participants