feat: Support for nano-v2 #1514

Merged: terrykong merged 8 commits into main from yifu/nano-v2-main on Nov 18, 2025
Conversation


@yfw yfw commented Nov 13, 2025

What does this PR do ?

Adds support for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and nvidia/NVIDIA-Nemotron-Nano-12B-v2

Issues

Issues this PR closes:
closes #1500
closes #1503

Usage

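The usage section of the PR template was left unfilled. As a hedged illustration only, the new 1-node recipe centers on overrides like the following; the key names here are assumptions based on typical NeMo RL GRPO configs and are not copied from this PR's diff (the recipe file path is the one added by the PR):

```yaml
# Illustrative sketch only: key names are assumed, not taken from the PR.
# The actual recipe added by this PR lives at
# examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml.
policy:
  model_name: nvidia/NVIDIA-Nemotron-Nano-12B-v2
  megatron_cfg:
    tensor_model_parallel_size: 8   # 1 node x 8 GPUs, per the 1n8g recipe
```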

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

Sample runs:

(Screenshot: sample runs, 2025-11-13 11:01 AM)

Summary by CodeRabbit

  • New Features

    • Added YAML configurations for GRPO experiments with Nano-v2-12B models supporting both single-node (1N8G-Megatron) and multi-node (2N8G-FSDP2TP1) distributed training setups.
  • Bug Fixes

    • Enhanced robustness with safety checks for MLP layer access and improved parameter handling for packed sequences.
  • Tests

    • Added comprehensive test scripts for new GRPO configurations and expanded nightly test coverage.

yfw added 2 commits November 12, 2025 18:34
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: bb99fd4 (PR #1514 from yifu/nano-v2-main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 92c40c1 (PR #1514 from yifu/nano-v2-main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@yfw yfw changed the title from "Support for nano-v2" to "feat: Support for nano-v2" on Nov 13, 2025
yfw added 2 commits November 13, 2025 09:58
Prevents incorrect dp size in parallel_state during initial import.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 2cd85c5 (PR #1514 from yifu/nano-v2-main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Nov 13, 2025
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 847d8cc (PR #1514 from yifu/nano-v2-main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@yfw yfw marked this pull request as ready for review November 13, 2025 18:35
@yfw yfw requested review from a team as code owners November 13, 2025 18:35
@yfw yfw requested review from terrykong and yaoyu-33 November 13, 2025 18:36
@yfw
Contributor Author

yfw commented Nov 13, 2025

Megatron-Bridge change is a bump to this commit which has some required fixes for nano-v2: NVIDIA-NeMo/Megatron-Bridge@8aa287d

@coderabbitai
Contributor

coderabbitai bot commented Nov 13, 2025

📝 Walkthrough

Updates Megatron-Bridge submodule reference, adds two new GRPO experiment configurations for Nano-v2 12B model, implements conditional packed_seq_params handling in Megatron model integration, adds context_parallel_size tracking for model export, includes MoE router safety checks, and introduces corresponding test scripts and nightly test entries.

Changes

  • Submodule Reference — 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge: updated the submodule commit from f003cd8 to 8aa287d.
  • GRPO Configuration Files — examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml, examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml: added two YAML configuration files for Nano-v2 12B GRPO experiments with distinct parallelism strategies (single-node Megatron and two-node FSDP2TP1).
  • Megatron Integration Code — nemo_rl/models/megatron/common.py, nemo_rl/models/megatron/community_import.py: modified packed_seq_params handling to conditionally pass it to the model; added context_parallel_size tracking during the import/export flow.
  • Policy Worker Updates — nemo_rl/models/policy/megatron_policy_worker.py: added safety checks for MoE router access; updated the forward path to conditionally pass packed_seq_params via kwargs.
  • Test Scripts — tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh, tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh: added two new bash test scripts for GRPO experiment validation with metrics checks (loss, token error, reward, timing).
  • Nightly Test Registry — tests/test_suites/nightly.txt: added entries for the two new GRPO test suite scripts.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant common.py
    participant Model
    participant policy_worker.py

    rect rgb(200, 220, 255)
    Note over Caller,Model: Previous Flow (packed_seq_params always passed)
    Caller->>common.py: pack_sequences(..., packed_seq_params)
    common.py->>Model: model(..., packed_seq_params=packed_seq_params)
    Note over Model: Always receives packed_seq_params<br/>(even if None)
    end

    rect rgb(220, 255, 220)
    Note over Caller,Model: New Flow (packed_seq_params conditionally passed)
    Caller->>common.py: pack_sequences(..., packed_seq_params)
    alt packed_seq_params is not None
        common.py->>common.py: additional_kwargs = {packed_seq_params}
    else packed_seq_params is None
        common.py->>common.py: additional_kwargs = {}
    end
    common.py->>Model: model(..., **additional_kwargs)
    Note over Model: Receives packed_seq_params only if provided
    end

    rect rgb(255, 240, 220)
    Note over policy_worker.py: MoE Router Safety Check
    policy_worker.py->>policy_worker.py: Check layer.mlp.router exists
    alt Router exists
        policy_worker.py->>policy_worker.py: Freeze router
    else Router missing
        policy_worker.py->>policy_worker.py: Skip safely
    end
    end
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Conditional packed_seq_params handling follows a consistent, straightforward pattern applied in two files (common.py and policy_worker.py)
  • Context_parallel_size tracking involves isolated, self-contained additions to community_import.py
  • MoE router safety checks use standard defensive programming (attribute existence checks)
  • Configuration files and test scripts follow established patterns with no complex logic
  • Changes are well-scoped with minimal cross-file dependencies

Areas for attention:

  • Verify the conditional kwargs pattern correctly preserves existing behavior when packed_seq_params is None
  • Confirm context_parallel_size is properly initialized to 0 in the export path for mamba mixer compatibility
  • Validate MoE router attribute checks cover all required cases and don't mask legitimate errors

Suggested labels

mcore

Suggested reviewers

  • yaoyu-33
  • terrykong
  • zpqiu

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: the PR introduces a major feature (new Nemotron Nano v2 model support) with code changes, but the PR description shows all pre-checks unchecked and contains no test results, performance data, or testing evidence. Resolution: execute the new test scripts, document results in the PR description, verify no regressions against baseline, and mark pre-checks complete before merging.

✅ Passed checks (3 passed)

  • Title check — ✅ Passed: the title clearly identifies the main change (adding support for nano-v2 models), which is reflected throughout the changeset including new configs, test scripts, and code adjustments.
  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)

1557-1563: Inconsistent packed_seq_params forwarding pattern.

In get_topk_logits, packed_seq_params is passed directly to the model (line 1561), which means it will always be passed even when None. This is inconsistent with the conditional forwarding pattern introduced in get_logprobs (lines 1274-1283) and forward_step_arbitrary_loss in common.py (lines 351-360).

Consider applying the same pattern here for consistency:

```diff
+            additional_kwargs = {}
+            if packed_seq_params is not None:
+                additional_kwargs["packed_seq_params"] = packed_seq_params
+
             output_tensor = model(
                 input_ids=input_ids_cp_sharded,
                 position_ids=position_ids,
                 attention_mask=attention_mask,
-                packed_seq_params=packed_seq_params,
                 **multimodal_data,
+                **additional_kwargs,
             )
```
🧹 Nitpick comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)

272-273: Consider more defensive attribute checking.

While the current check prevents AttributeError when mlp or router don't exist, it could still fail if layer.mlp exists but is None. Consider using the more defensive pattern shown later in this file (lines 2374-2375):

```diff
-                if hasattr(layer, "mlp") and hasattr(layer.mlp, "router"):
-                    layer.mlp.router.weight.requires_grad = False
+                mlp = getattr(layer, "mlp", None)
+                if mlp is not None and hasattr(mlp, "router"):
+                    mlp.router.weight.requires_grad = False
```
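The getattr-based pattern suggested above can be exercised in a standalone sketch; the layer objects here are toy stand-ins built from SimpleNamespace, not the real Megatron classes:

```python
from types import SimpleNamespace

# Toy stand-ins, not the real Megatron layer classes.
moe_layer = SimpleNamespace(
    mlp=SimpleNamespace(router=SimpleNamespace(weight=SimpleNamespace(requires_grad=True)))
)
dense_layer = SimpleNamespace(mlp=None)  # mlp attribute exists but is None

def freeze_router(layer):
    """getattr-based check from the review: safe even when layer.mlp is None."""
    mlp = getattr(layer, "mlp", None)
    if mlp is not None and hasattr(mlp, "router"):
        mlp.router.weight.requires_grad = False
        return True
    return False

print(freeze_router(dense_layer))  # → False (skipped safely)
print(freeze_router(moe_layer))    # → True (router weight frozen)
```

Swapping the attribute chain for getattr-with-default keeps the dense-layer path from depending on what `layer.mlp` happens to be.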
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 779f775 and 847d8cc.

📒 Files selected for processing (9)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
  • examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml (1 hunks)
  • nemo_rl/models/megatron/common.py (1 hunks)
  • nemo_rl/models/megatron/community_import.py (4 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (2 hunks)
  • tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh (1 hunks)
  • tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh (1 hunks)
  • tests/test_suites/nightly.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.

Applied to files:

  • examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml
  • examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh
  • tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh
  • tests/test_suites/nightly.txt
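The NUM_RUNS ceiling-division idiom cited in the learning above can be reproduced on its own; the values here are example numbers, not those of any particular test script:

```shell
# Ceiling division in shell arithmetic: NUM_RUNS rounds MAX_STEPS up to a
# whole number of STEPS_PER_RUN-sized runs. Values below are illustrative.
STEPS_PER_RUN=10
MAX_STEPS=25
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"  # → 3 (25 steps need three 10-step runs)
```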
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh
  • tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/megatron/community_import.py
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/nightly.txt
🪛 Shellcheck (0.11.0)

tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh

  • [warning] SC2034 (lines 6, 9, 10): NUM_NODES, NUM_RUNS, and NUM_MINUTES appear unused. Verify use (or export if used externally).
  • [warning] SC2164 (line 16): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
  • [error] SC2068 (line 28): Double quote array expansions to avoid re-splitting elements.

tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh

  • Same three findings: SC2034 (lines 6, 9, 10), SC2164 (line 16), SC2068 (line 28).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (10)
3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1)

1-1: Ignore this review comment—no submodule change is present in this PR.

Verification reveals that the .gitmodules file is unmodified and no submodule updates are staged. The commit hashes referenced in the review (both old and new) do not exist in the Megatron-Bridge repository. The review was based on incorrect assumptions about a submodule change that does not actually occur in this pull request.

Likely an incorrect or invalid review comment.

nemo_rl/models/megatron/common.py (1)

351-360: LGTM! Clean conditional parameter forwarding.

The pattern of conditionally adding packed_seq_params to additional_kwargs only when it's not None is a good practice. This prevents passing unnecessary None values to the model and provides a clean extension point for future optional parameters.
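The behavioral difference matters when a model's forward signature lacks the parameter entirely; a minimal sketch with hypothetical stubs (not the real model API):

```python
def model_forward(input_ids, attention_mask=None):
    # Hypothetical model whose signature has no packed_seq_params:
    # calling it with packed_seq_params=None would raise TypeError.
    return [i * 2 for i in input_ids]

def run(model, input_ids, packed_seq_params=None):
    # Conditional-forwarding pattern from the PR: include the kwarg only
    # when a value was actually provided.
    additional_kwargs = {}
    if packed_seq_params is not None:
        additional_kwargs["packed_seq_params"] = packed_seq_params
    return model(input_ids=input_ids, **additional_kwargs)

print(run(model_forward, [1, 2, 3]))  # kwarg omitted, call succeeds → [2, 4, 6]
```

Passing `packed_seq_params=None` unconditionally would instead crash on any model that never declares the parameter.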

nemo_rl/models/policy/megatron_policy_worker.py (1)

1274-1283: LGTM! Consistent with common.py changes.

The conditional forwarding of packed_seq_params via additional_kwargs aligns well with the pattern introduced in nemo_rl/models/megatron/common.py (lines 351-360).

nemo_rl/models/megatron/community_import.py (2)

45-45: LGTM! Consistent context_parallel_size tracking.

The tracking and restoration of context_parallel_size follows the same pattern as other parallelism settings (tensor, pipeline, expert) already present in the code. This ensures runtime parallelism settings don't persist in saved checkpoints.

Also applies to: 63-63, 87-87
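The save-and-restore idiom the comment describes can be sketched as follows; ModelConfig and the export body are hypothetical stand-ins, not the real Megatron-Bridge API:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:  # hypothetical stand-in for the real Megatron model config
    tensor_model_parallel_size: int = 4
    context_parallel_size: int = 2

def export_checkpoint(cfg):
    # Remember runtime parallelism, reset to serial values so they don't
    # persist in the saved checkpoint, then restore the live config.
    saved = (cfg.tensor_model_parallel_size, cfg.context_parallel_size)
    cfg.tensor_model_parallel_size = 1
    cfg.context_parallel_size = 1
    try:
        # Stand-in for actually writing checkpoint files to disk.
        exported = {"tp": cfg.tensor_model_parallel_size,
                    "cp": cfg.context_parallel_size}
    finally:
        cfg.tensor_model_parallel_size, cfg.context_parallel_size = saved
    return exported

cfg = ModelConfig()
print(export_checkpoint(cfg))       # → {'tp': 1, 'cp': 1}
print(cfg.context_parallel_size)    # → 2 (runtime value restored)
```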


128-131: Manual seed initialization for mamba mixer.

The addition of model_parallel_cuda_manual_seed(0) in the CPU-distributed export context is noted in the summary as required for mamba mixer. This appears to be a targeted fix for a specific model architecture requirement.

tests/test_suites/nightly.txt (1)

51-53: LGTM! Test coverage for nano-v2 models.

The addition of nightly test entries for the new nano-v2 12B configurations follows the existing structure and provides appropriate test coverage for both Megatron and FSDP2/TP1 variants.

examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml (1)

1-34: LGTM! Well-configured nano-v2 Megatron setup.

The configuration appropriately sets up the NVIDIA Nemotron Nano 12B v2 model with:

  • Megatron backend with TP=8 for 1-node 8-GPU deployment
  • Explicit feature toggles (bias_activation_fusion disabled, sequence_packing disabled)
  • Consistent 512-token limits across generation and data settings
  • Proper logging and checkpointing configuration
examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml (1)

1-44: LGTM! Memory-optimized FSDP2 configuration.

The 2-node FSDP2/TP1 configuration includes appropriate memory optimizations:

  • CPU offload and activation checkpointing enabled
  • Dynamic batching for efficient resource utilization
  • Multi-stage scheduler with 13-step linear warmup

The configuration complements the Megatron variant and provides an alternative deployment strategy for the nano-v2 12B model.

tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh (1)

1-41: LGTM! Test script follows project conventions.

The test script properly:

  • Configures a 30-step GRPO experiment with comprehensive logging (WandB, TensorBoard)
  • Converts TensorBoard logs to JSON for automated validation
  • Applies appropriate metrics thresholds for train loss, token error, reward, and timing

The script follows established patterns in the repository, including variable naming and argument passing conventions.

Based on learnings
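The threshold-style validation described above (TensorBoard logs converted to JSON, then checked against limits) can be sketched as follows; the metric names and limits here are illustrative, not the script's actual values:

```python
import json

def check_metrics(metrics_json, thresholds):
    """Return (name, value, limit) entries whose value exceeds its limit."""
    metrics = json.loads(metrics_json)
    return [(name, metrics[name], limit)
            for name, limit in thresholds.items()
            if metrics[name] > limit]

# Hypothetical converted-TensorBoard output and illustrative limits.
run = json.dumps({"train_loss": 0.42, "token_error": 1.05, "step_time_s": 75.0})
limits = {"train_loss": 1.0, "token_error": 1.10, "step_time_s": 80.0}
print(check_metrics(run, limits))  # → [] (all metrics within limits)
```

An empty result means the run passes; any returned tuple identifies which metric regressed and by how much.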

tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh (1)

1-41: LGTM! 2-node test variant properly configured.

The 2-node FSDP2/TP1 test script mirrors the 1-node Megatron variant with appropriate adjustments:

  • NUM_NODES set to 2 for multi-node testing
  • Stricter timing threshold (60s vs 80s), reflecting expected performance improvement with additional nodes
  • Same validation metrics for consistency

Based on learnings

Add packed_seq_params change to get_topk_logits too

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 608cf6d (PR #1514 from yifu/nano-v2-main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 13, 2025
@terrykong terrykong (Collaborator) left a comment

small comment. otherwise lgtm. do you mind including the convergence wandb/plots

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 8656e81 (PR #1514 from yifu/nano-v2-main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@yfw
Contributor Author

yfw commented Nov 14, 2025

small comment. otherwise lgtm. do you mind including the convergence wandb/plots

Added plots of the nightly tests in the top PR comment

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 14, 2025
@terrykong terrykong enabled auto-merge (squash) November 14, 2025 21:02
@terrykong terrykong merged commit c32778d into main Nov 18, 2025
41 of 43 checks passed
@terrykong terrykong deleted the yifu/nano-v2-main branch November 18, 2025 05:57
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@coderabbitai coderabbitai bot mentioned this pull request Jan 13, 2026
4 tasks
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Limited Support for Nemotron Models (e.g., Nano 12B) in NeMo RL Training Pipeline
  • Support nano v2 on main branch

2 participants