Conversation
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Prevents incorrect dp size in parallel_state during initial import. Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Megatron-Bridge change is a bump to this commit, which has some required fixes for nano-v2: NVIDIA-NeMo/Megatron-Bridge@8aa287d
📝 Walkthrough

Updates the Megatron-Bridge submodule reference, adds two new GRPO experiment configurations for the Nano-v2 12B model, implements conditional packed_seq_params handling in the Megatron model integration, adds context_parallel_size tracking for model export, includes MoE router safety checks, and introduces corresponding test scripts and nightly test entries.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant common.py
    participant Model
    participant policy_worker.py

    rect rgb(200, 220, 255)
    Note over Caller,Model: Previous Flow (packed_seq_params always passed)
    Caller->>common.py: pack_sequences(..., packed_seq_params)
    common.py->>Model: model(..., packed_seq_params=packed_seq_params)
    Note over Model: Always receives packed_seq_params<br/>(even if None)
    end

    rect rgb(220, 255, 220)
    Note over Caller,Model: New Flow (packed_seq_params conditionally passed)
    Caller->>common.py: pack_sequences(..., packed_seq_params)
    alt packed_seq_params is not None
        common.py->>common.py: additional_kwargs = {packed_seq_params}
    else packed_seq_params is None
        common.py->>common.py: additional_kwargs = {}
    end
    common.py->>Model: model(..., **additional_kwargs)
    Note over Model: Receives packed_seq_params only if provided
    end

    rect rgb(255, 240, 220)
    Note over policy_worker.py: MoE Router Safety Check
    policy_worker.py->>policy_worker.py: Check layer.mlp.router exists
    alt Router exists
        policy_worker.py->>policy_worker.py: Freeze router
    else Router missing
        policy_worker.py->>policy_worker.py: Skip safely
    end
    end
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)
1557-1563: Inconsistent packed_seq_params forwarding pattern.

In `get_topk_logits`, `packed_seq_params` is passed directly to the model (line 1561), which means it is always passed even when `None`. This is inconsistent with the conditional forwarding pattern introduced in `get_logprobs` (lines 1274-1283) and `forward_step_arbitrary_loss` in common.py (lines 351-360). Consider applying the same pattern here for consistency:

```diff
+ additional_kwargs = {}
+ if packed_seq_params is not None:
+     additional_kwargs["packed_seq_params"] = packed_seq_params
+
  output_tensor = model(
      input_ids=input_ids_cp_sharded,
      position_ids=position_ids,
      attention_mask=attention_mask,
-     packed_seq_params=packed_seq_params,
      **multimodal_data,
+     **additional_kwargs,
  )
```
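The conditional-forwarding idiom suggested here is plain Python and easy to check in isolation. The sketch below uses a stand-in `model` function (not the real Megatron API) to show that the callee only ever sees `packed_seq_params` when a value was actually provided:

```python
def model(input_ids, position_ids, attention_mask, **kwargs):
    # Stand-in for the real model call: report whether the optional
    # argument arrived at all (not merely whether it was None).
    return "packed_seq_params" in kwargs

def forward(input_ids, position_ids, attention_mask, packed_seq_params=None):
    # Only add the key when a value exists, so the model never receives
    # an explicit packed_seq_params=None.
    additional_kwargs = {}
    if packed_seq_params is not None:
        additional_kwargs["packed_seq_params"] = packed_seq_params
    return model(
        input_ids=input_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        **additional_kwargs,
    )

print(forward([1, 2], [0, 1], None))  # False: key omitted entirely
print(forward([1, 2], [0, 1], None, packed_seq_params={"cu_seqlens": [0, 2]}))  # True
```

This keeps the call site compatible with model signatures that reject, or behave differently for, an explicit `packed_seq_params=None`.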
🧹 Nitpick comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)
272-273: Consider more defensive attribute checking.

While the current check prevents an AttributeError when `mlp` or `router` don't exist, it could still fail if `layer.mlp` exists but is `None`. Consider using the more defensive pattern shown later in this file (lines 2374-2375):

```diff
- if hasattr(layer, "mlp") and hasattr(layer.mlp, "router"):
-     layer.mlp.router.weight.requires_grad = False
+ mlp = getattr(layer, "mlp", None)
+ if mlp is not None and hasattr(mlp, "router"):
+     mlp.router.weight.requires_grad = False
```
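As a quick illustration of the `getattr`-based defensive pattern, here is a toy layer hierarchy (hypothetical classes, not the actual Megatron modules):

```python
class Weight:
    def __init__(self):
        self.requires_grad = True

class Router:
    def __init__(self):
        self.weight = Weight()

class MoEMlp:
    def __init__(self):
        self.router = Router()

class Layer:
    def __init__(self, mlp):
        self.mlp = mlp

def freeze_router(layer):
    # getattr with a default handles both a missing attribute and an
    # attribute explicitly set to None before touching .router.
    mlp = getattr(layer, "mlp", None)
    if mlp is not None and hasattr(mlp, "router"):
        mlp.router.weight.requires_grad = False
        return True
    return False

moe_layer = Layer(MoEMlp())
dense_layer = Layer(None)
print(freeze_router(moe_layer))    # True: router weight frozen
print(freeze_router(dense_layer))  # False: skipped safely
```

The same `freeze_router` body works unchanged for MoE layers, dense layers with `mlp=None`, and objects that lack an `mlp` attribute entirely.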
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
- examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml (1 hunks)
- nemo_rl/models/megatron/common.py (1 hunks)
- nemo_rl/models/megatron/community_import.py (4 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (2 hunks)
- tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh (1 hunks)
- tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh (1 hunks)
- tests/test_suites/nightly.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.
Applied to files:
- examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml
- examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
- tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh
- tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh
- tests/test_suites/nightly.txt
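The `NUM_RUNS` expression quoted in that learning is integer ceiling division, `ceil(MAX_STEPS / STEPS_PER_RUN)`, written with the standard `(a + b - 1) / b` trick. A small sketch with illustrative values:

```shell
#!/bin/sh
# Ceiling division in POSIX shell arithmetic: (a + b - 1) / b
MAX_STEPS=100
STEPS_PER_RUN=30
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"   # 100 steps at 30 per run -> 4 runs
```

The `+ STEPS_PER_RUN - 1` term bumps any nonzero remainder into an extra run, so a partial final run is still scheduled.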
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
- tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh
- tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
Applied to files:
nemo_rl/models/megatron/community_import.py
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
tests/test_suites/nightly.txt
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
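For context on the SC2068 findings: the repo deliberately keeps `$@` unquoted for consistency (per the learnings above), but the behavior Shellcheck warns about is that unquoted `$@` re-splits arguments on whitespace. A small demonstration:

```shell
#!/bin/sh
count_args() { echo "$#"; }

set -- "one arg" "two"
count_args $@     # unquoted: "one arg" splits, yielding 3 words
count_args "$@"   # quoted: the original 2 arguments are preserved
```

This only matters when arguments can contain spaces; the test launch paths here apparently never pass such arguments, which is why the project accepts the unquoted form.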
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (10)
3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1)
1-1: Ignore this review comment; no submodule change is present in this PR.

Verification reveals that the `.gitmodules` file is unmodified and no submodule updates are staged. The commit hashes referenced in the review (both old and new) do not exist in the Megatron-Bridge repository. The review was based on incorrect assumptions about a submodule change that does not occur in this pull request.

Likely an incorrect or invalid review comment.
nemo_rl/models/megatron/common.py (1)
351-360: LGTM! Clean conditional parameter forwarding.

The pattern of conditionally adding `packed_seq_params` to `additional_kwargs` only when it's not `None` is a good practice. This prevents passing unnecessary `None` values to the model and provides a clean extension point for future optional parameters.

nemo_rl/models/policy/megatron_policy_worker.py (1)
1274-1283: LGTM! Consistent with common.py changes.

The conditional forwarding of `packed_seq_params` via `additional_kwargs` aligns well with the pattern introduced in `nemo_rl/models/megatron/common.py` (lines 351-360).

nemo_rl/models/megatron/community_import.py (2)
45-45: LGTM! Consistent context_parallel_size tracking.

The tracking and restoration of `context_parallel_size` follows the same pattern as the other parallelism settings (tensor, pipeline, expert) already present in the code. This ensures runtime parallelism settings don't persist in saved checkpoints.

Also applies to: 63-63, 87-87
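The save-and-restore pattern this comment describes can be sketched generically. The attribute names below follow the review text (tensor/pipeline/expert/context parallel sizes); the actual code in `community_import.py` may structure this differently:

```python
from contextlib import contextmanager

PARALLEL_ATTRS = (
    "tensor_model_parallel_size",
    "pipeline_model_parallel_size",
    "expert_model_parallel_size",
    "context_parallel_size",
)

@contextmanager
def scrubbed_parallelism(cfg):
    # Reset runtime parallelism settings to 1 for the duration of an
    # export so they don't persist into the saved checkpoint, then
    # restore the original values afterwards.
    saved = {a: getattr(cfg, a) for a in PARALLEL_ATTRS}
    try:
        for a in PARALLEL_ATTRS:
            setattr(cfg, a, 1)
        yield cfg
    finally:
        for a, v in saved.items():
            setattr(cfg, a, v)

class Cfg:
    def __init__(self):
        self.tensor_model_parallel_size = 4
        self.pipeline_model_parallel_size = 2
        self.expert_model_parallel_size = 1
        self.context_parallel_size = 2

cfg = Cfg()
with scrubbed_parallelism(cfg):
    print(cfg.context_parallel_size)  # 1 while exporting
print(cfg.context_parallel_size)      # 2, restored after the block
```

The `finally` clause guarantees restoration even if the export inside the `with` block raises.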
128-131: Manual seed initialization for mamba mixer.

The addition of `model_parallel_cuda_manual_seed(0)` in the CPU-distributed export context is noted in the summary as required for the mamba mixer. This appears to be a targeted fix for a specific model architecture requirement.

tests/test_suites/nightly.txt (1)
51-53: LGTM! Test coverage for nano-v2 models.

The addition of nightly test entries for the new nano-v2 12B configurations follows the existing structure and provides appropriate test coverage for both the Megatron and FSDP2/TP1 variants.
examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml (1)
1-34: LGTM! Well-configured nano-v2 Megatron setup.

The configuration appropriately sets up the NVIDIA Nemotron Nano 12B v2 model with:
- Megatron backend with TP=8 for 1-node 8-GPU deployment
- Explicit feature toggles (bias_activation_fusion disabled, sequence_packing disabled)
- Consistent 512-token limits across generation and data settings
- Proper logging and checkpointing configuration
examples/configs/recipes/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.yaml (1)
1-44: LGTM! Memory-optimized FSDP2 configuration.

The 2-node FSDP2/TP1 configuration includes appropriate memory optimizations:
- CPU offload and activation checkpointing enabled
- Dynamic batching for efficient resource utilization
- Multi-stage scheduler with 13-step linear warmup
The configuration complements the Megatron variant and provides an alternative deployment strategy for the nano-v2 12B model.
tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh (1)
1-41: LGTM! Test script follows project conventions.

The test script properly:
- Configures a 30-step GRPO experiment with comprehensive logging (WandB, TensorBoard)
- Converts TensorBoard logs to JSON for automated validation
- Applies appropriate metrics thresholds for train loss, token error, reward, and timing
The script follows established patterns in the repository, including variable naming and argument passing conventions.
Based on learnings
tests/test_suites/llm/grpo-nano-v2-12b-2n8g-fsdp2tp1.sh (1)
1-41: LGTM! 2-node test variant properly configured.

The 2-node FSDP2/TP1 test script mirrors the 1-node Megatron variant with appropriate adjustments:
- NUM_NODES set to 2 for multi-node testing
- Stricter timing threshold (60s vs 80s), reflecting expected performance improvement with additional nodes
- Same validation metrics for consistency
Based on learnings
Add packed_seq_params change to get_topk_logits too Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
terrykong
left a comment
small comment. otherwise lgtm. do you mind including the convergence wandb/plots
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Added plots of the nightly tests in the top PR comment
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do?
Adds support for `nvidia/NVIDIA-Nemotron-Nano-9B-v2` and `nvidia/NVIDIA-Nemotron-Nano-12B-v2`.

Issues
List issues that this PR closes (syntax):
closes #1500
closes #1503
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Sample runs:
Summary by CodeRabbit
New Features
Bug Fixes
Tests