cp: fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0#1454
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
📝 Walkthrough

This PR removes the …

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit/models/policy/test_megatron_worker.py (1)
2435-2446: Consider adding `strict=True` to `zip()` for robustness.

While the assertion at lines 2431-2433 already ensures matching lengths, adding `strict=True` to the `zip()` call would provide an additional safeguard and address the static analysis warning. Apply this diff:
```diff
- for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm)):
+ for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm, strict=True)):
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/configs/distillation_math.yaml (0 hunks)
- examples/configs/distillation_math_megatron.yaml (0 hunks)
- examples/configs/dpo.yaml (0 hunks)
- examples/configs/grpo_math_1B.yaml (0 hunks)
- examples/configs/grpo_math_1B_megatron.yaml (0 hunks)
- examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml (1 hunks)
- examples/configs/rm.yaml (0 hunks)
- examples/configs/sft.yaml (0 hunks)
- examples/configs/sft_openmathinstruct2_megatron.yaml (0 hunks)
- examples/configs/vlm_grpo_3B.yaml (0 hunks)
- examples/configs/vlm_grpo_3B_megatron.yaml (0 hunks)
- nemo_rl/models/policy/__init__.py (1 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (3 hunks)
- tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh (1 hunks)
- tests/test_suites/nightly.txt (1 hunks)
- tests/unit/models/generation/test_vllm_generation.py (0 hunks)
- tests/unit/models/policy/test_megatron_worker.py (2 hunks)
- tools/refit_verifier.py (0 hunks)
💤 Files with no reviewable changes (12)
- examples/configs/distillation_math.yaml
- examples/configs/vlm_grpo_3B.yaml
- examples/configs/sft_openmathinstruct2_megatron.yaml
- examples/configs/vlm_grpo_3B_megatron.yaml
- examples/configs/rm.yaml
- tools/refit_verifier.py
- examples/configs/grpo_math_1B_megatron.yaml
- examples/configs/grpo_math_1B.yaml
- examples/configs/dpo.yaml
- tests/unit/models/generation/test_vllm_generation.py
- examples/configs/distillation_math_megatron.yaml
- examples/configs/sft.yaml
🧰 Additional context used
📓 Path-based instructions (10)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- nemo_rl/models/policy/__init__.py
- nemo_rl/models/policy/megatron_policy_worker.py
- tests/unit/models/policy/test_megatron_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
- nemo_rl/models/policy/__init__.py
- nemo_rl/models/policy/megatron_policy_worker.py
tests/test_suites/nightly.txt
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
tests/test_suites/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt
Files:
- tests/test_suites/nightly.txt
- tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name
Files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml
Files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/**/*.{yaml,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script
Files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place recipe YAMLs under examples/configs/recipes/<domain>/
Files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use `uv run` to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling `python` directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts
Files:
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
tests/test_suites/llm/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension
Files:
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
🧠 Learnings (13)
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).
Applied to files:
nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Applied to files:
nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/** : Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/llm/*.sh : LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension
Applied to files:
- tests/test_suites/nightly.txt
- tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.{yaml,sh} : Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script
Applied to files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name
Applied to files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml
Applied to files:
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/**/*.{sh} : For new model support, add a matching driver shell script under tests/test_suites/<domain>/ that sources common.env and invokes 'uv run ... --config <yaml>'
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.
Applied to files:
nemo_rl/models/policy/megatron_policy_worker.py
🧬 Code graph analysis (1)
tests/unit/models/policy/test_megatron_worker.py (7)
- tests/unit/conftest.py (1): tiny_llama_model_path (456-480)
- nemo_rl/distributed/batched_data_dict.py (2): to (825-832), BatchedDataDict (75-860)
- nemo_rl/distributed/virtual_cluster.py (2): RayVirtualCluster (186-505), shutdown (477-496)
- nemo_rl/models/policy/lm_policy.py (3): prepare_for_training (652-655), train (465-558), shutdown (753-760)
- nemo_rl/algorithms/loss_functions.py (1): NLLLoss (427-503)
- nemo_rl/models/policy/megatron_policy_worker.py (3): prepare_for_training (2009-2032), train (905-1155), shutdown (2235-2240)
- nemo_rl/distributed/worker_groups.py (2): run_all_workers_single_data (728-772), shutdown (930-1004)
🪛 Ruff (0.14.3)
tests/unit/models/policy/test_megatron_worker.py
2435-2435: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
[warning] 9-9: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 12-12: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 13-13: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 19-19: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 32-32: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (17)
nemo_rl/models/policy/__init__.py (2)
84-84: LGTM: Type widening allows explicit None values.

Widening `lr_decay_iters` to allow `None` is appropriate and aligns with actual usage patterns throughout the codebase (e.g., test configs and the new SFT recipe at line 30 of `sft-qwen2.5-math7b-2n8g-megatron.yaml`).
89-95: LGTM: Removal of `average_in_collective` prevents misconfiguration.

Removing `average_in_collective` from the `MegatronDDPConfig` schema is appropriate since this field is now set programmatically in `megatron_policy_worker.py` (lines 713-715) based on the `calculate_per_token_loss` setting. This prevents user misconfiguration and aligns with the PR's objective to match dtensor backend behavior.

tests/test_suites/nightly.txt (1)
69-70: LGTM: New nightly test for TP/DP validation.

The addition of the SFT Qwen2.5-Math-7B Megatron test with the "validate TP/DP" comment is appropriate and aligns with the PR's focus on validating gradient norm consistency across tensor and data parallelism configurations.
examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml (3)
1-20: LGTM: Megatron configuration with MoE stabilization.

The configuration appropriately sets up Megatron with TP=4, CP=2 for a 2-node 8-GPU cluster. The MoE stabilization settings (frozen router, fp64 precision, zero bias update rate) align with the approach documented in `megatron_policy_worker.py` (lines 571-583).
21-32: LGTM: Optimizer and scheduler configuration.

The optimizer and scheduler settings are consistent: using a constant learning rate (lr == min_lr) with `lr_decay_iters: null` is appropriate. The use of a `null` value is now valid after the schema change in `nemo_rl/models/policy/__init__.py` line 84.
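The constant-schedule setup described here can be sketched as a config fragment (the lr value and exact key nesting are illustrative, following the review text rather than the actual recipe):

```yaml
# Illustrative fragment only; the lr value is made up.
# Constant schedule: min_lr equals lr, and lr_decay_iters is
# explicitly null (valid now that the schema allows None).
lr: 2.0e-5
min_lr: 2.0e-5
lr_decay_iters: null
```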
33-53: LGTM: Data and logging configuration.

The data configuration (openmathinstruct2 dataset with math prompts) and logging setup (wandb, tensorboard, mlflow) are appropriate for an SFT training recipe.
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh (3)
1-14: LGTM: Standard test script header with configuration.

The script header follows the established pattern for test scripts in this repository. The variables flagged as unused by shellcheck (`NUM_NODES`, `NUM_RUNS`, `NUM_MINUTES`) are consumed by external launch tooling per established project conventions.
16-33: LGTM: Main execution block follows repository conventions.

The script execution block follows the established pattern. The unquoted `$@` on line 32 is consistent with established repository conventions.
35-43: LGTM: Metrics validation with appropriate thresholds.

The metrics validation section follows the standard pattern. The loss thresholds (train/loss < 0.301, val_loss < 0.304) are reasonable for an 80-step SFT training run.
nemo_rl/models/policy/megatron_policy_worker.py (4)
672-679: LGTM: Essential settings for correct gradient computation.

The comment clearly explains why `calculate_per_token_loss=True` and `perform_initialization=True` are necessary:
- No gradient scaling in mcore requires nemo-rl to handle it
- Ensures correct tensor parallel attributes are set on TP-sharded parameters
This aligns with the broader PR changes to match dtensor backend behavior.
680-686: LGTM: Defensive assertion against known Megatron-LM bug.

The assertion prevents MoE auxiliary loss usage, which is affected by a known Megatron-LM bug. The GitHub issue link provides good context for this limitation.
713-715: LGTM: Required setting for calculate_per_token_loss.

Setting `average_in_collective=False` is necessary when `calculate_per_token_loss=True` to avoid an assertion error in mcore. This programmatic control (vs. user configuration) prevents misconfiguration, which is why `average_in_collective` was removed from the config schema.
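To see why the two reduction modes cannot be mixed, consider a toy data-parallel example (pure Python with made-up numbers; not the actual mcore reduction code): averaging per-rank mean losses weights every rank equally, while per-token averaging weights every token equally, and the two disagree as soon as ranks hold different numbers of valid tokens.

```python
# Made-up per-token losses on two hypothetical data-parallel ranks.
rank0_losses = [2.0, 2.0, 2.0]  # 3 valid tokens
rank1_losses = [4.0]            # 1 valid token

# average_in_collective-style: each rank averages locally, then the
# collective averages the per-rank means (every rank weighted equally).
per_rank_means = [
    sum(rank0_losses) / len(rank0_losses),  # 2.0
    sum(rank1_losses) / len(rank1_losses),  # 4.0
]
avg_of_means = sum(per_rank_means) / len(per_rank_means)

# calculate_per_token_loss-style: sum all losses, divide by the global
# token count (every token weighted equally).
total_loss = sum(rank0_losses) + sum(rank1_losses)
total_tokens = len(rank0_losses) + len(rank1_losses)
per_token_avg = total_loss / total_tokens

print(avg_of_means)   # 3.0
print(per_token_avg)  # 2.5 -- the reductions disagree with uneven token counts
```

This is why controlling the flag programmatically, rather than leaving it to user configuration, makes sense: only one of the two reductions is compatible with per-token loss.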
2256-2298: LGTM: Useful diagnostic method for TP attribute validation.

The new `check_tensor_parallel_attributes` method provides a clean way to inspect and validate tensor parallel attributes on model parameters. This is used effectively in the new test `test_megatron_gradient_norm_consistency_across_parallelism` to verify correct TP sharding.

tests/unit/models/policy/test_megatron_worker.py (4)
19-21: LGTM: Import additions support new test.

The `numpy` and `ray` imports are appropriately added to support the new gradient norm consistency test (numpy for loss comparisons at line 2449, ray for `ray.get()` calls at line 2377).
149-154: LGTM: Test config updated to match schema change.

Removing `average_in_collective` from the test config is consistent with its removal from the `MegatronDDPConfig` schema. The field is now set programmatically based on `calculate_per_token_loss`.
2270-2402: LGTM: Comprehensive test for gradient norm consistency.

The test thoroughly validates gradient norm consistency across different parallelism configurations (DP1TP1, DP2, TP2) using:
- Reproducible data with fixed seed
- Training with NLLLoss
- Validation of gradient norms and losses
- Verification of tensor parallel attributes when TP > 1
The test properly cleans up resources after each configuration.
2404-2463: LGTM: Thorough cross-configuration comparison.

The comparison logic properly validates gradient norm and loss consistency across configurations with appropriate tolerances (1% relative or 1e-6 absolute). The assertions provide clear error messages with detailed difference metrics.
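The "1% relative or 1e-6 absolute" tolerance described above maps naturally onto `math.isclose`, which passes when either tolerance is satisfied (a sketch only; the test's actual comparison code may differ):

```python
import math

def norms_match(a: float, b: float) -> bool:
    """True if a and b agree within 1% relative OR 1e-6 absolute tolerance."""
    return math.isclose(a, b, rel_tol=0.01, abs_tol=1e-6)

print(norms_match(10.0, 10.05))  # True: 0.5% relative difference
print(norms_match(0.0, 5e-7))    # True: rel_tol is useless near zero; abs_tol covers it
print(norms_match(10.0, 10.5))   # False: 5% off exceeds both tolerances
```

The absolute fallback matters for gradient norms that are legitimately near zero, where any relative tolerance would reject harmless floating-point noise.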
fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0
Signed-off-by: Terry Kong <terryk@nvidia.com>
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Configuration Changes
Removed the `average_in_collective` flag from distributed data parallel configurations across multiple example configs.