
cp: fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0#1454

Merged
terrykong merged 3 commits into r0.4.0 from ashors/cherry-pick-1389
Nov 4, 2025
Conversation


@ashors1 ashors1 commented Oct 31, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added SFT training recipe configuration for Qwen 2.5 Math 7B model with Megatron distributed setup (2 nodes, 8 GPUs).
    • Added new test suite for SFT training validation.
    • Added tensor parallelism attribute verification utility.
  • Configuration Changes

    • Removed average_in_collective flag from distributed data parallel configurations across multiple example configs.
    • Updated gradient handling and model initialization settings.

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
@ashors1 ashors1 requested review from a team as code owners October 31, 2025 22:30
@ashors1 ashors1 requested a review from terrykong October 31, 2025 22:30
@ashors1 ashors1 changed the title cherry-pick: fix: Fixes to make Megatron backend match dtensor fix: cherry-pick: Fixes to make Megatron backend match dtensor Nov 3, 2025
@chtruong814 chtruong814 added the CI:L1 Run doctests, unit tests, and functional tests label Nov 3, 2025

coderabbitai bot commented Nov 3, 2025

📝 Walkthrough

This PR removes the average_in_collective configuration option from distributed data parallel settings across multiple YAML configs and the MegatronDDPConfig schema. It adjusts MegatronPolicyWorker initialization to set average_in_collective=False when calculate_per_token_loss=True, adds a tensor parallel attributes checker method, and introduces new test cases validating gradient norm consistency across parallelism configurations.
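The conditional described above can be sketched as follows. Note that `DDPConfig` and `build_ddp_config` are simplified stand-ins for illustration (the real DDP config class lives in Megatron core), not the PR's actual code:

```python
from dataclasses import dataclass

# Minimal stand-in for Megatron's DDP config; field names mirror the
# options discussed in this PR, but this is not the real class.
@dataclass
class DDPConfig:
    average_in_collective: bool = True

def build_ddp_config(calculate_per_token_loss: bool) -> DDPConfig:
    cfg = DDPConfig()
    if calculate_per_token_loss:
        # When per-token loss scaling is handled by the caller, gradients
        # must be summed (not averaged) in the collective, so the worker
        # forces this flag off instead of exposing it in YAML.
        cfg.average_in_collective = False
    return cfg

print(build_ddp_config(True).average_in_collective)   # False
print(build_ddp_config(False).average_in_collective)  # True
```

Setting the flag programmatically rather than in YAML removes an opportunity for the two options to be configured inconsistently.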

Changes

  • Config files removing average_in_collective
    Files: examples/configs/distillation_math.yaml, distillation_math_megatron.yaml, dpo.yaml, grpo_math_1B.yaml, grpo_math_1B_megatron.yaml, rm.yaml, sft.yaml, sft_openmathinstruct2_megatron.yaml, vlm_grpo_3B.yaml, vlm_grpo_3B_megatron.yaml
    Removed the average_in_collective: true setting from distributed_data_parallel_config or policy.megatron_cfg.distributed_data_parallel_config across 10 configuration files.
  • New SFT example configuration
    Files: examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
    Added a new end-to-end SFT training recipe with Megatron tensor/model parallelism, sequence packing, distributed logging (W&B, TensorBoard, MLflow), and optimizer/scheduler settings for a 7B-class model on 2 nodes with 8 GPUs per node.
  • Policy worker modifications
    Files: nemo_rl/models/policy/megatron_policy_worker.py
    Modified __init__ to enable calculate_per_token_loss and perform_initialization, add an MoE load-balancing auxiliary loss assertion, and explicitly set average_in_collective=False when applicable; added a check_tensor_parallel_attributes() helper method to collect tensor-parallel parameter metadata and distribution statistics.
  • Schema changes
    Files: nemo_rl/models/policy/__init__.py
    Removed the average_in_collective field from MegatronDDPConfig; widened MegatronSchedulerConfig.lr_decay_iters to accept int | None.
  • Unit test removals and additions
    Files: tests/unit/models/policy/test_megatron_worker.py
    Removed average_in_collective from create_megatron_test_config; added imports (numpy, ray); added test_megatron_gradient_norm_consistency_across_parallelism(), validating gradient norms and losses across DP1TP1, DP2, and TP2 configurations with cross-configuration comparisons.
  • Generation test updates
    Files: tests/unit/models/generation/test_vllm_generation.py
    Removed the average_in_collective flag from get_basic_megatron_test_config.
  • Test suite and harness additions
    Files: tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh, tests/test_suites/nightly.txt
    Added a new SFT integration test script with loss validation metrics (train/loss < 0.301, validation/val_loss < 0.304 at step 80) and registered the test in the nightly suite.
  • Tool configuration updates
    Files: tools/refit_verifier.py
    Removed the average_in_collective flag from the Megatron distributed data parallel configuration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • MegatronPolicyWorker.__init__ logic: Verify the conditional setting of average_in_collective=False and its interaction with calculate_per_token_loss, including the MoE auxiliary loss assertion
  • New tensor parallel attributes method: Ensure parameter traversal and metadata collection are comprehensive and correctly expose tensor-parallel distribution
  • Gradient norm consistency test: Validate test setup across three parallelism configurations (DP1TP1, DP2, TP2), cross-configuration tolerance thresholds, and parameter sharding verification
  • Config consistency: Confirm all YAML removals are uniform and don't break dependent configs or introduce inconsistent state

Possibly related PRs

Suggested labels

cherry-pick, Run CICD, r0.4.0, CI:L1

Suggested reviewers

  • terrykong
  • zpqiu

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Test Results For Major Changes: ⚠️ Warning. The pull request contains major breaking API changes and behavioral modifications affecting numerical computation, but test results are not documented in the PR description. Resolution: document test results, validation metrics, and numerical regression checks in the PR description; link related issues and explain the rationale for the breaking changes.
  • Title check: ⚠️ Warning. The title is a cherry-pick operation reference and does not clearly describe the main changes made in this PR (removal of the average_in_collective flag and related fixes). Resolution: revise the title to clearly describe the primary changes, such as 'Remove average_in_collective flag from Megatron configurations' or 'Fix Megatron backend to match dtensor behavior'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unit/models/policy/test_megatron_worker.py (1)

2435-2446: Consider adding strict=True to zip() for robustness.

While the assertion at lines 2431-2433 already ensures matching lengths, adding strict=True to the zip() call would provide an additional safeguard and address the static analysis warning.

Apply this diff:

-            for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm)):
+            for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm, strict=True)):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43f5f6a and 9748484.

📒 Files selected for processing (18)
  • examples/configs/distillation_math.yaml (0 hunks)
  • examples/configs/distillation_math_megatron.yaml (0 hunks)
  • examples/configs/dpo.yaml (0 hunks)
  • examples/configs/grpo_math_1B.yaml (0 hunks)
  • examples/configs/grpo_math_1B_megatron.yaml (0 hunks)
  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml (1 hunks)
  • examples/configs/rm.yaml (0 hunks)
  • examples/configs/sft.yaml (0 hunks)
  • examples/configs/sft_openmathinstruct2_megatron.yaml (0 hunks)
  • examples/configs/vlm_grpo_3B.yaml (0 hunks)
  • examples/configs/vlm_grpo_3B_megatron.yaml (0 hunks)
  • nemo_rl/models/policy/__init__.py (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (3 hunks)
  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh (1 hunks)
  • tests/test_suites/nightly.txt (1 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (0 hunks)
  • tests/unit/models/policy/test_megatron_worker.py (2 hunks)
  • tools/refit_verifier.py (0 hunks)
💤 Files with no reviewable changes (12)
  • examples/configs/distillation_math.yaml
  • examples/configs/vlm_grpo_3B.yaml
  • examples/configs/sft_openmathinstruct2_megatron.yaml
  • examples/configs/vlm_grpo_3B_megatron.yaml
  • examples/configs/rm.yaml
  • tools/refit_verifier.py
  • examples/configs/grpo_math_1B_megatron.yaml
  • examples/configs/grpo_math_1B.yaml
  • examples/configs/dpo.yaml
  • tests/unit/models/generation/test_vllm_generation.py
  • examples/configs/distillation_math_megatron.yaml
  • examples/configs/sft.yaml
🧰 Additional context used
📓 Path-based instructions (10)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/megatron_policy_worker.py
  • tests/unit/models/policy/test_megatron_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/megatron_policy_worker.py
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/<domain>/

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
🧠 Learnings (13)
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).

Applied to files:

  • nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults

Applied to files:

  • nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/** : Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/llm/*.sh : LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.{yaml,sh} : Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
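The NUM_RUNS formula quoted in the learning above is the standard integer ceiling-division trick, (a + b - 1) / b. A minimal sketch with illustrative values (not taken from the actual test script):

```shell
# Ceiling division: how many STEPS_PER_RUN-sized runs cover MAX_STEPS.
STEPS_PER_RUN=25
MAX_STEPS=80
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"  # 4: three full 25-step runs plus one partial run
```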
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/**/*.{sh} : For new model support, add a matching driver shell script under tests/test_suites/<domain>/ that sources common.env and invokes 'uv run ... --config <yaml>'

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.

Applied to files:

  • nemo_rl/models/policy/megatron_policy_worker.py
🧬 Code graph analysis (1)
tests/unit/models/policy/test_megatron_worker.py (7)
tests/unit/conftest.py (1)
  • tiny_llama_model_path (456-480)
nemo_rl/distributed/batched_data_dict.py (2)
  • to (825-832)
  • BatchedDataDict (75-860)
nemo_rl/distributed/virtual_cluster.py (2)
  • RayVirtualCluster (186-505)
  • shutdown (477-496)
nemo_rl/models/policy/lm_policy.py (3)
  • prepare_for_training (652-655)
  • train (465-558)
  • shutdown (753-760)
nemo_rl/algorithms/loss_functions.py (1)
  • NLLLoss (427-503)
nemo_rl/models/policy/megatron_policy_worker.py (3)
  • prepare_for_training (2009-2032)
  • train (905-1155)
  • shutdown (2235-2240)
nemo_rl/distributed/worker_groups.py (2)
  • run_all_workers_single_data (728-772)
  • shutdown (930-1004)
🪛 Ruff (0.14.3)
tests/unit/models/policy/test_megatron_worker.py

2435-2435: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh

[warning] 9-9: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 19-19: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 32-32: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (17)
nemo_rl/models/policy/__init__.py (2)

84-84: LGTM: Type widening allows explicit None values.

Widening lr_decay_iters to allow None is appropriate and aligns with actual usage patterns throughout the codebase (e.g., test configs and the new SFT recipe at line 30 of sft-qwen2.5-math7b-2n8g-megatron.yaml).


89-95: LGTM: Removal of average_in_collective prevents misconfiguration.

Removing average_in_collective from the MegatronDDPConfig schema is appropriate since this field is now set programmatically in megatron_policy_worker.py (lines 713-715) based on the calculate_per_token_loss setting. This prevents user misconfiguration and aligns with the PR's objective to match dtensor backend behavior.

tests/test_suites/nightly.txt (1)

69-70: LGTM: New nightly test for TP/DP validation.

The addition of the SFT Qwen2.5-Math-7B Megatron test with the "validate TP/DP" comment is appropriate and aligns with the PR's focus on validating gradient norm consistency across tensor and data parallelism configurations.

examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml (3)

1-20: LGTM: Megatron configuration with MoE stabilization.

The configuration appropriately sets up Megatron with TP=4, CP=2 for a 2-node 8-GPU cluster. The MoE stabilization settings (frozen router, fp64 precision, zero bias update rate) align with the approach documented in megatron_policy_worker.py (lines 571-583).


21-32: LGTM: Optimizer and scheduler configuration.

The optimizer and scheduler settings are consistent: using a constant learning rate (lr == min_lr) with lr_decay_iters: null is appropriate. The null value is now valid after the schema change in nemo_rl/models/policy/__init__.py line 84.


33-53: LGTM: Data and logging configuration.

The data configuration (openmathinstruct2 dataset with math prompts) and logging setup (wandb, tensorboard, mlflow) are appropriate for an SFT training recipe.

tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh (3)

1-14: LGTM: Standard test script header with configuration.

The script header follows the established pattern for test scripts in this repository. The variables flagged as unused by shellcheck (NUM_NODES, NUM_RUNS, NUM_MINUTES) are consumed by external launch tooling per established project conventions.


16-33: LGTM: Main execution block follows repository conventions.

The script execution block follows the established pattern. The unquoted $@ on line 32 is consistent with repository conventions per established project patterns.


35-43: LGTM: Metrics validation with appropriate thresholds.

The metrics validation section follows the standard pattern. The loss thresholds (train/loss < 0.301, val_loss < 0.304) are reasonable for an 80-step SFT training run.

nemo_rl/models/policy/megatron_policy_worker.py (4)

672-679: LGTM: Essential settings for correct gradient computation.

The comment clearly explains why calculate_per_token_loss=True and perform_initialization=True are necessary:

  • No gradient scaling in mcore requires nemo-rl to handle it
  • Ensures correct tensor parallel attributes are set on TP-sharded parameters

This aligns with the broader PR changes to match dtensor backend behavior.


680-686: LGTM: Defensive assertion against known Megatron-LM bug.

The assertion prevents MoE auxiliary loss usage, which is affected by a known Megatron-LM bug. The GitHub issue link provides good context for this limitation.


713-715: LGTM: Required setting for calculate_per_token_loss.

Setting average_in_collective=False is necessary when calculate_per_token_loss=True to avoid an assertion error in mcore. This programmatic control (vs. user configuration) prevents misconfiguration, which is why average_in_collective was removed from the config schema.


2256-2298: LGTM: Useful diagnostic method for TP attribute validation.

The new check_tensor_parallel_attributes method provides a clean way to inspect and validate tensor parallel attributes on model parameters. This is used effectively in the new test test_megatron_gradient_norm_consistency_across_parallelism to verify correct TP sharding.
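A checker like the one described could be sketched as below. The attribute names (tensor_model_parallel, partition_dim) follow Megatron-LM's convention for marking TP-sharded parameters, but the Param stand-in and the exact statistics gathered are illustrative assumptions, not the PR's actual implementation:

```python
class Param:
    """Stand-in for a model parameter carrying Megatron-style TP attributes."""
    def __init__(self, name, shape, tensor_model_parallel=False, partition_dim=-1):
        self.name = name
        self.shape = shape
        self.tensor_model_parallel = tensor_model_parallel
        self.partition_dim = partition_dim

def check_tensor_parallel_attributes(params):
    """Collect counts of TP-sharded parameters and their partition dims."""
    stats = {"total": 0, "tp_sharded": 0, "by_partition_dim": {}}
    for p in params:
        stats["total"] += 1
        # getattr with a default: non-sharded params may lack the attribute.
        if getattr(p, "tensor_model_parallel", False):
            stats["tp_sharded"] += 1
            dim = getattr(p, "partition_dim", -1)
            stats["by_partition_dim"][dim] = stats["by_partition_dim"].get(dim, 0) + 1
    return stats

params = [
    Param("embedding.weight", (4096, 512), tensor_model_parallel=True, partition_dim=0),
    Param("attn.qkv.weight", (1536, 512), tensor_model_parallel=True, partition_dim=0),
    Param("mlp.down_proj.weight", (512, 2048), tensor_model_parallel=True, partition_dim=1),
    Param("norm.weight", (512,)),  # layer norms are replicated, not sharded
]
print(check_tensor_parallel_attributes(params))
# {'total': 4, 'tp_sharded': 3, 'by_partition_dim': {0: 2, 1: 1}}
```

A test can then assert that the expected fraction of parameters is sharded when TP > 1, which is how the new unit test reportedly uses the helper.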

tests/unit/models/policy/test_megatron_worker.py (4)

19-21: LGTM: Import additions support new test.

The numpy and ray imports are appropriately added to support the new gradient norm consistency test (numpy for loss comparisons at line 2449, ray for ray.get() calls at line 2377).


149-154: LGTM: Test config updated to match schema change.

Removing average_in_collective from the test config is consistent with its removal from the MegatronDDPConfig schema. The field is now set programmatically based on calculate_per_token_loss.


2270-2402: LGTM: Comprehensive test for gradient norm consistency.

The test thoroughly validates gradient norm consistency across different parallelism configurations (DP1TP1, DP2, TP2) using:

  • Reproducible data with fixed seed
  • Training with NLLLoss
  • Validation of gradient norms and losses
  • Verification of tensor parallel attributes when TP > 1

The test properly cleans up resources after each configuration.


2404-2463: LGTM: Thorough cross-configuration comparison.

The comparison logic properly validates gradient norm and loss consistency across configurations with appropriate tolerances (1% relative or 1e-6 absolute). The assertions provide clear error messages with detailed difference metrics.
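The "1% relative or 1e-6 absolute" tolerance maps directly onto math.isclose semantics (pass if either tolerance is satisfied). A minimal sketch of such a comparison, with illustrative values rather than the test's actual numbers:

```python
import math

def grad_norms_consistent(gn, ref_gn, rel_tol=0.01, abs_tol=1e-6):
    """True if the two norms agree within 1% relative OR 1e-6 absolute."""
    return math.isclose(gn, ref_gn, rel_tol=rel_tol, abs_tol=abs_tol)

print(grad_norms_consistent(1.0, 1.005))  # True: within 1% relative tolerance
print(grad_norms_consistent(1.0, 1.2))    # False: 20% apart
print(grad_norms_consistent(0.0, 5e-7))   # True: within absolute tolerance
```

The absolute floor matters for near-zero norms, where any relative tolerance alone would reject tiny floating-point differences.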

@terrykong terrykong enabled auto-merge (squash) November 4, 2025 07:12
@terrykong terrykong changed the title fix: cherry-pick: Fixes to make Megatron backend match dtensor cp: fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0 Nov 4, 2025
terrykong
terrykong previously approved these changes Nov 4, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 4, 2025
@terrykong terrykong merged commit 31722fe into r0.4.0 Nov 4, 2025
40 of 41 checks passed
@terrykong terrykong deleted the ashors/cherry-pick-1389 branch November 4, 2025 21:55