
cp: fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0#1454

Merged
terrykong merged 3 commits into r0.4.0 from ashors/cherry-pick-1389
Nov 4, 2025
Conversation


@ashors1 ashors1 commented Oct 31, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added SFT training recipe configuration for Qwen 2.5 Math 7B model with Megatron distributed setup (2 nodes, 8 GPUs).
    • Added new test suite for SFT training validation.
    • Added tensor parallelism attribute verification utility.
  • Configuration Changes

    • Removed average_in_collective flag from distributed data parallel configurations across multiple example configs.
    • Updated gradient handling and model initialization settings.

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
@ashors1 ashors1 requested review from a team as code owners October 31, 2025 22:30
@ashors1 ashors1 requested a review from terrykong October 31, 2025 22:30
@ashors1 ashors1 changed the title cherry-pick: fix: Fixes to make Megatron backend match dtensor fix: cherry-pick: Fixes to make Megatron backend match dtensor Nov 3, 2025
@chtruong814 chtruong814 added the CI:L1 Run doctests, unit tests, and functional tests label Nov 3, 2025

coderabbitai bot commented Nov 3, 2025

📝 Walkthrough

This PR removes the average_in_collective configuration option from distributed data parallel settings across multiple YAML configs and the MegatronDDPConfig schema. It adjusts MegatronPolicyWorker initialization to set average_in_collective=False when calculate_per_token_loss=True, adds a tensor parallel attributes checker method, and introduces new test cases validating gradient norm consistency across parallelism configurations.
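The conditional described above can be sketched as follows. Note that `DDPConfig` and `build_ddp_config` are simplified stand-ins for illustration (the real DDP config class lives in Megatron core), not the PR's actual code:

```python
from dataclasses import dataclass

# Minimal stand-in for Megatron's DDP config; field names mirror the
# options discussed in this PR, but this is not the real class.
@dataclass
class DDPConfig:
    average_in_collective: bool = True

def build_ddp_config(calculate_per_token_loss: bool) -> DDPConfig:
    cfg = DDPConfig()
    if calculate_per_token_loss:
        # When per-token loss scaling is handled by the caller, gradients
        # must be summed (not averaged) in the collective, so the worker
        # forces this flag off instead of exposing it in YAML.
        cfg.average_in_collective = False
    return cfg

print(build_ddp_config(True).average_in_collective)   # False
print(build_ddp_config(False).average_in_collective)  # True
```

Setting the flag programmatically rather than in YAML removes an opportunity for the two options to be configured inconsistently.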

Changes

  • Config files removing average_in_collective
    Files: examples/configs/distillation_math.yaml, distillation_math_megatron.yaml, dpo.yaml, grpo_math_1B.yaml, grpo_math_1B_megatron.yaml, rm.yaml, sft.yaml, sft_openmathinstruct2_megatron.yaml, vlm_grpo_3B.yaml, vlm_grpo_3B_megatron.yaml
    Removed the average_in_collective: true setting from distributed_data_parallel_config or policy.megatron_cfg.distributed_data_parallel_config across 10 configuration files.
  • New SFT example configuration
    Files: examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
    Added a new end-to-end SFT training recipe with Megatron tensor/model parallelism, sequence packing, distributed logging (W&B, TensorBoard, MLflow), and optimizer/scheduler settings for a 7B-class model on 2 nodes with 8 GPUs per node.
  • Policy worker modifications
    Files: nemo_rl/models/policy/megatron_policy_worker.py
    Modified __init__ to enable calculate_per_token_loss and perform_initialization, add an MoE load-balancing auxiliary loss assertion, and explicitly set average_in_collective=False when applicable; added a check_tensor_parallel_attributes() helper method to collect tensor-parallel parameter metadata and distribution statistics.
  • Schema changes
    Files: nemo_rl/models/policy/__init__.py
    Removed the average_in_collective field from MegatronDDPConfig; widened MegatronSchedulerConfig.lr_decay_iters to accept int | None.
  • Unit test removals and additions
    Files: tests/unit/models/policy/test_megatron_worker.py
    Removed average_in_collective from create_megatron_test_config; added imports (numpy, ray); added test_megatron_gradient_norm_consistency_across_parallelism(), validating gradient norms and losses across DP1TP1, DP2, and TP2 configurations with cross-configuration comparisons.
  • Generation test updates
    Files: tests/unit/models/generation/test_vllm_generation.py
    Removed the average_in_collective flag from get_basic_megatron_test_config.
  • Test suite and harness additions
    Files: tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh, tests/test_suites/nightly.txt
    Added a new SFT integration test script with loss validation metrics (train/loss < 0.301, validation/val_loss < 0.304 at step 80) and registered the test in the nightly suite.
  • Tool configuration updates
    Files: tools/refit_verifier.py
    Removed the average_in_collective flag from the Megatron distributed data parallel configuration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • MegatronPolicyWorker.__init__ logic: Verify the conditional setting of average_in_collective=False and its interaction with calculate_per_token_loss, including the MoE auxiliary loss assertion
  • New tensor parallel attributes method: Ensure parameter traversal and metadata collection are comprehensive and correctly expose tensor-parallel distribution
  • Gradient norm consistency test: Validate test setup across three parallelism configurations (DP1TP1, DP2, TP2), cross-configuration tolerance thresholds, and parameter sharding verification
  • Config consistency: Confirm all YAML removals are uniform and don't break dependent configs or introduce inconsistent state

Possibly related PRs

Suggested labels

cherry-pick, Run CICD, r0.4.0, CI:L1

Suggested reviewers

  • terrykong
  • zpqiu

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Test Results For Major Changes: ⚠️ Warning. The pull request contains major breaking API changes and behavioral modifications affecting numerical computation, but test results are not documented in the PR description. Resolution: document test results, validation metrics, and numerical regression checks in the PR description; link related issues and explain the rationale for the breaking changes.
  • Title check: ⚠️ Warning. The title is a cherry-pick operation reference and does not clearly describe the main changes made in this PR (removal of the average_in_collective flag and related fixes). Resolution: revise the title to clearly describe the primary changes, such as 'Remove average_in_collective flag from Megatron configurations' or 'Fix Megatron backend to match dtensor behavior'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unit/models/policy/test_megatron_worker.py (1)

2435-2446: Consider adding strict=True to zip() for robustness.

While the assertion at lines 2431-2433 already ensures matching lengths, adding strict=True to the zip() call would provide an additional safeguard and address the static analysis warning.

Apply this diff:

-            for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm)):
+            for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm, strict=True)):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43f5f6a and 9748484.

📒 Files selected for processing (18)
  • examples/configs/distillation_math.yaml (0 hunks)
  • examples/configs/distillation_math_megatron.yaml (0 hunks)
  • examples/configs/dpo.yaml (0 hunks)
  • examples/configs/grpo_math_1B.yaml (0 hunks)
  • examples/configs/grpo_math_1B_megatron.yaml (0 hunks)
  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml (1 hunks)
  • examples/configs/rm.yaml (0 hunks)
  • examples/configs/sft.yaml (0 hunks)
  • examples/configs/sft_openmathinstruct2_megatron.yaml (0 hunks)
  • examples/configs/vlm_grpo_3B.yaml (0 hunks)
  • examples/configs/vlm_grpo_3B_megatron.yaml (0 hunks)
  • nemo_rl/models/policy/__init__.py (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (3 hunks)
  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh (1 hunks)
  • tests/test_suites/nightly.txt (1 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (0 hunks)
  • tests/unit/models/policy/test_megatron_worker.py (2 hunks)
  • tools/refit_verifier.py (0 hunks)
💤 Files with no reviewable changes (12)
  • examples/configs/distillation_math.yaml
  • examples/configs/vlm_grpo_3B.yaml
  • examples/configs/sft_openmathinstruct2_megatron.yaml
  • examples/configs/vlm_grpo_3B_megatron.yaml
  • examples/configs/rm.yaml
  • tools/refit_verifier.py
  • examples/configs/grpo_math_1B_megatron.yaml
  • examples/configs/grpo_math_1B.yaml
  • examples/configs/dpo.yaml
  • tests/unit/models/generation/test_vllm_generation.py
  • examples/configs/distillation_math_megatron.yaml
  • examples/configs/sft.yaml
🧰 Additional context used
📓 Path-based instructions (10)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/megatron_policy_worker.py
  • tests/unit/models/policy/test_megatron_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/megatron_policy_worker.py
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/<domain>/

Files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
🧠 Learnings (13)
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).

Applied to files:

  • nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults

Applied to files:

  • nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/** : Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/llm/*.sh : LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.{yaml,sh} : Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
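The NUM_RUNS formula quoted in the learning above is the standard integer ceiling-division trick, (a + b - 1) / b. A minimal sketch with illustrative values (not taken from the actual test script):

```shell
# Ceiling division: how many STEPS_PER_RUN-sized runs cover MAX_STEPS.
STEPS_PER_RUN=25
MAX_STEPS=80
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"  # 4: three full 25-step runs plus one partial run
```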
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/**/*.{sh} : For new model support, add a matching driver shell script under tests/test_suites/<domain>/ that sources common.env and invokes 'uv run ... --config <yaml>'

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.

Applied to files:

  • nemo_rl/models/policy/megatron_policy_worker.py
🧬 Code graph analysis (1)
tests/unit/models/policy/test_megatron_worker.py (7)
tests/unit/conftest.py (1)
  • tiny_llama_model_path (456-480)
nemo_rl/distributed/batched_data_dict.py (2)
  • to (825-832)
  • BatchedDataDict (75-860)
nemo_rl/distributed/virtual_cluster.py (2)
  • RayVirtualCluster (186-505)
  • shutdown (477-496)
nemo_rl/models/policy/lm_policy.py (3)
  • prepare_for_training (652-655)
  • train (465-558)
  • shutdown (753-760)
nemo_rl/algorithms/loss_functions.py (1)
  • NLLLoss (427-503)
nemo_rl/models/policy/megatron_policy_worker.py (3)
  • prepare_for_training (2009-2032)
  • train (905-1155)
  • shutdown (2235-2240)
nemo_rl/distributed/worker_groups.py (2)
  • run_all_workers_single_data (728-772)
  • shutdown (930-1004)
🪛 Ruff (0.14.3)
tests/unit/models/policy/test_megatron_worker.py

2435-2435: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh

[warning] 9-9: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 12-12: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 13-13: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 19-19: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 32-32: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (17)
nemo_rl/models/policy/__init__.py (2)

84-84: LGTM: Type widening allows explicit None values.

Widening lr_decay_iters to allow None is appropriate and aligns with actual usage patterns throughout the codebase (e.g., test configs and the new SFT recipe at line 30 of sft-qwen2.5-math7b-2n8g-megatron.yaml).


89-95: LGTM: Removal of average_in_collective prevents misconfiguration.

Removing average_in_collective from the MegatronDDPConfig schema is appropriate since this field is now set programmatically in megatron_policy_worker.py (lines 713-715) based on the calculate_per_token_loss setting. This prevents user misconfiguration and aligns with the PR's objective to match dtensor backend behavior.

tests/test_suites/nightly.txt (1)

69-70: LGTM: New nightly test for TP/DP validation.

The addition of the SFT Qwen2.5-Math-7B Megatron test with the "validate TP/DP" comment is appropriate and aligns with the PR's focus on validating gradient norm consistency across tensor and data parallelism configurations.

examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml (3)

1-20: LGTM: Megatron configuration with MoE stabilization.

The configuration appropriately sets up Megatron with TP=4, CP=2 for a 2-node 8-GPU cluster. The MoE stabilization settings (frozen router, fp64 precision, zero bias update rate) align with the approach documented in megatron_policy_worker.py (lines 571-583).


21-32: LGTM: Optimizer and scheduler configuration.

The optimizer and scheduler settings are consistent: using a constant learning rate (lr == min_lr) with lr_decay_iters: null is appropriate. The null value is now valid after the schema change in nemo_rl/models/policy/__init__.py line 84.


33-53: LGTM: Data and logging configuration.

The data configuration (openmathinstruct2 dataset with math prompts) and logging setup (wandb, tensorboard, mlflow) are appropriate for an SFT training recipe.

tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh (3)

1-14: LGTM: Standard test script header with configuration.

The script header follows the established pattern for test scripts in this repository. The variables flagged as unused by shellcheck (NUM_NODES, NUM_RUNS, NUM_MINUTES) are consumed by external launch tooling per established project conventions.


16-33: LGTM: Main execution block follows repository conventions.

The script execution block follows the established pattern. The unquoted $@ on line 32 is consistent with repository conventions per established project patterns.


35-43: LGTM: Metrics validation with appropriate thresholds.

The metrics validation section follows the standard pattern. The loss thresholds (train/loss < 0.301, val_loss < 0.304) are reasonable for an 80-step SFT training run.

nemo_rl/models/policy/megatron_policy_worker.py (4)

672-679: LGTM: Essential settings for correct gradient computation.

The comment clearly explains why calculate_per_token_loss=True and perform_initialization=True are necessary:

  • No gradient scaling in mcore requires nemo-rl to handle it
  • Ensures correct tensor parallel attributes are set on TP-sharded parameters

This aligns with the broader PR changes to match dtensor backend behavior.


680-686: LGTM: Defensive assertion against known Megatron-LM bug.

The assertion prevents MoE auxiliary loss usage, which is affected by a known Megatron-LM bug. The GitHub issue link provides good context for this limitation.


713-715: LGTM: Required setting for calculate_per_token_loss.

Setting average_in_collective=False is necessary when calculate_per_token_loss=True to avoid an assertion error in mcore. This programmatic control (vs. user configuration) prevents misconfiguration, which is why average_in_collective was removed from the config schema.


2256-2298: LGTM: Useful diagnostic method for TP attribute validation.

The new check_tensor_parallel_attributes method provides a clean way to inspect and validate tensor parallel attributes on model parameters. This is used effectively in the new test test_megatron_gradient_norm_consistency_across_parallelism to verify correct TP sharding.
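A checker like the one described could be sketched as below. The attribute names (tensor_model_parallel, partition_dim) follow Megatron-LM's convention for marking TP-sharded parameters, but the Param stand-in and the exact statistics gathered are illustrative assumptions, not the PR's actual implementation:

```python
class Param:
    """Stand-in for a model parameter carrying Megatron-style TP attributes."""
    def __init__(self, name, shape, tensor_model_parallel=False, partition_dim=-1):
        self.name = name
        self.shape = shape
        self.tensor_model_parallel = tensor_model_parallel
        self.partition_dim = partition_dim

def check_tensor_parallel_attributes(params):
    """Collect counts of TP-sharded parameters and their partition dims."""
    stats = {"total": 0, "tp_sharded": 0, "by_partition_dim": {}}
    for p in params:
        stats["total"] += 1
        # getattr with a default: non-sharded params may lack the attribute.
        if getattr(p, "tensor_model_parallel", False):
            stats["tp_sharded"] += 1
            dim = getattr(p, "partition_dim", -1)
            stats["by_partition_dim"][dim] = stats["by_partition_dim"].get(dim, 0) + 1
    return stats

params = [
    Param("embedding.weight", (4096, 512), tensor_model_parallel=True, partition_dim=0),
    Param("attn.qkv.weight", (1536, 512), tensor_model_parallel=True, partition_dim=0),
    Param("mlp.down_proj.weight", (512, 2048), tensor_model_parallel=True, partition_dim=1),
    Param("norm.weight", (512,)),  # layer norms are replicated, not sharded
]
print(check_tensor_parallel_attributes(params))
# {'total': 4, 'tp_sharded': 3, 'by_partition_dim': {0: 2, 1: 1}}
```

A test can then assert that the expected fraction of parameters is sharded when TP > 1, which is how the new unit test reportedly uses the helper.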

tests/unit/models/policy/test_megatron_worker.py (4)

19-21: LGTM: Import additions support new test.

The numpy and ray imports are appropriately added to support the new gradient norm consistency test (numpy for loss comparisons at line 2449, ray for ray.get() calls at line 2377).


149-154: LGTM: Test config updated to match schema change.

Removing average_in_collective from the test config is consistent with its removal from the MegatronDDPConfig schema. The field is now set programmatically based on calculate_per_token_loss.


2270-2402: LGTM: Comprehensive test for gradient norm consistency.

The test thoroughly validates gradient norm consistency across different parallelism configurations (DP1TP1, DP2, TP2) using:

  • Reproducible data with fixed seed
  • Training with NLLLoss
  • Validation of gradient norms and losses
  • Verification of tensor parallel attributes when TP > 1

The test properly cleans up resources after each configuration.


2404-2463: LGTM: Thorough cross-configuration comparison.

The comparison logic properly validates gradient norm and loss consistency across configurations with appropriate tolerances (1% relative or 1e-6 absolute). The assertions provide clear error messages with detailed difference metrics.
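The "1% relative or 1e-6 absolute" tolerance maps directly onto math.isclose semantics (pass if either tolerance is satisfied). A minimal sketch of such a comparison, with illustrative values rather than the test's actual numbers:

```python
import math

def grad_norms_consistent(gn, ref_gn, rel_tol=0.01, abs_tol=1e-6):
    """True if the two norms agree within 1% relative OR 1e-6 absolute."""
    return math.isclose(gn, ref_gn, rel_tol=rel_tol, abs_tol=abs_tol)

print(grad_norms_consistent(1.0, 1.005))  # True: within 1% relative tolerance
print(grad_norms_consistent(1.0, 1.2))    # False: 20% apart
print(grad_norms_consistent(0.0, 5e-7))   # True: within absolute tolerance
```

The absolute floor matters for near-zero norms, where any relative tolerance alone would reject tiny floating-point differences.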

@terrykong terrykong enabled auto-merge (squash) November 4, 2025 07:12
@terrykong terrykong changed the title fix: cherry-pick: Fixes to make Megatron backend match dtensor cp: fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0 Nov 4, 2025
terrykong
terrykong previously approved these changes Nov 4, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 4, 2025
@terrykong terrykong merged commit 31722fe into r0.4.0 Nov 4, 2025
40 of 41 checks passed
@terrykong terrykong deleted the ashors/cherry-pick-1389 branch November 4, 2025 21:55