fix: Fixes to make Megatron backend match dtensor #1389
Conversation
Signed-off-by: ashors1 <ashors@nvidia.com>
📝 Walkthrough

The PR removes the …

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Training Config
    participant PolicyWorker as MegatronPolicyWorker
    participant Model as Model Config
    Config->>PolicyWorker: Initialize with policy settings
    PolicyWorker->>Model: Enable calculate_per_token_loss
    PolicyWorker->>Model: Enable perform_initialization
    alt calculate_per_token_loss is True
        PolicyWorker->>Model: Override average_in_collective = False
        Note over PolicyWorker,Model: Avoid MCore assertion errors
    else calculate_per_token_loss is False
        Note over PolicyWorker,Model: Use default average_in_collective
    end
    PolicyWorker->>PolicyWorker: check_tensor_parallel_attributes()
    Note over PolicyWorker: Inspect TP status, partition dims,<br/>shapes across model parameters
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/unit/models/policy/test_megatron_worker.py (1)

2436-2436: Consider adding an explicit `strict=True` to the `zip()` call.

While the test already asserts equal length on lines 2432-2434, adding `strict=True` to the `zip()` call would provide additional runtime enforcement and follows Python 3.10+ best practices. Since the coding guidelines target Python 3.12+, this parameter is available. Apply this diff:

```diff
- for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm)):
+ for i, (gn, ref_gn) in enumerate(zip(grad_norm, reference_grad_norm, strict=True)):
```

Based on static analysis hint from Ruff.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/configs/distillation_math.yaml (0 hunks)
- examples/configs/distillation_math_megatron.yaml (0 hunks)
- examples/configs/dpo.yaml (0 hunks)
- examples/configs/grpo_math_1B.yaml (0 hunks)
- examples/configs/grpo_math_1B_megatron.yaml (0 hunks)
- examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml (1 hunk)
- examples/configs/rm.yaml (0 hunks)
- examples/configs/sft.yaml (0 hunks)
- examples/configs/sft_openmathinstruct2_megatron.yaml (0 hunks)
- examples/configs/vlm_grpo_3B.yaml (0 hunks)
- examples/configs/vlm_grpo_3B_megatron.yaml (0 hunks)
- nemo_rl/models/policy/__init__.py (0 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (3 hunks)
- tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh (1 hunk)
- tests/test_suites/nightly.txt (1 hunk)
- tests/unit/models/generation/test_vllm_generation.py (0 hunks)
- tests/unit/models/policy/test_megatron_worker.py (2 hunks)
- tools/refit_verifier.py (0 hunks)
💤 Files with no reviewable changes (13)
- examples/configs/sft_openmathinstruct2_megatron.yaml
- examples/configs/dpo.yaml
- examples/configs/distillation_math.yaml
- examples/configs/vlm_grpo_3B_megatron.yaml
- examples/configs/vlm_grpo_3B.yaml
- examples/configs/grpo_math_1B.yaml
- tools/refit_verifier.py
- examples/configs/grpo_math_1B_megatron.yaml
- tests/unit/models/generation/test_vllm_generation.py
- examples/configs/sft.yaml
- examples/configs/distillation_math_megatron.yaml
- examples/configs/rm.yaml
- nemo_rl/models/policy/__init__.py
🧰 Additional context used
📓 Path-based instructions (10)
tests/test_suites/nightly.txt
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
tests/test_suites/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/models/policy/megatron_policy_worker.py
tests/unit/models/policy/test_megatron_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/models/policy/megatron_policy_worker.py
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name
Files:
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml
Files:
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
examples/configs/recipes/**/*.{yaml,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script
Files:
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
examples/configs/recipes/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place recipe YAMLs under examples/configs/recipes/<domain>/
Files:
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Useuv runto execute Python scripts in shell/driver scripts instead of activating virtualenvs and callingpythondirectly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts
Files:
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
tests/test_suites/llm/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension
Files:
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
🧠 Learnings (10)
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/** : Place driver shell scripts and common.env under tests/test_suites/<domain>/ and list nightly tests in tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/nightly.txt
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/llm/*.sh : LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension
Applied to files:
tests/test_suites/nightly.txt
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
PR: NVIDIA-NeMo/RL#1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.
Applied to files:
nemo_rl/models/policy/megatron_policy_worker.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml
Applied to files:
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.{yaml,sh} : Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script
Applied to files:
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
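The `NUM_RUNS` formula in the learning above is ceiling division; a quick sketch of the arithmetic (in Python for illustration, mirroring the shell expression):

```python
def num_runs(max_steps: int, steps_per_run: int) -> int:
    # Mirrors the shell arithmetic $(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )):
    # the number of runs of steps_per_run needed to cover max_steps.
    return (max_steps + steps_per_run - 1) // steps_per_run


assert num_runs(100, 40) == 3  # 40 + 40 + 20: a partial final run still counts
assert num_runs(80, 40) == 2   # exact multiple: no extra run
```

This is why the variables look "unused" to ShellCheck inside an individual script: the launch tooling consumes them externally.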
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/**/*.{sh} : For new model support, add a matching driver shell script under tests/test_suites/<domain>/ that sources common.env and invokes 'uv run ... --config <yaml>'
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
PR: NVIDIA-NeMo/RL#1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
🧬 Code graph analysis (1)
tests/unit/models/policy/test_megatron_worker.py (7)

- tests/unit/conftest.py (1): tiny_llama_model_path (500-524)
- nemo_rl/distributed/batched_data_dict.py (2): to (825-832), BatchedDataDict (75-860)
- nemo_rl/distributed/virtual_cluster.py (2): RayVirtualCluster (186-505), shutdown (477-496)
- nemo_rl/models/policy/lm_policy.py (4): Policy (59-809), prepare_for_training (652-655), train (465-558), shutdown (749-756)
- nemo_rl/algorithms/loss_functions.py (1): NLLLoss (379-455)
- nemo_rl/models/policy/megatron_policy_worker.py (3): prepare_for_training (2003-2026), train (899-1149), shutdown (2229-2234)
- nemo_rl/distributed/worker_groups.py (2): run_all_workers_single_data (728-772), shutdown (930-1004)
🪛 Ruff (0.14.2)
tests/unit/models/policy/test_megatron_worker.py
2436-2436: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh
[warning] 9-9: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 12-12: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 13-13: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 19-19: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 32-32: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (7)
examples/configs/recipes/llm/sft-qwen2.5-math-7b-megatron.yaml (1)

36-42: Data paths verified and correct.

Both examples/prompts/math.txt and the openmathinstruct2 dataset are available in the repository. The prompt file exists at the referenced path, and the dataset is implemented as OpenMathInstruct2Dataset with full integration in the data loading framework. No changes needed.

tests/test_suites/nightly.txt (1)

70-71: Nightly test entry correctly appended.

The new test path is properly added to the SFT/Megatron section with a descriptive comment. The path format (relative to tests/test_suites/) and placement follow established patterns in the file.

tests/test_suites/llm/sft-qwen2.5-math-7b-megatron.sh (1)

1-43: Verification complete; all dependencies satisfied.

All nine required variables and functions are properly defined in tests/test_suites/llm/common.env: PROJECT_ROOT, CONFIG_PATH, LOG_DIR, CKPT_DIR, EXP_NAME, RUN_LOG, JSON_METRICS, exit_if_max_steps_reached, and they are correctly initialized and exported. The script has no missing dependencies.

nemo_rl/models/policy/megatron_policy_worker.py (2)
707-709: Good addition of explanatory comment.

The comment explains why average_in_collective=False is required when calculate_per_token_loss=True, which addresses the past review feedback requesting clarification of this setting. The relationship between these two configuration options is now documented. Based on past review comments requesting descriptive explanations for this configuration.
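A minimal sketch of the kind of guard being reviewed here (the option names follow the review text, but this is an illustration, not the actual MegatronPolicyWorker code):

```python
def apply_loss_averaging_override(model_cfg: dict) -> dict:
    """Force average_in_collective off when per-token loss is enabled.

    Hypothetical sketch: Megatron-Core asserts against averaging gradients
    in the all-reduce collective while calculate_per_token_loss is True,
    so the two settings cannot be combined.
    """
    if model_cfg.get("calculate_per_token_loss"):
        model_cfg["average_in_collective"] = False
    return model_cfg


cfg = apply_loss_averaging_override(
    {"calculate_per_token_loss": True, "average_in_collective": True}
)
assert cfg["average_in_collective"] is False
```

Putting the override next to an explanatory comment, as the review notes, keeps the coupling between the two options discoverable.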
2250-2292: LGTM! Well-structured introspection utility.

The new check_tensor_parallel_attributes method provides a clear way to inspect tensor parallel attributes on model parameters. The implementation correctly:

- Iterates through all model parameters
- Captures relevant TP attributes (tensor_model_parallel, partition_dim, partition_stride)
- Returns a well-structured dictionary with comprehensive information
- Is properly tested in the test suite (see test_megatron_gradient_norm_consistency_across_parallelism)

This is a useful diagnostic tool for validating tensor parallel setup.
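The shape of such an introspection helper can be sketched as below. The attribute names follow Megatron-Core conventions mentioned in the review (tensor_model_parallel, partition_dim, partition_stride); the actual method in megatron_policy_worker.py may differ in signature and detail:

```python
from types import SimpleNamespace


def check_tensor_parallel_attributes(named_parameters) -> dict:
    """Collect tensor-parallel metadata for every named parameter.

    Returns a dict mapping parameter name to its shape and TP attributes,
    using safe defaults for parameters that are not TP-sharded.
    """
    report = {}
    for name, param in named_parameters:
        report[name] = {
            "shape": tuple(param.shape),
            "tensor_model_parallel": getattr(param, "tensor_model_parallel", False),
            "partition_dim": getattr(param, "partition_dim", -1),
            "partition_stride": getattr(param, "partition_stride", 1),
        }
    return report


# Stand-in for a column-parallel weight (sharded along dim 0).
fake_param = SimpleNamespace(
    shape=(512, 1024), tensor_model_parallel=True, partition_dim=0, partition_stride=1
)
report = check_tensor_parallel_attributes([("linear.weight", fake_param)])
assert report["linear.weight"]["partition_dim"] == 0
```

A report like this makes it cheap for a test to assert that TP > 1 actually marked the expected parameters as sharded.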
tests/unit/models/policy/test_megatron_worker.py (2)

19-21: LGTM! Imports support new test functionality.

The numpy import is used for numerical comparisons in the gradient norm consistency test (line 2450), and the ray import is used to retrieve results from distributed workers (line 2378). Both are necessary for the new test function.

2271-2465: Excellent comprehensive test for gradient norm consistency!

This test thoroughly validates that:

- Gradient norms are consistent across different parallelization strategies (DP, TP)
- Losses are consistent across configurations
- Tensor parallel attributes are correctly set on model parameters when TP > 1

The test design is sound:

- Uses reproducible test data (fixed seed)
- Tests three meaningful configurations: DP1TP1 (baseline), DP2, and TP2
- Uses float32 for stable gradient comparisons
- Validates tensor parallel attributes via the new check_tensor_parallel_attributes method
- Provides clear diagnostic output for debugging

The test directly validates the changes made in megatron_policy_worker.py related to per-token loss handling and tensor parallel setup.
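The consistency check described above boils down to a tolerance-based comparison of per-step grad norms against the DP1/TP1 baseline. An illustrative sketch (values are made up; the real test pulls norms from Megatron workers via ray):

```python
import numpy as np

baseline_grad_norms = np.array([0.481, 0.479, 0.475])  # DP1/TP1 reference run
dp2_grad_norms = np.array([0.481, 0.479, 0.475])       # DP2 run
tp2_grad_norms = np.array([0.48101, 0.479, 0.475])     # TP2 run, small fp noise

# Even in float32, reductions happen in a different order across parallel
# layouts, so the comparison uses a tolerance rather than exact equality.
for name, norms in [("DP2", dp2_grad_norms), ("TP2", tp2_grad_norms)]:
    assert np.allclose(norms, baseline_grad_norms, rtol=1e-3), (
        f"{name} grad norms diverged from baseline"
    )
```

The same pattern applies to the loss comparison across configurations.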
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information