
Add MimoOptimizer for heterogeneous parallelism#4019

Merged
yashaswikarnati merged 1 commit into NVIDIA:main from yashaswikarnati:yash/mimo-optimizer-pr
Mar 25, 2026

Conversation

@yashaswikarnati
Contributor

Summary

Replaces #3212, which was closed when its base branch pull-request/3211 was deleted after #3211 was merged.

Adds optimizer support for MIMO models where different modules (encoder, LLM) can have different DP/TP/PP configurations.

  • MimoOptimizer class managing per-module MegatronOptimizer instances
  • Global gradient norm via all_reduce MAX across module boundaries
  • Module-aware gradient clipping using the global norm
  • Module-keyed state dicts for checkpointing
  • intra_dist_opt group spans full module world ["tp","cp","ep","pp","dp"] matching standard Megatron's intra_distributed_optimizer_instance_group
  • Assert num_distributed_optimizer_instances == 1 (multi-instance not yet supported)
  • HyperCommGrid.is_current_rank_in_grid() helper
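
The pattern the bullets above describe can be sketched framework-agnostically. This is an illustrative sketch under stated assumptions, not the PR's actual implementation: `ModuleOptimizer`, its methods, and `all_reduce_max` are hypothetical stand-ins for the real per-module `MegatronOptimizer` instances and for `torch.distributed.all_reduce(op=ReduceOp.MAX)` across module boundaries; the sketch assumes the norm being reduced is an inf-norm, for which MAX is the natural reduction.

```python
class ModuleOptimizer:
    """Hypothetical stand-in for a per-module MegatronOptimizer."""

    def __init__(self, name, grads):
        self.name = name
        self.grads = grads  # flat list of this module's gradient values

    def local_grad_norm(self):
        # inf-norm of this module's gradients; taking MAX across
        # modules then yields the global inf-norm
        return max(abs(g) for g in self.grads)

    def clip_grads(self, clip_coeff):
        self.grads = [g * clip_coeff for g in self.grads]

    def state_dict(self):
        return {"grads": list(self.grads)}


def all_reduce_max(values):
    """Simulates torch.distributed.all_reduce with ReduceOp.MAX."""
    return max(values)


class MimoOptimizer:
    """Wraps one optimizer per module (e.g. encoder, llm)."""

    def __init__(self, optimizers):
        self.optimizers = {opt.name: opt for opt in optimizers}

    def get_global_grad_norm(self):
        # each module computes its own norm; a MAX all-reduce across
        # module boundaries produces one shared global norm
        return all_reduce_max(
            [o.local_grad_norm() for o in self.optimizers.values()]
        )

    def clip_grad_norm(self, max_norm):
        norm = self.get_global_grad_norm()
        coeff = min(1.0, max_norm / (norm + 1e-6))
        for opt in self.optimizers.values():
            opt.clip_grads(coeff)  # every module clips with the same global norm
        return norm

    def state_dict(self):
        # module-keyed state dicts for checkpointing
        return {name: opt.state_dict() for name, opt in self.optimizers.items()}
```

For example, wrapping an "encoder" optimizer holding gradients `[0.5, -2.0]` and an "llm" optimizer holding `[4.0, 1.0]` gives a global norm of 4.0, and `state_dict()` returns a dict keyed by module name.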

Test plan

  • Unit tests pass (test_mimo_optimizer.py)
  • 2-GPU integration test (test_baseline_2gpu)
  • 4-GPU integration test (test_lm_pp3_4gpu)
  • 8-GPU integration tests (test_encoder_tp2_llm_tp2_pp3_8gpu, test_full_pp_8gpu)
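
A minimal illustration of the invariant such tests can exercise (not the repo's actual test code, and the gradient values are made up): the MAX of per-module inf-norms equals the inf-norm over all modules' gradients combined, no matter how unevenly the parameters are partitioned across modules.

```python
def test_max_of_module_norms_is_global_inf_norm():
    # hypothetical gradient shards for two modules with different
    # parallelism layouts (sizes deliberately unequal)
    module_grads = {
        "encoder": [0.5, -2.0, 1.5],
        "llm": [4.0, -1.0, 0.25, 3.0],
    }
    per_module = {m: max(abs(g) for g in grads) for m, grads in module_grads.items()}
    # MAX across module boundaries, as an all_reduce(op=MAX) would compute
    global_norm = max(per_module.values())
    # reference: inf-norm over every gradient from every module
    flat = [g for grads in module_grads.values() for g in grads]
    assert global_norm == max(abs(g) for g in flat)


test_max_of_module_norms_is_global_inf_norm()
```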

🤖 Generated with Claude Code

@yashaswikarnati yashaswikarnati requested review from a team as code owners March 24, 2026 18:30
@copy-pr-bot

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft March 24, 2026 18:30
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@yashaswikarnati yashaswikarnati force-pushed the yash/mimo-optimizer-pr branch from 9659951 to 6a8ff0c Compare March 24, 2026 18:46
@yashaswikarnati yashaswikarnati marked this pull request as ready for review March 24, 2026 18:54
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 24, 2026 18:54
@svcnvidia-nemo-ci svcnvidia-nemo-ci added Final Review PR is in the "final review" stage complexity: medium labels Mar 24, 2026
@yashaswikarnati
Contributor Author

/claude review

@yashaswikarnati
Contributor Author

/ok to test e48ecde

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 24, 2026
@yashaswikarnati yashaswikarnati force-pushed the yash/mimo-optimizer-pr branch from e48ecde to 6b743fd Compare March 24, 2026 19:33
@yashaswikarnati
Contributor Author

/claude review

Contributor

@claude claude bot left a comment


LGTM

@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Mar 24, 2026
Adds optimizer support for MIMO models where different modules
(encoder, LLM) can have different DP/TP/PP configurations.

- MimoOptimizer class managing per-module MegatronOptimizer instances
- Global gradient norm via all_reduce MAX across module boundaries
- Module-aware gradient clipping using the global norm
- Module-keyed state dicts for checkpointing
- intra_dist_opt group spans full module world ["tp","cp","ep","pp","dp"]
  matching standard Megatron's intra_distributed_optimizer_instance_group
- Assert num_distributed_optimizer_instances == 1 (multi-instance not yet supported)
- HyperCommGrid.is_current_rank_in_grid() helper
- Optimizer integrated into existing 1F1B schedule tests (8-GPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yashaswikarnati
Contributor Author

/ok to test bfb508d

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23518683621

Merged via the queue into NVIDIA:main with commit d86ba0b Mar 25, 2026
64 checks passed
@yashaswikarnati yashaswikarnati deleted the yash/mimo-optimizer-pr branch March 25, 2026 00:53

Labels

Approved All necessary approvals have been made complexity: medium


4 participants