Add distributed checkpoint support for non-colocated MiMo #4020
yashaswikarnati merged 1 commit into NVIDIA:main
Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. See the contribution guide for more details.
/claude review
Review comment on the changed lines:

`info.optimizer.load_state_dict(module_sd)`

`def sharded_state_dict(self, model_sharded_state_dict, is_loading: bool = False, **kwargs):`
Critical bug: The old sharded_state_dict method (the simple 4-line version that was already in the file) still exists below this new implementation. In Python, the last definition of a method wins, so this entire new method is dead code — the old simple version at line ~231 silently shadows it.
The old method needs to be deleted for this PR to have any effect.
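The shadowing behavior the reviewer describes can be demonstrated with a minimal, self-contained sketch (class and method names here are illustrative, not the actual Megatron-LM code): when a class body defines the same method name twice, Python binds the name to the last definition and the earlier one becomes unreachable dead code.

```python
# Minimal demonstration of the shadowing bug described above.
# All names are hypothetical; only the Python semantics are the point.
class Checkpointer:
    def sharded_state_dict(self):
        # "new" TP-aware implementation added by the PR
        return {"version": "new", "tp_aware": True}

    def sharded_state_dict(self):
        # old simple version left in the file below the new one --
        # this later binding silently wins in the class namespace
        return {"version": "old"}


# The class dict holds only the last definition, so the new
# implementation is never called.
print(Checkpointer().sharded_state_dict())  # -> {'version': 'old'}
```

Deleting the later (old) definition is the fix, since class bodies are executed top to bottom and each `def` simply rebinds the method name.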
/claude review
Fixes three issues with dist_checkpointing in non-colocated MiMo:

1. `sharded_state_dict()` on MimoModel and ModalitySubmodules now injects `dp_cp_group` from each module's `pg_collection`, bypassing `parallel_state` global fallbacks that crash in non-colocated mode.
2. `MimoOptimizer.sharded_state_dict()` extracts `param_groups` and `grad_scaler` as `ShardedObject`s routed through distributed save, fixing the issue where `common.pt` is only written by global rank 0 (the encoder rank) and LLM optimizer metadata was lost.
3. ModalitySubmodules gains `sharded_state_dict()` for TP-aware checkpointing (previously all tensors were treated as TP-replicated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
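The first fix can be sketched as a pattern, independent of Megatron-LM internals: rather than letting sharding logic fall back to a process-global `parallel_state`, each module injects its own group from its `pg_collection` into the `sharded_state_dict()` kwargs. Everything below is a hypothetical mock (the class, the string "groups", and the returned dict shape are all assumptions for illustration, not the real API):

```python
# Hedged sketch of the per-module process-group injection pattern.
# Real Megatron-LM uses actual torch.distributed process groups and
# ShardedTensor metadata; strings stand in for both here.
class ModalitySubmodule:
    def __init__(self, name, pg_collection):
        self.name = name
        # per-module process groups, e.g. built from this module's rank set
        self.pg_collection = pg_collection

    def sharded_state_dict(self, prefix="", **kwargs):
        # Inject this module's own dp_cp_group so downstream sharding
        # never consults a global parallel_state fallback (which crashes
        # when encoder and LLM live on disjoint rank sets).
        kwargs.setdefault("dp_cp_group", self.pg_collection["dp_cp"])
        return {f"{prefix}{self.name}.weight": ("sharded", kwargs["dp_cp_group"])}


# Encoder and LLM carry different dp_cp groups; each entry in the merged
# state dict is tagged with the group of the module that produced it.
enc = ModalitySubmodule("encoder", {"dp_cp": "encoder_dp_cp"})
llm = ModalitySubmodule("llm", {"dp_cp": "llm_dp_cp"})
sd = {**enc.sharded_state_dict(), **llm.sharded_state_dict()}
```

The design point is that `setdefault` lets a caller that already holds a group pass it through, while standalone calls still get the module-local group instead of a global default.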
/ok to test 69a49c2
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23552500799
Summary
Stacked on #4019 (MimoOptimizer). Implements NMFW-33.
Fixes distributed checkpointing for non-colocated MiMo where encoder and LLM run on separate rank sets with per-module process groups:
- Inject `dp_cp_group` from each module's `pg_collection`, bypassing `parallel_state` global fallbacks that crash in non-colocated mode
- Extract `param_groups` and `grad_scaler` as `ShardedObject`s routed through distributed save, fixing the issue where `common.pt` is only written by global rank 0 (encoder rank) and LLM optimizer metadata was lost
- Use each module's `pg_collection` for the correct `tp_group`

Test plan
🤖 Generated with Claude Code