
[DO NOT MERGE] Combined MiMo non-colocated changes for MBridge integration #4022

Draft
yashaswikarnati wants to merge 3 commits into NVIDIA:main from yashaswikarnati:yash/mcore-mimo-combined

Conversation

@yashaswikarnati
Contributor

DO NOT MERGE — This is a temporary integration branch for MBridge PRs.

Combines all pending MiMo non-colocated MCore changes so MBridge can bump its MCore submodule to a single ref:

This branch will be closed once the individual PRs are merged to main.

Used by MBridge PRs:

_prepare_tensor_for_comm() always inserted a singleton dim at position
-1 (the end), regardless of dim_mapping. With SBH format, the bridge
operates on dim_mapping['b']=1, but after unsqueeze(-1), dim 1 is the
hidden dimension, not batch. This caused incorrect cat/split operations
when DP sizes differed between modules (fan-in/fan-out).

Fix: add tensor_ndim parameter to BridgeCommunicator. For 2D tensors
[B*S, H], batch is folded into dim 0, so fan-in/fan-out uses cat/split
at dim 0 directly — no unsqueeze/squeeze needed. Each bridge gets
tensor_ndim from module_output_ndim config in the communicator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
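
The tensor_ndim-aware fan-in/fan-out described above can be sketched roughly as follows (hypothetical helper names, numpy stand-ins for the actual BridgeCommunicator code):

```python
import numpy as np

# Hedged sketch, not the Megatron-Core implementation: for 2D [B*S, H]
# tensors the batch is folded into dim 0, so cat/split happens at dim 0
# directly with no unsqueeze/squeeze bookkeeping; only higher-rank
# layouts (e.g. SBH) use the mapped batch dim.

def fan_out(tensor, num_ranks, tensor_ndim, batch_dim):
    dim = 0 if tensor_ndim == 2 else batch_dim  # 2D: batch lives in dim 0
    return np.array_split(tensor, num_ranks, axis=dim)

def fan_in(chunks, tensor_ndim, batch_dim):
    dim = 0 if tensor_ndim == 2 else batch_dim
    return np.concatenate(chunks, axis=dim)

# 2D case: [B*S=8, H=16]; fan-out/fan-in round-trips at dim 0.
flat = np.arange(8 * 16, dtype=np.float32).reshape(8, 16)
assert np.array_equal(fan_in(fan_out(flat, 2, 2, None), 2, None), flat)

# 3D SBH case: [S=4, B=2, H=8]; batch_dim = dim_mapping['b'] = 1.
sbh = np.arange(4 * 2 * 8, dtype=np.float32).reshape(4, 2, 8)
assert np.array_equal(fan_in(fan_out(sbh, 2, 3, 1), 3, 1), sbh)
```
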
@copy-pr-bot

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yashaswikarnati yashaswikarnati force-pushed the yash/mcore-mimo-combined branch 2 times, most recently from e337e71 to f8cd5d9 Compare March 25, 2026 05:39
yashaswikarnati and others added 2 commits March 24, 2026 23:38
Fixes three issues with dist_checkpointing in non-colocated MiMo:

1. sharded_state_dict() on MimoModel and ModalitySubmodules now injects
   dp_cp_group from each module's pg_collection, bypassing parallel_state
   global fallbacks that crash in non-colocated mode.

2. MimoOptimizer.sharded_state_dict() extracts param_groups and
   grad_scaler as ShardedObjects routed through the distributed save,
   fixing the issue where common.pt was only written by global rank 0
   (the encoder rank), losing the LLM optimizer metadata.

3. ModalitySubmodules gains sharded_state_dict() for TP-aware
   checkpointing (previously all tensors were treated as TP-replicated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
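
Fix 2's idea can be sketched as follows (a hedged, self-contained sketch with a stand-in type; not the Megatron-Core ShardedObject API):

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical stand-in for a ShardedObject: an opaque payload keyed so the
# distributed save path carries it, instead of assuming global rank 0 (in
# non-colocated mode, an encoder rank with no LLM optimizer state) writes it
# to common.pt.
@dataclass
class ShardedObjectStub:
    key: str
    data: Any

def optimizer_sharded_state_dict(opt_state: dict, prefix: str = "optimizer."):
    """Wrap non-tensor optimizer metadata as per-key sharded objects so each
    module's ranks save their own metadata alongside their tensor shards."""
    sharded = {}
    for name in ("param_groups", "grad_scaler"):
        if name in opt_state:
            sharded[prefix + name] = ShardedObjectStub(prefix + name,
                                                       opt_state[name])
    return sharded

sd = optimizer_sharded_state_dict(
    {"param_groups": [{"lr": 1e-4}], "grad_scaler": {"scale": 2.0}})
```
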
@yashaswikarnati yashaswikarnati force-pushed the yash/mcore-mimo-combined branch from f8cd5d9 to 10b3ddd Compare March 25, 2026 06:41
