[DO NOT MERGE] Combined MiMo non-colocated changes for MBridge integration #4022
Draft
yashaswikarnati wants to merge 3 commits into NVIDIA:main from
_prepare_tensor_for_comm() always inserted a singleton dim at position -1 (the end), regardless of dim_mapping. With SBH format, the bridge operates on dim_mapping['b']=1, but after unsqueeze(-1), dim 1 is the hidden dimension, not batch. This caused incorrect cat/split operations when DP sizes differ between modules (fan-in/fan-out).

Fix: add a tensor_ndim parameter to BridgeCommunicator. For 2D tensors [B*S, H], batch is folded into dim 0, so fan-in/fan-out uses cat/split at dim 0 directly, with no unsqueeze/squeeze needed. Each bridge gets tensor_ndim from the module_output_ndim config in the communicator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
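The shape arithmetic behind this fix can be sketched without any framework. The snippet below is an illustration only, not the BridgeCommunicator code: the helpers `unsqueeze_last`, `cat_shape`, and `split_shape` are hypothetical names that model what `unsqueeze(-1)`, `cat`, and an even `split` do to tensor shapes. It assumes a 2D `[B*S, H]` output, the case the commit message singles out, and two source DP ranks feeding one destination rank.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def unsqueeze_last(shape: Shape) -> Shape:
    """Shape after appending a singleton dim, as unsqueeze(-1) would."""
    return shape + (1,)


def cat_shape(shapes: Sequence[Shape], dim: int) -> Shape:
    """Shape after concatenating along `dim`; all other dims must match."""
    first = shapes[0]
    for s in shapes[1:]:
        if len(s) != len(first):
            raise ValueError("rank mismatch")
        for i, (a, b) in enumerate(zip(first, s)):
            if i != dim and a != b:
                raise ValueError(f"size mismatch at dim {i}")
    out = list(first)
    out[dim] = sum(s[dim] for s in shapes)
    return tuple(out)


def split_shape(shape: Shape, dim: int, parts: int) -> List[Shape]:
    """Shapes after splitting evenly into `parts` chunks along `dim`."""
    if shape[dim] % parts != 0:
        raise ValueError("dim not divisible by number of parts")
    out = list(shape)
    out[dim] //= parts
    return [tuple(out) for _ in range(parts)]


# Assumed sizes for illustration: B=2, S=4, H=16, so each rank holds [B*S, H].
per_rank: Shape = (2 * 4, 16)
batch_dim = 1  # dim_mapping['b'] for SBH, as stated in the commit message

# Old behaviour: trailing singleton dim, then cat at the SBH batch index.
# Dim 1 of [B*S, H, 1] is hidden, so the hidden size doubles.
old_fan_in = cat_shape([unsqueeze_last(per_rank)] * 2, dim=batch_dim)

# Fixed behaviour for 2D tensors: batch is folded into dim 0, so cat there.
new_fan_in = cat_shape([per_rank] * 2, dim=0)

# Fan-out is the inverse: split along dim 0 back to per-rank shapes.
fan_out = split_shape(new_fan_in, dim=0, parts=2)

print(old_fan_in)  # (8, 32, 1)
print(new_fan_in)  # (16, 16)
print(fan_out)     # [(8, 16), (8, 16)]
```

With the old path the concatenation grows the hidden dimension, while concatenating at dim 0 grows the token dimension and splits back cleanly to the per-rank shape.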
e337e71 to f8cd5d9
Fixes three issues with dist_checkpointing in non-colocated MiMo:

1. sharded_state_dict() on MimoModel and ModalitySubmodules now injects dp_cp_group from each module's pg_collection, bypassing parallel_state global fallbacks that crash in non-colocated mode.
2. MimoOptimizer.sharded_state_dict() extracts param_groups and grad_scaler as ShardedObjects routed through distributed save, fixing the issue where common.pt is only written by global rank 0 (the encoder rank) and LLM optimizer metadata was lost.
3. ModalitySubmodules gains sharded_state_dict() for TP-aware checkpointing (previously all tensors were treated as TP-replicated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f8cd5d9 to 10b3ddd
Combines all pending MiMo non-colocated MCore changes so MBridge can bump its MCore submodule to a single ref:
This branch will be closed once the individual PRs are merged to main.
Used by MBridge PRs: