fix(mimo): adapt training layer for MCore submodule bump#2979

Merged
yaoyu-33 merged 1 commit into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild from yashaswikarnati:mimo/phase5-mcore-bump-fixes
Mar 25, 2026

Conversation

@yashaswikarnati
Contributor

Summary

  • Add module_output_ndim to MultiModulePipelineCommunicator for correct 2D/3D tensor routing (vision encoders produce 2D [S, H], LLM produces 3D [S, B, H])
  • Use MIMO_LANGUAGE_MODULE_KEY instead of removed role.language_module_name attribute in mimo_step.py and train_mimo.py
  • Remove language_module_key assertion from pretrain_mimo.py (removed from MimoModelConfig)
  • Clean up stale language_module_key / language_module_name references in test mocks

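The first bullet is the core shape fix: modules earlier in the pipeline can emit tensors of different ranks, and the communicator needs to know each module's output rank to route them into the LLM's expected `[S, B, H]` layout. A minimal sketch of that idea is below; the function name and the unsqueeze-based normalization are assumptions for illustration, not the actual `MultiModulePipelineCommunicator` implementation.

```python
import numpy as np


def route_module_output(tensor: np.ndarray, module_output_ndim: int) -> np.ndarray:
    """Normalize a submodule's output to the 3D [S, B, H] layout the LLM consumes.

    Hypothetical helper: per-module output rank is declared up front (the PR's
    ``module_output_ndim``) so the communicator can route 2D and 3D tensors.
    """
    if module_output_ndim == 2:
        # Vision encoders emit 2D [S, H]; insert a batch axis -> [S, 1, H].
        return np.expand_dims(tensor, axis=1)
    if module_output_ndim == 3:
        # The LLM already emits [S, B, H]; pass through unchanged.
        return tensor
    raise ValueError(f"unsupported module_output_ndim: {module_output_ndim}")
```

The point of declaring the rank per module (rather than inspecting `tensor.ndim` at runtime) is that pipeline communication typically needs shapes agreed on by both sender and receiver before the tensor arrives.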
Depends on: #2978 (phase 4 model layer fixes)

Test plan

  • Existing MIMO unit tests pass (159/162, 3 pre-existing failures in test_mimo_step)
  • E2e training test passes on 8 GPUs (torchrun --nproc_per_node=8 tests/e2e/mimo/test_mimo_training_e2e.py)

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot bot commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yashaswikarnati force-pushed the mimo/phase5-mcore-bump-fixes branch 3 times, most recently from 339f237 to 8959b16 on March 25, 2026 at 07:29
@aroshanghias-nvd force-pushed the mimo/phase5-checkpointing-rebuild branch from cd3f7fc to e4d2fdf on March 25, 2026 at 17:01
@aroshanghias-nvd force-pushed the mimo/phase5-mcore-bump-fixes branch from 8959b16 to 339d640 on March 25, 2026 at 17:09
@yaoyu-33 added the area:model (Model implementations and HF bridge logic), area:training (Training loop, callbacks, and runtime integration), and bug (Something isn't working) labels on Mar 25, 2026
@yaoyu-33 merged commit 28aa989 into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild on Mar 25, 2026
2 checks passed
@aroshanghias-nvd
Contributor

cross_entropy_loss_fusion=True — Is this required by the new MCore, or is it an independent improvement? It's only added in the checkpoint resume test's _make_language_config(), not in the other e2e tests (test_mimo_training_e2e.py, test_mimo_training_llava.py). Should those tests also get this flag for consistency?
