fix(mimo): adapt training layer for MCore submodule bump#2979

Merged
yaoyu-33 merged 1 commit into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild from yashaswikarnati:mimo/phase5-mcore-bump-fixes
Mar 25, 2026

Conversation

@yashaswikarnati
Contributor

Summary

  • Add module_output_ndim to MultiModulePipelineCommunicator for correct 2D/3D tensor routing (vision encoders produce 2D [S, H], LLM produces 3D [S, B, H])
  • Use MIMO_LANGUAGE_MODULE_KEY instead of removed role.language_module_name attribute in mimo_step.py and train_mimo.py
  • Remove language_module_key assertion from pretrain_mimo.py (removed from MimoModelConfig)
  • Clean up stale language_module_key / language_module_name references in test mocks

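The first bullet is the core shape fix: modules earlier in the pipeline can emit tensors of different ranks, and the communicator needs to know each module's output rank to route them into the LLM's expected `[S, B, H]` layout. A minimal sketch of that idea is below; the function name and the unsqueeze-based normalization are assumptions for illustration, not the actual `MultiModulePipelineCommunicator` implementation.

```python
import numpy as np


def route_module_output(tensor: np.ndarray, module_output_ndim: int) -> np.ndarray:
    """Normalize a submodule's output to the 3D [S, B, H] layout the LLM consumes.

    Hypothetical helper: per-module output rank is declared up front (the PR's
    ``module_output_ndim``) so the communicator can route 2D and 3D tensors.
    """
    if module_output_ndim == 2:
        # Vision encoders emit 2D [S, H]; insert a batch axis -> [S, 1, H].
        return np.expand_dims(tensor, axis=1)
    if module_output_ndim == 3:
        # The LLM already emits [S, B, H]; pass through unchanged.
        return tensor
    raise ValueError(f"unsupported module_output_ndim: {module_output_ndim}")
```

The point of declaring the rank per module (rather than inspecting `tensor.ndim` at runtime) is that pipeline communication typically needs shapes agreed on by both sender and receiver before the tensor arrives.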
Depends on: #2978 (phase 4 model layer fixes)

Test plan

  • Existing MIMO unit tests pass (159/162, 3 pre-existing failures in test_mimo_step)
  • E2e training test passes on 8 GPUs (torchrun --nproc_per_node=8 tests/e2e/mimo/test_mimo_training_e2e.py)

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot bot commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yashaswikarnati force-pushed the mimo/phase5-mcore-bump-fixes branch 3 times, most recently from 339f237 to 8959b16 on March 25, 2026 at 07:29
@aroshanghias-nvd force-pushed the mimo/phase5-checkpointing-rebuild branch from cd3f7fc to e4d2fdf on March 25, 2026 at 17:01
@aroshanghias-nvd force-pushed the mimo/phase5-mcore-bump-fixes branch from 8959b16 to 339d640 on March 25, 2026 at 17:09
@yaoyu-33 added the area:model (Model implementations and HF bridge logic), area:training (Training loop, callbacks, and runtime integration), and bug (Something isn't working) labels on Mar 25, 2026
@yaoyu-33 merged commit 28aa989 into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild on Mar 25, 2026
2 checks passed
@aroshanghias-nvd
Contributor

cross_entropy_loss_fusion=True — Is this required by the new MCore, or is it an independent improvement? It's only added in the checkpoint resume test's _make_language_config(), not in the other e2e tests (test_mimo_training_e2e.py, test_mimo_training_llava.py). Should those tests also get this flag for consistency?
