feat(mimo): Phase 5 - checkpoint save/resume, evaluation, e2e tests #2870

Open
aroshanghias-nvd wants to merge 5 commits into mimo/phase4-training-rebuild from mimo/phase5-checkpointing-rebuild
@aroshanghias-nvd

Summary

Adds checkpoint save/resume and evaluation support for MiMo models, stacked on the Phase 4 training PR (#2869).

  • Checkpoint save/resume: Wiring for heterogeneous MiMo models through checkpointing.py, pretrain_mimo, and train_mimo, with torch_dist format support and access-pattern validation bypass for nested DDP language model tensors in PP>1
  • Evaluation infrastructure: MiMo extensions to eval.py and distributed batch slicing (dp_utils.slice_batch_for_mimo) for evaluation across heterogeneous DP groups
  • DDP checkpoint support: set_input_tensor method proxying on DDP-wrapped language model for correct PP decoder input wiring during checkpoint resume
  • E2E test suite: training e2e tests, heterogeneous LLaVA training test, checkpoint save→resume round-trip test with parallelism config matrix, and parallelism test runner
  • Checkpoint unit tests: 1159 lines of checkpoint save/resume coverage
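The `set_input_tensor` proxying mentioned in the DDP bullet can be pictured with a small sketch. This is not the code from the PR: `LanguageModel`, `DDPWrapper`, and `attach_set_input_tensor_proxy` are hypothetical stand-ins, and plain Python lists stand in for tensors. The only idea taken from the description is that a wrapper forwards `set_input_tensor` to the module it wraps so the pipeline schedule can hand the decoder its input.

```python
class LanguageModel:
    """Hypothetical stand-in for a pipeline-parallel language model stage."""

    def __init__(self):
        self.input_tensor = None

    def set_input_tensor(self, tensor):
        # Store the activation received from the previous pipeline stage.
        self.input_tensor = tensor


class DDPWrapper:
    """Hypothetical stand-in for a DDP wrapper that keeps the model in `.module`."""

    def __init__(self, module):
        self.module = module


def attach_set_input_tensor_proxy(wrapper):
    """Expose `set_input_tensor` on the wrapper by forwarding to the wrapped module."""
    if hasattr(wrapper, "set_input_tensor"):
        return wrapper  # the wrapper already provides the method

    def set_input_tensor(tensor):
        # Forward the call unchanged to the wrapped language model.
        wrapper.module.set_input_tensor(tensor)

    wrapper.set_input_tensor = set_input_tensor
    return wrapper


wrapped = attach_set_input_tensor_proxy(DDPWrapper(LanguageModel()))
wrapped.set_input_tensor([1.0, 2.0])  # a list stands in for a tensor here
```

Without such a proxy, a caller that only sees the wrapper would have no `set_input_tensor` to call, which is the wiring problem the bullet describes for PP>1.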

Validation

  • 162 unit tests passed (Phase 4 regression + Phase 5 checkpointing)
  • E2e parallelism test passed (baseline_dp_only, 8 GPUs)
  • E2e checkpoint save→resume passed for all 4 configs: dp4_both, tp4_both, tp2_dp2_both, pp2_llm_dp4_vision
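As a rough illustration of what a save→resume round-trip check over a config matrix looks like, here is a minimal single-process sketch. It only reuses the four config names listed above; `save_checkpoint`, `load_checkpoint`, and `round_trip` are hypothetical helpers, and JSON files stand in for the torch_dist checkpoints and multi-GPU runs that the real tests exercise.

```python
import json
import tempfile
from pathlib import Path

# Config names taken from the validation list; their meaning is not modelled here.
CONFIGS = ["dp4_both", "tp4_both", "tp2_dp2_both", "pp2_llm_dp4_vision"]


def save_checkpoint(path, state):
    """Hypothetical saver: write the state dict as JSON."""
    path.write_text(json.dumps(state, sort_keys=True))


def load_checkpoint(path):
    """Hypothetical loader: read the state dict back from JSON."""
    return json.loads(path.read_text())


def round_trip(config_name, workdir):
    """Save a toy state, reload it, and report whether it survived unchanged."""
    state = {"config": config_name, "iteration": 10, "weights": [0.1, 0.2, 0.3]}
    ckpt = Path(workdir) / f"{config_name}.json"
    save_checkpoint(ckpt, state)
    return load_checkpoint(ckpt) == state


with tempfile.TemporaryDirectory() as tmp:
    results = {name: round_trip(name, tmp) for name in CONFIGS}
```

The real tests presumably compare model and optimizer state across distributed ranks; the sketch only shows the shape of the assertion, one pass/fail result per parallelism config.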

Stack

@copy-pr-bot

copy-pr-bot bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aroshanghias-nvd aroshanghias-nvd force-pushed the mimo/phase4-training-rebuild branch from ee23945 to 9642406 Compare March 18, 2026 01:28
@aroshanghias-nvd aroshanghias-nvd force-pushed the mimo/phase5-checkpointing-rebuild branch 2 times, most recently from 1e7481d to a49c524 Compare March 18, 2026 01:31
@yaoyu-33 yaoyu-33 added the area:ckpt and feature labels Mar 19, 2026
Squash of all Phase 5 MiMo checkpointing/evaluation work from
mimo/phase5-checkpointing (989842e), stacked on Phase 4 rebuild.

Includes:
- Checkpoint save/resume wiring for heterogeneous MiMo models
- MiMo evaluation infrastructure (eval.py MiMo extensions)
- Distributed batch slicing for evaluation (dp_utils.slice_batch_for_mimo)
- E2E training tests (test_mimo_training_e2e, test_mimo_training_llava)
- E2E checkpoint resume tests (test_mimo_checkpoint_resume_e2e)
- Parallelism test runner (run_mimo_parallelism_tests.sh)
- Full checkpoint unit test coverage (test_mimo_checkpointing — 1159 lines)

Original commit history preserved in backup/mimo-phase5-checkpointing-v0 (989842e).
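The distributed batch slicing listed above can be sketched as follows, under loud assumptions: `slice_batch_for_rank` is a hypothetical helper, not the signature of `dp_utils.slice_batch_for_mimo`, and contiguous equal-sized shards are assumed. The point is only that the same global batch yields different shard sizes when modules run with different data-parallel sizes.

```python
def slice_batch_for_rank(batch, dp_rank, dp_size):
    """Return the contiguous shard of every field in `batch` owned by `dp_rank`."""
    sliced = {}
    for key, values in batch.items():
        if len(values) % dp_size != 0:
            raise ValueError(
                f"{key}: batch size {len(values)} is not divisible by dp_size {dp_size}"
            )
        per_rank = len(values) // dp_size
        start = dp_rank * per_rank
        sliced[key] = values[start:start + per_rank]
    return sliced


# A global batch of 8 samples, with lists standing in for tensors.
batch = {"tokens": list(range(8)), "images": [f"img{i}" for i in range(8)]}

# Hypothetical heterogeneous setup: one module uses DP=4, the other DP=2.
llm_shard = slice_batch_for_rank(batch, dp_rank=1, dp_size=4)
vision_shard = slice_batch_for_rank(batch, dp_rank=1, dp_size=2)
```

With these inputs the DP=4 shard holds two samples and the DP=2 shard holds four, so each module sees a slice sized for its own DP group.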

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
…iring

The _make_setup_output fixture was missing pg_collections and
checkpointing_context attributes needed by the Phase 5 checkpoint
code path in pretrain_mimo. Also sets the checkpoint config fields to
None and gives build_data_iterators_fn a return value so that the test
completes without hitting unrelated code paths.

Pre-existing test gap at 989842e.
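A fixture of the kind described above might look like the sketch below. It is an illustration only: `make_setup_output`, the `cfg.checkpoint.save` and `cfg.checkpoint.load` field names, and the shape of the iterator tuple are assumptions, while `pg_collections`, `checkpointing_context`, and `build_data_iterators_fn` are the names mentioned in this commit message.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def make_setup_output():
    """Build a fake setup result carrying the attributes the checkpoint path reads."""
    # Hypothetical checkpoint fields, set to None so no checkpoint I/O is attempted.
    checkpoint_cfg = SimpleNamespace(save=None, load=None)
    return SimpleNamespace(
        model=MagicMock(name="model"),
        pg_collections=MagicMock(name="pg_collections"),
        checkpointing_context={},
        cfg=SimpleNamespace(checkpoint=checkpoint_cfg),
    )


# The mocked builder returns a fixed value so callers can unpack it safely.
build_data_iterators_fn = MagicMock(return_value=(iter([]), None))

setup_output = make_setup_output()
train_iter, valid_iter = build_data_iterators_fn(setup_output.cfg)
```

Giving the mock an explicit return value matters because a bare `MagicMock()` result cannot be unpacked into a fixed number of iterators.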

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
…int test

Restores the checkpoint resume test wrapper from mimo/wip-phase4-training.
Runs save→resume round-trip across multiple parallelism configs.

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
@aroshanghias-nvd aroshanghias-nvd force-pushed the mimo/phase5-checkpointing-rebuild branch from cd3f7fc to e4d2fdf Compare March 25, 2026 17:01
Labels

area:ckpt: Checkpoint conversion, loading, export, and save paths
feature: New capabilities, enhancements, or enablement work

4 participants