feat(mimo): Phase 5 - checkpoint save/resume, evaluation, e2e tests #2870
Open
aroshanghias-nvd wants to merge 5 commits into mimo/phase4-training-rebuild
Squash of all Phase 5 MiMo checkpointing/evaluation work from mimo/phase5-checkpointing (989842e), stacked on Phase 4 rebuild.

Includes:
- Checkpoint save/resume wiring for heterogeneous MIMO models
- MiMo evaluation infrastructure (eval.py MIMO extensions)
- Distributed batch slicing for evaluation (dp_utils.slice_batch_for_mimo)
- E2E training tests (test_mimo_training_e2e, test_mimo_training_llava)
- E2E checkpoint resume tests (test_mimo_checkpoint_resume_e2e)
- Parallelism test runner (run_mimo_parallelism_tests.sh)
- Full checkpoint unit test coverage (test_mimo_checkpointing, 1159 lines)

Original commit history preserved in backup/mimo-phase5-checkpointing-v0 (989842e).

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
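The distributed batch slicing listed in this commit can be illustrated with a minimal sketch. This is not the PR's dp_utils.slice_batch_for_mimo: the helper name slice_batch_for_dp_rank, its signature, and the even contiguous split are all assumptions made for illustration. It assumes that modules may run with different data-parallel sizes, so each rank keeps only its own share of the global batch.

```python
from typing import Dict, List, Sequence


def slice_batch_for_dp_rank(
    batch: Dict[str, Sequence], dp_rank: int, dp_size: int
) -> Dict[str, List]:
    """Hypothetical helper: keep the contiguous shard of each field owned by dp_rank."""
    if not 0 <= dp_rank < dp_size:
        raise ValueError(f"dp_rank {dp_rank} out of range for dp_size {dp_size}")
    sliced = {}
    for key, values in batch.items():
        if len(values) % dp_size != 0:
            raise ValueError(
                f"field {key!r} has {len(values)} samples, not divisible by {dp_size}"
            )
        per_rank = len(values) // dp_size
        start = dp_rank * per_rank
        sliced[key] = list(values[start:start + per_rank])
    return sliced


global_batch = {"tokens": [10, 11, 12, 13], "images": ["a", "b", "c", "d"]}
# With DP size 4 each rank keeps one sample; with DP size 2 each rank keeps two.
vision_shard = slice_batch_for_dp_rank(global_batch, dp_rank=1, dp_size=4)
llm_shard = slice_batch_for_dp_rank(global_batch, dp_rank=1, dp_size=2)
```

The same global batch yields different shards depending on the DP size of the group doing the slicing, which is the situation a heterogeneous DP layout has to handle.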
…iring

The _make_setup_output fixture was missing pg_collections and checkpointing_context attributes needed by the Phase 5 checkpoint code path in pretrain_mimo. Also set checkpoint config fields to None and build_data_iterators_fn return value so the test completes without hitting unrelated code paths. Pre-existing test gap at 989842e.

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
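The kind of fixture fix described in this commit might look like the sketch below. Only the attribute names pg_collections and checkpointing_context and the name build_data_iterators_fn come from the commit message. The function make_setup_output_stub, the checkpoint field names save and load, and the shape of the iterator return value are hypothetical.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def make_setup_output_stub():
    """Hypothetical stand-in for the _make_setup_output test fixture."""
    # Field names "save" and "load" are assumed, not taken from the PR.
    checkpoint_cfg = SimpleNamespace(save=None, load=None)
    return SimpleNamespace(
        model=MagicMock(name="model"),
        pg_collections=MagicMock(name="pg_collections"),
        checkpointing_context={},
        cfg=SimpleNamespace(checkpoint=checkpoint_cfg),
    )


# Give the stubbed builder an explicit return value so the caller can unpack it
# rather than receiving a bare MagicMock.
build_data_iterators_fn = MagicMock(return_value=(iter([]), None))

setup_output = make_setup_output_stub()
train_iter, valid_iter = build_data_iterators_fn(setup_output.cfg)
```

Setting the checkpoint fields to None keeps a test like this away from real save/load paths, while the added attributes let code that reads them run without AttributeError.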
…int test

Restores the checkpoint resume test wrapper from mimo/wip-phase4-training. Runs save→resume round-trip across multiple parallelism configs.

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
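A save→resume round-trip check can be sketched in miniature as follows. This is a toy, single-process illustration using JSON files, not the PR's test wrapper: the functions save_checkpoint, load_latest_checkpoint, train, and round_trip are hypothetical, and the config names are used only as labels.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, state: dict) -> Path:
    path = directory / f"iter_{state['iteration']:07d}.json"
    path.write_text(json.dumps(state))
    return path


def load_latest_checkpoint(directory: Path) -> dict:
    # Zero-padded file names sort in iteration order.
    latest = max(directory.glob("iter_*.json"))
    return json.loads(latest.read_text())


def train(state: dict, steps: int) -> dict:
    """Toy deterministic training: each step updates a single weight."""
    for _ in range(steps):
        state = {
            "iteration": state["iteration"] + 1,
            "weight": state["weight"] * 0.5 + 1.0,
        }
    return state


def round_trip(config_name: str) -> bool:
    """Return True if training with a save/resume break matches an uninterrupted run."""
    initial = {"iteration": 0, "weight": 0.0}
    uninterrupted = train(initial, steps=4)
    with tempfile.TemporaryDirectory(prefix=f"{config_name}_") as tmp:
        directory = Path(tmp)
        save_checkpoint(directory, train(initial, steps=2))
        resumed = load_latest_checkpoint(directory)
    return train(resumed, steps=2) == uninterrupted


results = {name: round_trip(name) for name in ("dp4_both", "tp4_both")}
```

The property being checked is that resuming from a saved state and continuing gives the same final state as never stopping. A real test would compare model and optimizer state across ranks rather than a single float.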
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Summary
Adds checkpoint save/resume and evaluation support for MiMo models, stacked on the Phase 4 training PR (#2869).
- checkpointing.py, pretrain_mimo, and train_mimo, with torch_dist format support and access-pattern validation bypass for nested DDP language model tensors in PP>1
- eval.py and distributed batch slicing (dp_utils.slice_batch_for_mimo) for evaluation across heterogeneous DP groups
- set_input_tensor method proxying on DDP-wrapped language model for correct PP decoder input wiring during checkpoint resume

Validation
- … (baseline_dp_only, 8 GPUs)
- dp4_both, tp4_both, tp2_dp2_both, pp2_llm_dp4_vision

Stack