
Add multi-module heterogeneous parallelism support for MIMO model#3211

Merged
yashaswikarnati merged 28 commits into NVIDIA:main from yashaswikarnati:yash/mimo_non_colocated
Mar 24, 2026

Conversation

@yashaswikarnati
Contributor

Summary

Adds support for running encoder and language modules on separate pipeline-parallel grids in MIMO models. This enables non-colocated multi-module pipeline parallelism where different modalities can have independent PP configurations.

  • Add RankRole and ModuleStageInfo data classes for role management
  • Implement stage-aware forward passes in modality submodules
  • Update MimoModel to selectively initialize modules based on rank role
  • Add unit tests for multi-module PP functionality
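
The grid mapping described above can be pictured with a toy sketch. The name `module_to_grid_map` follows the PR's description, but the plain rank lists here merely stand in for real `HyperCommGrid` objects, and `modules_for_rank` is a hypothetical helper, not the PR's API:

```python
# Illustrative only: rank lists stand in for real HyperCommGrid objects.
# Vision encoder runs on its own 2-rank PP grid; the language model
# gets an independent 4-rank PP grid (non-colocated).
module_to_grid_map = {
    "images": [0, 1],          # encoder PP stages on ranks 0-1
    "language": [2, 3, 4, 5],  # language-model PP stages on ranks 2-5
}

def modules_for_rank(rank, grid_map):
    """Which modules does this rank participate in?"""
    return [name for name, ranks in grid_map.items() if rank in ranks]

print(modules_for_rank(0, module_to_grid_map))  # ['images']
print(modules_for_rank(3, module_to_grid_map))  # ['language']
```

A rank that appears in only one module's grid gets an encoder-only or language-only role; a rank in every grid corresponds to the colocated case.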

Dependencies

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph "Code Review/Approval"
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and CI is passing.
Final Review may be declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

yashaswikarnati and others added 10 commits January 27, 2026 11:48
- Rename ProcessGroupCollectionWrapper to MultiModuleProcessGroupCollection
- Rename language_model field to language_model_module_name for clarity
- Add language_model_module_name param to backward_step_multimodule
- Use functools.partial to bind param, keeping signature consistent
- Add type hints to _ensure_3d_tensor and _restore_tensor_shape
- Move is_multimodule check earlier for validation and backward selection
@yashaswikarnati yashaswikarnati requested review from a team as code owners February 2, 2026 23:07
@copy-pr-bot

copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ko3n1g ko3n1g requested a review from a team February 2, 2026 23:07
yashaswikarnati and others added 6 commits February 2, 2026 20:27
Introduce data classes to manage rank roles in multi-module PP setups:
- ModuleStageInfo: tracks first/last stage position within a module
- RankRole: tracks which modules a rank participates in and their stages

These classes enable selective module initialization and stage-aware
forward passes when different modules run on separate PP grids.

Signed-off-by: ykarnati <ykarnati@nvidia.com>
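
The role bookkeeping this commit introduces might look roughly like the sketch below. Field and method names (`module_stages`, `participates_in`) are illustrative stand-ins under the commit's description, not the PR's actual API:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class ModuleStageInfo:
    """Position of this rank's stage within one module's PP grid."""
    is_first_stage: bool
    is_last_stage: bool

@dataclass(frozen=True)
class RankRole:
    """Which modules a rank participates in, and its stage within each."""
    module_stages: Dict[str, ModuleStageInfo]

    def participates_in(self, module_name: str) -> bool:
        return module_name in self.module_stages

# An encoder-only rank: first stage of the "images" module, in no other grid.
role = RankRole(module_stages={"images": ModuleStageInfo(True, False)})
print(role.participates_in("images"))    # True
print(role.participates_in("language"))  # False
```

Selective initialization then reduces to checking `participates_in` before constructing each submodule.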
Enable modality submodules to operate in multi-stage PP configurations:
- Add is_first_stage/is_last_stage as immutable properties
- First stage: runs encoder on raw inputs
- Intermediate stages: pass through hidden states
- Last stage: applies input projection before language model

Update from_spec() to pass stage info through constructor for proper
initialization based on pipeline position.

Signed-off-by: ykarnati <ykarnati@nvidia.com>
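
The stage-aware dispatch described in this commit can be modeled with a toy class. Everything here (class name, `encode`, `project`, the list arithmetic) is a hypothetical stand-in for the real submodule logic:

```python
class StageAwareSubmodule:
    """Toy model of stage-aware forward dispatch in a modality submodule."""

    def __init__(self, is_first_stage: bool, is_last_stage: bool):
        self.is_first_stage = is_first_stage
        self.is_last_stage = is_last_stage

    def encode(self, raw_inputs):
        return [x * 2 for x in raw_inputs]     # stand-in encoder

    def project(self, hidden_states):
        return [x + 1 for x in hidden_states]  # stand-in input projection

    def forward(self, raw_inputs=None, hidden_states=None):
        if self.is_first_stage:                # first stage: encode raw inputs
            hidden_states = self.encode(raw_inputs)
        if self.is_last_stage:                 # last stage: project before the LM
            hidden_states = self.project(hidden_states)
        return hidden_states                   # intermediate: pure pass-through
```

A single-stage (colocated) submodule has both flags set, so it encodes and projects in one call; an intermediate stage has neither and simply forwards the received hidden states.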
Add support for running encoder and language modules on separate PP grids:
- Determine rank role based on module_to_grid_map configuration
- Selective module initialization based on role (encoder-only or LM-only)
- Stage-aware forward dispatching based on role
- Validate grid map configuration requires language_module_key

The forward pass now routes to _forward_encoders or _forward_language_module
based on the rank's assigned role in the multi-module PP setup.

Signed-off-by: ykarnati <ykarnati@nvidia.com>
Add comprehensive tests for multi-module PP functionality:
- test_mimo_role.py: RankRole and ModuleStageInfo data classes
- test_mimo_1f1b_schedule.py: 1F1B schedule with multi-module PP
- Update existing tests for stage-aware submodule behavior

Tests validate role determination, selective initialization, and
stage-aware forward passes for both encoder-only and language-only ranks.

Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@yashaswikarnati yashaswikarnati force-pushed the yash/mimo_non_colocated branch from 0807db4 to 7da19e1 Compare February 3, 2026 04:49
@dimapihtar
Contributor

/ok to test 5b94c0f

@yashaswikarnati yashaswikarnati force-pushed the yash/mimo_non_colocated branch from 6e44ffd to 84a1ebf Compare March 19, 2026 00:23
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Mar 19, 2026
yashaswikarnati and others added 2 commits March 19, 2026 08:20
- base.py: Remove redundant conditionals in _initialize_submodules,
  simplify forward() dispatch with guard-first pattern, collapse
  _forward_encoders to single-expression conditionals, return None
  from _determine_role when rank is in no grid
- submodules/base.py: Promote encode, combine_embeddings,
  project_embeddings, and forward from abstract to concrete methods,
  fix missing f-prefix in error message, fix project_embeddings to
  always combine before projecting
- submodules/vision.py, audio.py: Remove duplicate implementations,
  keep only __init__ (with projection assertions) and decode
- config/role.py: Add __post_init__ validation for language_module_name
- from_spec docstring: Document is_first_stage controls output projections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add _make_vlm/_make_avlm/_make_input_ids/_make_position_ids helpers
  to eliminate repeated 7-arg factory calls and tensor construction
- Move device to setup_method, remove 7 duplicate torch.device() lines
- Delete dead module-level AudioEncoderWrapper (duplicate of inner class)
- Simplify test_state_dict to any() one-liners
- Remove redundant assert-not-None before shape checks
- Fix hardcoded batch_size=2 to use self.batch_size
- Remove test-internal setup assertions that can never fail
- Add img_h/img_w/patch_dim attrs to TestMimoModelNonColocated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yashaswikarnati yashaswikarnati removed request for a team, erhoo82 and yanring March 19, 2026 16:56
yashaswikarnati and others added 7 commits March 19, 2026 10:47
- Add colocated flag and RankRole.all_modules() factory so
  _determine_role always returns a RankRole (never None)
- Remove all `if self.role is not None` guards from _initialize_submodules,
  _initialize_language_model, and forward()
- forward() checks self.role.colocated instead of self.role is None
- Rank-not-in-any-grid now raises RuntimeError immediately in
  _determine_role instead of returning None and failing later
- Simplify _forward_encoders: pass both encoder_inputs and hidden_states
  to submodule, let its is_first_stage flag decide which to use
- Update test_role_determination to assert colocated role properties

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MIMO always has exactly one language model, so the key doesn't need
to be configurable. This removes:
- language_module_key field from MimoModelConfig
- language_module_name field from RankRole
- Validation that language_module_key is set
- The or "_language" fallback hack in _determine_role

Replaced with a single constant LANGUAGE_MODULE_KEY = "language" in
config/role.py, used consistently across base.py and tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace colocated bool with PipelineMode enum (UNIFIED, NON_COLOCATED,
  COLOCATED) for clear forward path dispatch
- Move _determine_role and _validate_grid_map from MimoModel to
  RankRole.from_grid_map classmethod — MimoModel no longer knows
  about grids
- Rename LANGUAGE_MODULE_KEY to MIMO_LANGUAGE_MODULE_KEY
- Type module_to_grid_map as Dict[str, HyperCommGrid] instead of
  Dict[str, Any]
- Remove torch.distributed import from base.py (moved to role.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
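
The three-way dispatch this commit introduces can be sketched with a plain enum. The variant names come from the commit message; the `forward_path` helper and its return strings are purely illustrative:

```python
from enum import Enum, auto

class PipelineMode(Enum):
    """Sketch of the three forward-path modes named in the commit."""
    UNIFIED = auto()
    COLOCATED = auto()
    NON_COLOCATED = auto()

def forward_path(mode: PipelineMode) -> str:
    # Hypothetical dispatch; the real MimoModel routes per-rank roles.
    if mode is PipelineMode.NON_COLOCATED:
        return "role-specific forward (encoder-only or LM-only)"
    return "full multimodal forward on every rank"

print(forward_path(PipelineMode.NON_COLOCATED))
print(forward_path(PipelineMode.COLOCATED))
```

An enum makes the dispatch exhaustive and self-documenting, where the earlier `colocated` bool could only distinguish two of the three cases.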
Better describes the spatial arrangement of modules across ranks
without overloading "pipeline" which has specific meaning in Megatron.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
base.py:
- Fix set_input_tensor: unwrap schedule's list wrapper before checking
  dict type, and unwrap single-element lists in dict values (P2P recv
  returns [tensor] for VPP compat)
- Fix set_input_tensor for DDP: use unwrap_model to call
  set_input_tensor on underlying GPTModel through DDP wrapper
- Remove pipeline_model_parallel_size == 1 assertion (contradicts
  non-colocated PP goal)

test_mimo_1f1b_schedule.py:
- Convert from standalone script to pytest class (TestMimo1F1BSchedule)
- Add grid tracking + cleanup (destroy_all_grids, teardown_method)
- Fix dist.new_group desync: create_all_embedding_groups upfront so
  all ranks participate in collective new_group calls
- Fix embedding groups: set embd=None for encoder ranks (no shared
  word embeddings to sync in finalize_model_grads)
- Fix NVTE env vars: clear conftest's NVTE_FLASH_ATTN=0 before
  GPTModel creation (LanguageModule asserts these match backend)
- Use MIMO_LANGUAGE_MODULE_KEY, 6-dim grid shape with expt_dp
- Cache pg_collections to avoid PG leaks in finalize_grads_func
- Add BridgeCommunicator.destroy_broadcast_pgs() to teardown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
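
The list-unwrapping behavior described for `set_input_tensor` can be sketched as a standalone helper. This is a simplified illustration of the shape handling the commit describes (strings stand in for tensors), not the actual implementation:

```python
def unwrap_input_tensor(input_tensor):
    """Unwrap the schedule's list wrapper, then single-element lists in
    dict values (P2P recv returns [tensor] for VPP compatibility)."""
    # The schedule wraps the received payload in an outer list.
    if isinstance(input_tensor, list) and len(input_tensor) == 1:
        input_tensor = input_tensor[0]
    # For dict payloads, dict values may themselves be [tensor] lists.
    if isinstance(input_tensor, dict):
        input_tensor = {
            k: (v[0] if isinstance(v, list) and len(v) == 1 else v)
            for k, v in input_tensor.items()
        }
    return input_tensor

print(unwrap_input_tensor([{"images": ["t"]}]))  # {'images': 't'}
```

The dict-type check must happen after the outer unwrap, since the schedule's wrapper hides the dict from a naive `isinstance` test.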
@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Mar 23, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yashaswikarnati
Contributor Author

/ok to test a94098d

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23499505225

Merged via the queue into NVIDIA:main with commit 70a89af Mar 24, 2026
63 of 64 checks passed
@yashaswikarnati yashaswikarnati deleted the yash/mimo_non_colocated branch March 24, 2026 16:35

Labels

  • Approved: All necessary approvals have been made
  • complexity: high
  • Expert Review [deprecated]: Apply this label to indicate that your PR is ready for expert review.


6 participants