Enable AG/RS overlap with explicit process group passing#3249
jeffnvidia wants to merge 4 commits into NVIDIA:main
Conversation
Hi, @jeffnvidia could you resolve the merge conflict, please?
Force-pushed: cfdfe8b to f1aa8e4
Hi, yes, resolved
/ok to test f1aa8e4
Force-pushed: 58c2d96 to 1757e42
shjwudp left a comment:
Looks good to me for the Megatron-FSDP part.
megatron/training/arguments.py (Outdated)

```python
group.add_argument('--fsdp-manual-registration', action='store_true', dest='fsdp_manual_registration',
                   default=False, help='Manually register the FSDP communication buffers to NCCL user buffer.'
                   'This option is only effective when use-megatron-fsdp and use-nccl-ub is set.')
group.add_argument('--use-sharp', action='store_true',
```
These three arguments should be initialized from DistributedInitConfig.
For reference, see this Megatron-LM commit: adce147#diff-42fa19ec8893eabf951a5bb21edfc0dfe7c9c8949d5087ab133f5897ea0e3213R2203
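The pattern the reviewer suggests, sourcing argparse defaults from a config object instead of hardcoding them, might look like this minimal sketch. The `DistributedInitConfig` class and its field names here are hypothetical stand-ins, not Megatron-LM's actual definition.

```python
import argparse
from dataclasses import dataclass

@dataclass
class DistributedInitConfig:
    # Hypothetical fields; the real config class lives in Megatron-LM.
    fsdp_manual_registration: bool = False
    use_sharp: bool = False

def add_distributed_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Defaults come from one instance of the config dataclass, so the CLI
    # and the programmatic config can never silently disagree.
    defaults = DistributedInitConfig()
    group = parser.add_argument_group('distributed')
    group.add_argument('--fsdp-manual-registration', action='store_true',
                       default=defaults.fsdp_manual_registration,
                       help='Manually register the FSDP communication buffers '
                            'to the NCCL user buffer.')
    group.add_argument('--use-sharp', action='store_true',
                       default=defaults.use_sharp)
    return parser

parser = add_distributed_args(argparse.ArgumentParser())
args = parser.parse_args([])
```

With no flags passed, `args` reproduces the dataclass defaults; passing `--use-sharp` flips only that field.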
not sure why I'm suddenly having a secrets detector error, I just deleted lines I myself added
thanks for the approval, can people from @NVIDIA/mcore-oncall @NVIDIA/core-nemo review it? Thanks a lot
Force-pushed: f59e385 to 98eecc7
hey @cspades, I edited the PR according to the commit, can you review it? Thanks a lot
Force-pushed: 98eecc7 to 00cbcf2
cspades left a comment:
LGTM, thanks for jumping through some hoops on this one.
TODO: Benchmark and migrate to pg_collection.
megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py
/ok to test f9b5b47
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
Force-pushed: f9b5b47 to d1c1c62
/ok to test d1c1c62
/ok to test 56ecd09
```python
                         'to overlap reduce-scatter and all-gather operations.')
                    help='Enable AG/RS overlap optimization by creating separate '
                         'all-gather communicators.')
group.add_argument('--megatron-fsdp-pg-collection', action='store_true',
```
Why are we exposing this knob supporting both behaviors? Is there something stopping us that I'm missing?
+1, this seems like an implementation detail, not something the user should have to worry about.
Context here should explain the status quo: #3249 (comment)
I don't like this either.
/claude review
```python
decoder_rank_gen = RankGenerator(
    tp=tp_size, ep=1, dp=dp_size, pp=pp_size, cp=cp_size, order='tp-cp-ep-dp-pp', rank_offset=0
)
```
Bug: order and rank_offset are hardcoded here, but initialize_model_parallel() accepts them as parameters. If a user passes a different order (e.g., 'tp-dp-pp-cp-ep') or non-zero rank_offset, this RankGenerator will produce different rank lists than the actual DP groups, silently creating AG groups with wrong membership.
The same issue applies to the expert RankGenerator below (line 1423-1431).
A simpler and more robust approach would be to get the ranks directly from the already-created groups (which is what the tests in this PR already do):
```python
# Regular DP AG group
dp_cp_group = get_data_parallel_group(with_context_parallel=True)
all_dp_cp_ranks = get_data_parallel_group_ranks(with_context_parallel=True)
dp_cp_ag_group = None
for ranks_with_cp in all_dp_cp_ranks:  # or iterate all groups
    group_with_cp_ag = create_group(
        ranks_with_cp,
        timeout=timeout,
        pg_options=get_nccl_options('dp_cp', nccl_comm_cfgs or {}),
        group_desc='DATA_PARALLEL_GROUP_WITH_CP_AG',
    )
    if rank in ranks_with_cp:
        dp_cp_ag_group = group_with_cp_ag
```

Or even simpler: since create_group is a collective, you could collect the ranks from _DATA_PARALLEL_GLOBAL_RANKS_WITH_CP (the global that stores all dp-cp rank lists). This avoids re-deriving ranks entirely and guarantees consistency with the initialized state.
Overall the approach of moving from global state to explicit ProcessGroupCollection passing looks good. One bug flagged inline:
create_all_gather_groups() hardcodes order and rank_offset — the RankGenerator in this function uses order='tp-cp-ep-dp-pp' and rank_offset=0, but initialize_model_parallel() accepts these as configurable parameters. For non-default configurations, the AG groups will silently get wrong rank membership. The fix is straightforward: retrieve ranks from the already-initialized groups instead of re-deriving them.
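A self-contained toy (not Megatron's actual `RankGenerator`) illustrates why the hardcoded order matters: grouping ranks into DP groups depends on which axis varies fastest in the grid layout, so re-deriving groups with a fixed `order` can disagree with the order the user actually initialized with.

```python
import itertools

def dp_groups(tp: int, dp: int, pp: int, order: str) -> list[list[int]]:
    """Lay ranks on a (tp, dp, pp) grid, first-listed axis varying fastest,
    and collect the DP group for each fixed (tp, pp) coordinate."""
    sizes = {'tp': tp, 'dp': dp, 'pp': pp}
    axes = order.split('-')
    groups: dict[tuple, list[int]] = {}
    # itertools.product varies its last argument fastest, so reverse the axes.
    for rank, coords in enumerate(
        itertools.product(*(range(sizes[ax]) for ax in reversed(axes)))
    ):
        coord = dict(zip(reversed(axes), coords))
        key = (coord['tp'], coord['pp'])
        groups.setdefault(key, []).append(rank)
    return list(groups.values())

a = dp_groups(tp=2, dp=2, pp=1, order='tp-dp-pp')
b = dp_groups(tp=2, dp=2, pp=1, order='dp-tp-pp')
print(a, b)   # [[0, 2], [1, 3]] vs [[0, 1], [2, 3]]
print(a == b) # False: same world size, different group membership
```

With 4 ranks, the two orders produce disjoint DP group memberships, which is exactly the silent mismatch the review flags when the AG groups are rebuilt with a hardcoded order.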
TODO for me to benchmark this PR on PG collection backend, and fully deprecate the global parallel_state for Megatron-FSDP. cc @shjwudp |
What does this PR do?
This PR enables all-gather (AG) / reduce-scatter (RS) communication overlap for both regular data parallelism and Expert Parallelism (MoE models) by migrating from global process group management to explicit argument passing via `ProcessGroupCollection`.

Motivation
Problem: PR #2663 (merged 2026-01-27) implemented AG/RS overlap for regular DP using global state in `parallel_state.py`, but:
Solution: This PR refactors the implementation to use explicit process group passing while adding MoE support.
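The explicit-passing pattern described above might look like this minimal sketch. The field names follow the PR text, but the class body and `build_buffers` helper are illustrative, not the actual Megatron-Core code.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ProcessGroupCollection:
    # Illustrative subset of fields; real collection holds many more groups.
    dp_cp: Optional[Any] = None      # regular data-parallel (+CP) group
    dp_cp_ag: Optional[Any] = None   # separate all-gather group (opt-in)
    expt_dp_ag: Optional[Any] = None # expert-DP all-gather group (opt-in)

def build_buffers(pg_collection: ProcessGroupCollection):
    # Explicit data flow: the AG group arrives as an argument; nothing is
    # fetched from parallel_state globals. Falling back to the main group
    # preserves the old (non-overlapped) behavior when the field is None.
    ag_group = pg_collection.dp_cp_ag or pg_collection.dp_cp
    return ag_group

pgs = ProcessGroupCollection(dp_cp="dp_group")
print(build_buffers(pgs))  # dp_group: defaults unchanged without opt-in
```

Because the AG fields default to `None`, callers that never create separate AG groups see identical behavior to before the refactor.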
Key Changes
1. ProcessGroupCollection Extension (`process_groups_config.py`)
   - `dp_cp_ag` field for regular data parallel all-gather group
   - `expt_dp_ag` field for expert data parallel all-gather group (NEW for MoE)
   - Default to `None`; users must create them explicitly (opt-in feature)
2. FSDPDistributedIndex Refactor (`utils.py`)
   - No longer reads `parallel_state` globals
   - `fsdp_group_ag` and `expt_fsdp_group_ag` as explicit constructor parameters
   - `get_fsdp_group(..., independent_all_gather=True)` returns the appropriate AG group based on parameter type
3. Explicit Group Extraction (`mcore_fsdp_adapter.py`)
   - AG groups extracted from `pg_collection` using `getattr()` and passed to `FSDPDistributedIndex`
4. Expert Parameter Support (`param_and_grad_buffer.py`)
   - `group.is_expert_param` flag
5. Cleanup of PR #2663 Global State
   Removed:
   - `_DATA_PARALLEL_GROUP_WITH_CP_AG` global variable
   - `has_separate_all_gather_group()` function
   - `independent_all_gather` parameter from `get_data_parallel_group()`
   - `create_all_gather_group` parameter from `initialize_model_parallel()`
   - `--create-all-gather-group` CLI argument

Benefits
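The selection logic described for `get_fsdp_group` could be sketched as below. Names mirror the PR summary; the real `FSDPDistributedIndex` in Megatron-FSDP differs in detail, so treat this as a hypothetical illustration only.

```python
from typing import Any, Optional

class FSDPDistributedIndex:
    """Toy sketch: AG groups arrive as explicit constructor parameters."""

    def __init__(self, fsdp_group: Any,
                 fsdp_group_ag: Optional[Any] = None,
                 expt_fsdp_group_ag: Optional[Any] = None):
        self.fsdp_group = fsdp_group
        self.fsdp_group_ag = fsdp_group_ag
        self.expt_fsdp_group_ag = expt_fsdp_group_ag

    def get_fsdp_group(self, is_expert_param: bool = False,
                       independent_all_gather: bool = False):
        # Pick the AG group matching the parameter type; fall back to the
        # main FSDP group when no separate AG group was created (opt-in).
        if independent_all_gather:
            ag = self.expt_fsdp_group_ag if is_expert_param else self.fsdp_group_ag
            if ag is not None:
                return ag
        return self.fsdp_group

idx = FSDPDistributedIndex('dp', fsdp_group_ag='dp_ag', expt_fsdp_group_ag='edp_ag')
```

Here expert parameters route to the expert-DP AG group and everything else to the regular AG group, while callers that never opt in keep the single shared group.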
✅ Clean Architecture: No new globals in `parallel_state.py`
✅ Explicit Data Flow: Process groups passed as arguments, not accessed globally
✅ Expert Parallelism Support: AG/RS overlap now works for MoE models
✅ Testability: Easier to mock and test with dependency injection
✅ Backward Compatible: Opt-in feature (defaults to `None`, same behavior as before)

Migration Guide for Users
Before (PR #1 - no longer supported):