Enable AG/RS overlap with explicit process group passing #3249

Open
jeffnvidia wants to merge 4 commits into NVIDIA:main from jeffnvidia:add_ag_groups_to_pg_collection

Conversation

@jeffnvidia
Contributor

@jeffnvidia jeffnvidia commented Feb 4, 2026

What does this PR do?

This PR enables all-gather (AG) / reduce-scatter (RS) communication overlap for both regular data parallelism and Expert Parallelism (MoE models) by migrating from global process group management to explicit argument passing via ProcessGroupCollection.

Motivation

Problem: PR #2663 (merged 2026-01-27) implemented AG/RS overlap for regular DP using global state in parallel_state.py, but it:

  1. Did not support Expert Parallelism (MoE models)
  2. Violated Megatron-LM's architectural direction away from global state
  3. Made testing harder due to indirect data flow

Solution: This PR refactors the implementation to use explicit process group passing while adding MoE support.

Key Changes

1. ProcessGroupCollection Extension (process_groups_config.py)

  • Added dp_cp_ag field for regular data parallel all-gather group
  • Added expt_dp_ag field for expert data parallel all-gather group (NEW for MoE)
  • Both default to None - users must create them explicitly (opt-in feature)
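A minimal sketch of the extension described above. The field names `dp_cp_ag` and `expt_dp_ag` come from the PR description; the stand-in class is illustrative only (the real `ProcessGroupCollection` holds `torch.distributed.ProcessGroup` handles, not plain objects):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessGroupCollectionSketch:
    """Illustrative stand-in for the extended ProcessGroupCollection."""
    # Regular data-parallel all-gather group (None = feature disabled).
    dp_cp_ag: Optional[object] = None
    # Expert data-parallel all-gather group for MoE models (new in this PR).
    expt_dp_ag: Optional[object] = None

# Opt-in: both AG groups default to None, so existing callers see no change.
pgc = ProcessGroupCollectionSketch()
assert pgc.dp_cp_ag is None and pgc.expt_dp_ag is None

# Users enable AG/RS overlap by creating the groups explicitly, e.g.
# (hypothetical usage):
# pgc = ProcessGroupCollection(...,
#     dp_cp_ag=torch.distributed.new_group(dp_cp_ranks),
#     expt_dp_ag=torch.distributed.new_group(expert_dp_ranks))
```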

2. FSDPDistributedIndex Refactor (utils.py)

  • Before: Retrieved AG groups from parallel_state globals
  • After: Accepts fsdp_group_ag and expt_fsdp_group_ag as explicit constructor parameters
  • get_fsdp_group(..., independent_all_gather=True) returns appropriate AG group based on parameter type
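The constructor parameters `fsdp_group_ag` and `expt_fsdp_group_ag` and the `independent_all_gather=True` flag are named in the PR description; the exact dispatch below (keying on an `is_expert_param` argument, falling back to the base group when no AG group was supplied) is an assumption sketched for illustration:

```python
class FSDPDistributedIndexSketch:
    """Illustrative stand-in: AG groups arrive as constructor arguments
    instead of being read from parallel_state globals."""

    def __init__(self, fsdp_group, fsdp_group_ag=None, expt_fsdp_group_ag=None):
        self.fsdp_group = fsdp_group
        self.fsdp_group_ag = fsdp_group_ag
        self.expt_fsdp_group_ag = expt_fsdp_group_ag

    def get_fsdp_group(self, is_expert_param=False, independent_all_gather=False):
        # With independent_all_gather=True, return the AG group matching the
        # parameter type (expert vs. regular); otherwise use the base group.
        if independent_all_gather:
            ag = self.expt_fsdp_group_ag if is_expert_param else self.fsdp_group_ag
            if ag is not None:
                return ag
        return self.fsdp_group

idx = FSDPDistributedIndexSketch("dp", fsdp_group_ag="dp_ag", expt_fsdp_group_ag="expt_ag")
assert idx.get_fsdp_group(independent_all_gather=True) == "dp_ag"
assert idx.get_fsdp_group(is_expert_param=True, independent_all_gather=True) == "expt_ag"
assert idx.get_fsdp_group() == "dp"
```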

3. Explicit Group Extraction (mcore_fsdp_adapter.py)

  • Extracts AG groups from pg_collection using getattr()
  • Passes them explicitly to FSDPDistributedIndex
  • Handles both HSDP and non-HSDP configurations
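The `getattr()` extraction can be sketched as follows (the helper name `extract_ag_groups` is hypothetical; the `None` default is what keeps older collections without the new fields working):

```python
from types import SimpleNamespace

def extract_ag_groups(pg_collection):
    # getattr with a None default keeps ProcessGroupCollection instances
    # that predate the new fields working unchanged.
    dp_cp_ag = getattr(pg_collection, "dp_cp_ag", None)
    expt_dp_ag = getattr(pg_collection, "expt_dp_ag", None)
    return dp_cp_ag, expt_dp_ag

# New-style collection carrying AG groups:
assert extract_ag_groups(SimpleNamespace(dp_cp_ag="ag", expt_dp_ag="eag")) == ("ag", "eag")
# Old-style collection without the fields:
assert extract_ag_groups(SimpleNamespace()) == (None, None)
```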

4. Expert Parameter Support (param_and_grad_buffer.py)

  • Automatically selects correct AG group based on group.is_expert_param flag
  • Registers both regular and expert AG groups with UBR for NCCL optimization
  • Backward compatible: uses regular DP group if AG groups not provided
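The selection logic for the buffer can be sketched like this (the function name `select_ag_group` is hypothetical; the `is_expert_param` flag and the fall-back to the regular DP group are from the bullets above):

```python
from types import SimpleNamespace

def select_ag_group(group, dp_group, dp_ag_group=None, expt_dp_ag_group=None):
    """Pick the AG group by the bucket group's is_expert_param flag,
    falling back to the regular DP group when no AG group was provided."""
    ag = expt_dp_ag_group if group.is_expert_param else dp_ag_group
    return ag if ag is not None else dp_group

regular = SimpleNamespace(is_expert_param=False)
expert = SimpleNamespace(is_expert_param=True)
assert select_ag_group(regular, "dp", dp_ag_group="dp_ag") == "dp_ag"
assert select_ag_group(expert, "dp", expt_dp_ag_group="expt_ag") == "expt_ag"
# Backward compatible: no AG groups provided -> plain DP group.
assert select_ag_group(regular, "dp") == "dp"
```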

5. Cleanup of PR #2663 Global State

Removed:

  • _DATA_PARALLEL_GROUP_WITH_CP_AG global variable
  • has_separate_all_gather_group() function
  • independent_all_gather parameter from get_data_parallel_group()
  • create_all_gather_group parameter from initialize_model_parallel()
  • --create-all-gather-group CLI argument

Benefits

  • Clean architecture: no new globals in parallel_state.py
  • Explicit data flow: process groups are passed as arguments, not accessed globally
  • Expert Parallelism support: AG/RS overlap now works for MoE models
  • Testability: easier to mock and test with dependency injection
  • Backward compatible: opt-in feature (defaults to None, same behavior as before)

Migration Guide for Users

Before (PR #2663, no longer supported):
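The description is truncated here. Based on the API removed and added above, the migration presumably looks like this (a sketch, not verbatim from the PR; argument placement and group-creation details are assumptions):

```
# Before (global-state API removed by this PR):
initialize_model_parallel(..., create_all_gather_group=True)
ag_group = get_data_parallel_group(with_context_parallel=True,
                                   independent_all_gather=True)

# After (explicit passing, sketch):
pg_collection = ProcessGroupCollection(...)
pg_collection.dp_cp_ag = <all-gather group over the dp-cp ranks>
pg_collection.expt_dp_ag = <all-gather group over the expert-dp ranks>
# pass pg_collection into the Megatron-FSDP setup (mcore_fsdp_adapter)
```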

@jeffnvidia jeffnvidia requested review from a team as code owners February 4, 2026 14:54
@copy-pr-bot

copy-pr-bot bot commented Feb 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ko3n1g ko3n1g requested a review from a team February 4, 2026 14:54
@dimapihtar dimapihtar added the complexity: medium and Expert Review labels Feb 4, 2026
@dimapihtar
Contributor

Hi, @jeffnvidia

could you resolve the merge conflict, please?

@jeffnvidia jeffnvidia mentioned this pull request Feb 4, 2026
@jeffnvidia jeffnvidia force-pushed the add_ag_groups_to_pg_collection branch from cfdfe8b to f1aa8e4 on February 4, 2026 15:15
@jeffnvidia
Contributor Author

> Hi, @jeffnvidia
>
> could you resolve merge conflict, please?

Hi,

yes, resolved

@dimapihtar
Contributor

/ok to test f1aa8e4

Contributor

@shjwudp shjwudp left a comment


Looks good to me for the Megatron-FSDP part.

group.add_argument('--fsdp-manual-registration', action='store_true', dest='fsdp_manual_registration',
                   default=False,
                   help='Manually register the FSDP communication buffers to NCCL user buffer. '
                        'This option is only effective when use-megatron-fsdp and use-nccl-ub is set.')
group.add_argument('--use-sharp', action='store_true',
Contributor


These three arguments should be initialized from DistributedInitConfig.

For reference, see this Megatron-LM commit: adce147#diff-42fa19ec8893eabf951a5bb21edfc0dfe7c9c8949d5087ab133f5897ea0e3213R2203

@jeffnvidia
Contributor Author

not sure why I'm suddenly getting a secrets-detector error; I only deleted lines I myself added

Contributor

@BoxiangW BoxiangW left a comment


LGTM thanks!

@BoxiangW BoxiangW added the Final Review label and removed the Expert Review label Feb 23, 2026
@jeffnvidia
Contributor Author

thanks for the approval, can people from @NVIDIA/mcore-oncall @NVIDIA/core-nemo review it? Thanks a lot

@jeffnvidia jeffnvidia force-pushed the add_ag_groups_to_pg_collection branch 3 times, most recently from f59e385 to 98eecc7 on March 9, 2026 16:09
@jeffnvidia
Contributor Author

> Support this PR going through but to take some baby steps, let's do the following:
>
>   1. Add a Megatron-LM experimental/temporary argument --megatron-fsdp-pg-collection that controls whether we pass pg_collection to FSDP (and only FSDP, i.e. args.use_megatron_fsdp=True, then add it to DP kwargs) so we can unblock this PR but still support the legacy global parallel_state. This slows down the pace of migration but still checks in this migration-ready PR which I think looks great.
>   2. Keep the extra AG groups in the global parallel_state for now. When we migrate, we can deprecate those groups like we do in this PR.

hey @cspades, I edited the PR according to the commit, can you review it ?

Thanks a lot

@jeffnvidia jeffnvidia force-pushed the add_ag_groups_to_pg_collection branch from 98eecc7 to 00cbcf2 on March 10, 2026 09:33
Member

@cspades cspades left a comment


LGTM, thanks for jumping through some hoops on this one.

TODO: Benchmark and migrate to pg_collection.

@cspades
Member

cspades commented Mar 12, 2026

/ok to test f9b5b47

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved label and removed the Final Review label Mar 12, 2026
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
@jeffnvidia jeffnvidia force-pushed the add_ag_groups_to_pg_collection branch from f9b5b47 to d1c1c62 on March 16, 2026 15:21
@jeffnvidia
Contributor Author

hey @cspades @yaoyu-33, do I need to do something else to merge this PR now? Thanks!

@cspades
Member

cspades commented Mar 25, 2026

/ok to test d1c1c62

@cspades cspades requested review from a team as code owners March 25, 2026 14:46
@cspades
Member

cspades commented Mar 25, 2026

/ok to test 56ecd09

'to overlap reduce-scatter and all-gather operations.')
help='Enable AG/RS overlap optimization by creating separate '
'all-gather communicators.')
group.add_argument('--megatron-fsdp-pg-collection', action='store_true',
Contributor


Why are we exposing this knob to support both behaviors? Is there something stopping us that I'm missing?

Contributor


+1, this seems like an implementation detail, not something the user should have to worry about.

Member


Context here should explain the status quo: #3249 (comment)

Contributor


I don't like this either.

@asolergi-nv
Contributor

/claude review

Comment on lines +1403 to +1405
decoder_rank_gen = RankGenerator(
tp=tp_size, ep=1, dp=dp_size, pp=pp_size, cp=cp_size, order='tp-cp-ep-dp-pp', rank_offset=0
)
Contributor


Bug: order and rank_offset are hardcoded here, but initialize_model_parallel() accepts them as parameters. If a user passes a different order (e.g., 'tp-dp-pp-cp-ep') or non-zero rank_offset, this RankGenerator will produce different rank lists than the actual DP groups, silently creating AG groups with wrong membership.

The same issue applies to the expert RankGenerator below (lines 1423-1431).

A simpler and more robust approach would be to get the ranks directly from the already-created groups (which is what the tests in this PR already do):

# Regular DP AG group
all_dp_cp_ranks = get_data_parallel_group_ranks(with_context_parallel=True)
dp_cp_ag_group = None
for ranks_with_cp in all_dp_cp_ranks:  # create_group is collective, so iterate every rank list
    group_with_cp_ag = create_group(
        ranks_with_cp,
        timeout=timeout,
        pg_options=get_nccl_options('dp_cp', nccl_comm_cfgs or {}),
        group_desc='DATA_PARALLEL_GROUP_WITH_CP_AG',
    )
    if rank in ranks_with_cp:
        dp_cp_ag_group = group_with_cp_ag

Or even simpler — since create_group is a collective, you could collect the ranks from _DATA_PARALLEL_GLOBAL_RANKS_WITH_CP (the global that stores all dp-cp rank lists). This avoids re-deriving ranks entirely and guarantees consistency with the initialized state.

Member


@jeffnvidia WDYT?

Contributor

@claude claude bot left a comment


Overall the approach of moving from global state to explicit ProcessGroupCollection passing looks good. One bug flagged inline:

create_all_gather_groups() hardcodes order and rank_offset — the RankGenerator in this function uses order='tp-cp-ep-dp-pp' and rank_offset=0, but initialize_model_parallel() accepts these as configurable parameters. For non-default configurations, the AG groups will silently get wrong rank membership. The fix is straightforward: retrieve ranks from the already-initialized groups instead of re-deriving them.

@cspades
Member

cspades commented Mar 25, 2026

TODO for me to benchmark this PR on PG collection backend, and fully deprecate the global parallel_state for Megatron-FSDP. cc @shjwudp

@jaredcasper jaredcasper self-requested a review March 25, 2026 17:31
@Phlip79 Phlip79 removed the Approved label Mar 25, 2026