Enable AG/RS overlap with explicit process group passing #3249
base: main
Changes from all commits: 97d9da8, 3df8ef5, d1c1c62, 56ecd09
```diff
@@ -2691,8 +2691,13 @@ def _add_distributed_args(parser):
                        default=False, help='Manually register the FSDP communication buffers to NCCL user buffer.'
                        'This option is only effective when use-megatron-fsdp and use-nccl-ub is set.')
     group.add_argument('--create-all-gather-group', action='store_true',
-                       help='Create a separate process group for all-gather operations '
-                       'to overlap reduce-scatter and all-gather operations.')
+                       help='Enable AG/RS overlap optimization by creating separate '
+                       'all-gather communicators.')
+    group.add_argument('--megatron-fsdp-pg-collection', action='store_true',
+                       default=False,
+                       help='[Experimental] Pass ProcessGroupCollection explicitly to '
+                       'Megatron-FSDP instead of using global parallel_state. '
+                       'Only effective when --use-megatron-fsdp is set.')
     group.add_argument('--data-parallel-sharding-strategy', type=str, default='no_shard',
                        choices=['no_shard', 'optim', 'optim_grads', 'optim_grads_params'],
                        help='Sharding strategy of data parallelism.')
```

**Contributor:** Why are we exposing this knob supporting both behaviors? Is there something stopping us that I'm missing?

**Contributor:** +1, this seems like an implementation detail, not something the user should have to worry about.

**Member:** Context here should explain the status quo: #3249 (comment)

**Contributor:** I don't like this either.
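The `--megatron-fsdp-pg-collection` flag switches Megatron-FSDP from reading communicators out of the global `parallel_state` to receiving them explicitly. A minimal sketch of that dispatch follows; the field names and the fallback hook are invented for illustration (the real `ProcessGroupCollection` holds `torch.distributed` process groups):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ProcessGroupCollection:
    # In real Megatron-FSDP these fields are torch.distributed process
    # groups; plain objects stand in here so the dispatch is testable
    # without initializing a distributed backend. Field names are invented.
    dp_cp: Any                        # combined data-parallel + context-parallel group
    all_gather: Optional[Any] = None  # separate AG group, used for AG/RS overlap

def resolve_groups(pg_collection=None, get_global_dp_cp_group=None):
    """Prefer an explicitly passed collection; otherwise fall back to a
    global-state lookup (the behavior when the flag is unset)."""
    if pg_collection is not None:
        return pg_collection
    return ProcessGroupCollection(dp_cp=get_global_dp_cp_group())
```

Passing the groups explicitly makes the dependency visible at the call site, which is what the reviewers are weighing against keeping it an internal detail of `parallel_state`.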
Bug: `order` and `rank_offset` are hardcoded here, but `initialize_model_parallel()` accepts them as parameters. If a user passes a different `order` (e.g., `'tp-dp-pp-cp-ep'`) or a non-zero `rank_offset`, this `RankGenerator` will produce different rank lists than the actual DP groups, silently creating AG groups with wrong membership.

The same issue applies to the expert `RankGenerator` below (lines 1423-1431).

A simpler and more robust approach would be to get the ranks directly from the already-created groups (which is what the tests in this PR already do).

Or even simpler: since `create_group` is a collective, you could collect the ranks from `_DATA_PARALLEL_GLOBAL_RANKS_WITH_CP` (the global that stores all dp-cp rank lists). This avoids re-deriving ranks entirely and guarantees consistency with the initialized state.
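The order dependence described in this comment can be demonstrated with a small stand-alone enumerator. This is a simplified illustration, not Megatron's actual `RankGenerator`:

```python
from itertools import product

def dp_rank_groups(tp: int, dp: int, pp: int, order: str):
    """Enumerate data-parallel rank groups for a given axis `order`.

    Ranks are laid out row-major over the axes in `order`, with the
    first-listed axis varying fastest (a simplified stand-in for
    Megatron's RankGenerator, for illustration only).
    """
    sizes = {'tp': tp, 'dp': dp, 'pp': pp}
    strides, s = {}, 1
    for ax in order.split('-'):
        strides[ax] = s
        s *= sizes[ax]
    other = [ax for ax in order.split('-') if ax != 'dp']
    groups = []
    for coords in product(*(range(sizes[ax]) for ax in other)):
        base = sum(c * strides[ax] for c, ax in zip(coords, other))
        groups.append([base + d * strides['dp'] for d in range(dp)])
    return groups

# The same world size yields different DP memberships under different orders,
# which is how a hardcoded `order` can silently diverge from the groups that
# initialize_model_parallel() actually built.
assert dp_rank_groups(2, 2, 2, 'tp-dp-pp') != dp_rank_groups(2, 2, 2, 'tp-pp-dp')
```

With eight ranks, `'tp-dp-pp'` puts ranks {0, 2} in one DP group while `'tp-pp-dp'` puts {0, 4} together, so any code that re-derives ranks with a fixed order can build AG groups whose membership disagrees with the real DP groups.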
@jeffnvidia WDYT?