add all_gather process-group for overlapping in FSDP distributed training #2663
shjwudp merged 5 commits into NVIDIA:main
Conversation
force-pushed from d924e10 to 5b61c81
I am having this warning on the PR, any idea why? [MCORE][MultiGroupMemPoolAllocator] Failed to deregister mem pool from <torch.distributed.distributed_c10d.ProcessGroup object at 0x400228ee6170> (DATA_PARALLEL_GROUP_WITH_CP_AG) group!! Cursor said it's normal behavior, but I'm suspicious.
force-pushed from c5d7a3f to 859859e
    all_gather_ops = []
    if self.dist_index.get_fsdp_group(is_expert_parallel=False, all_gather=True) is not None:
        # All-gather group when overlapping
Can we have a more accurate description? For example:
"All-gather group used when overlapping all-gather and gradient reduction."
No problem, changed to your description.
youngeunkwon0405 left a comment
Logic-wise LGTM. I left some comments regarding variable names and comments.
megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py
    # All-gather the module weights in each buffer shard into the allocated bucket.
    # Now each rank will have a copy of this FSDP unit module's weights.
    if self.buffer.dist_index.get_fsdp_group(is_expert_parallel=False, all_gather=True) is not None:
        # All-gather group when overlapping
Could you please write a more specific comment that anyone without the context could understand what you are trying to do?
Sure, let me know if it's good now.
force-pushed from 859859e to 453f26a
@jeffnvidia, thanks for creating this PR to add all_gather process group. I have two comments:
Let me know if it makes sense to you. Thanks.
Hey Sheng, thanks for your comments.
I can add it in a separate PR; I don't think it would require much work. The reason I didn't do it now is that, as far as I know, we have never tested it yet. Once we move on to testing EP, I can create a new PR.
I've refactored per your recommendation: let me know if it fits what you had in mind.
force-pushed from dff91d4 to d6dd2a6
I just realized that if EP is enabled, there will be a different DP group for MoE layers. I think if you run this with EP, it will not show the expected behavior. I think you have to modify the code for this, or at least insert an assertion about this. @shjwudp, please correct me if I am wrong.
In the code, I check whether expert parallelism is on, and only make the changes when it is not, which is the scope of the POC.
@jeffnvidia, thanks for the update. The updated code looks good to me. It is your call if you want another PR to support EP. However, more recent models all have EP. I already have a Llama4 setup, and plan to try overlapped AG+RS with Llama4 soon.
I don't see the code part that checks this.
It's line 1890: https://github.com/NVIDIA/Megatron-LM/pull/2663/files#diff-da62f73a7a6a4ac7815ed316a147ba348a7915e35a2f4885ceaf1678e5f650fbR1890. If expert parallelism is on, it won't even try to create the separate group.
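The guard described here can be sketched roughly as follows. This is an illustrative sketch only; the function and argument names are hypothetical, not Megatron-LM's actual code. The idea is that the dedicated all-gather group is only created when expert parallelism is off, since MoE layers use a different data-parallel group that this POC does not cover.

```python
from typing import Callable, Optional


def maybe_create_all_gather_group(
    is_expert_parallel: bool, create_group: Callable[[], object]
) -> Optional[object]:
    """Hypothetical guard: skip the dedicated all-gather group under EP."""
    if is_expert_parallel:
        # EP enabled: MoE layers use their own DP group, so fall back to
        # the existing shared process group (return None here).
        return None
    return create_group()


# With EP on, no dedicated group is created; with EP off, it is.
assert maybe_create_all_gather_group(True, lambda: "ag_group") is None
assert maybe_create_all_gather_group(False, lambda: "ag_group") == "ag_group"
```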
force-pushed from b8dcd85 to 852aa57
force-pushed from 852aa57 to caf8de5
I rebased on main to fix the merging error. @youngeunkwon0405, could you re-run the CI? Could you approve the PR @jaredcasper as we said, and I'll start working immediately on a PR that puts everything in order. Thanks a lot.
/ok to test caf8de5
Thanks Youngeun, we're hitting the same CI/CD bug again, which seems to be random. I don't know why, or whether it's a blocker.
Got it, I will re-initiate the failed runs.
Hi guys, who do I need to get approval from to finally merge this PR? @ericharper @jaredcasper @NVIDIA/core-adlr?
Needs an approval from @NVIDIA/core-adlr or else it won't merge. It looks like concerns pertaining to MCore PG management have not been resolved.
And just to chime in here, |
What does this PR do?
This PR intends to separate the all_gather process group from reduce-scatter and the other operations. The goal is to overlap these two collectives, which, when combined with SHARP, has been shown to greatly improve performance (~15%).
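The separation can be sketched with plain torch.distributed. This is an illustrative sketch under assumptions, not the PR's actual code: two process groups are built over the same ranks so the parameter all-gather and gradient reduction run on independent communicators, which is what allows the runtime (and SHARP on the reduction group) to schedule them concurrently. It uses the gloo backend and a single rank so it runs on CPU; the variable names are made up.

```python
import os
import torch
import torch.distributed as dist

# Single-rank gloo setup so the sketch runs on CPU without launchers.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

ranks = list(range(dist.get_world_size()))
# Two groups over the SAME ranks: one for reduce-scatter and other ops,
# one dedicated to the parameter all-gather, so the collectives can overlap.
rs_group = dist.new_group(ranks=ranks)
ag_group = dist.new_group(ranks=ranks)

# Parameters are all-gathered on the dedicated group...
params = torch.ones(4)
gathered = [torch.empty(4) for _ in ranks]
dist.all_gather(gathered, params, group=ag_group)

# ...while gradient reduction uses the other group's communicator.
# (all_reduce stands in for reduce-scatter, which gloo handles poorly.)
grads = torch.full((4,), 2.0)
dist.all_reduce(grads, group=rs_group)

dist.destroy_process_group()
```

With a real NCCL multi-rank setup, the two collectives would be issued on separate communicators (and typically separate streams), which is the precondition for the AG+RS overlap this PR targets.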
Contribution process
Pre-checks → PR Tests → Code Review/Approval (Expert Review → Final Review) → Merge
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into `main` branch
(Step 1): Add PR label. Add the Expert Review label when your PR is ready for review.
(Step 2): Collect the expert reviewers' reviews. Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review. Add the Final Review label.
(Optional Step 4): Cherry-pick into release branch. If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.