Conversation
@ch-wan Could you please take a look?
Summary of Changes
Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request optimizes the communication strategy for Deepseek models when running with Data Parallelism (DP) and padding enabled. By replacing the standard all-reduce operation with reduce-scatter after Mixture-of-Experts (MoE) and Multi-Layer Perceptron (MLP) layers, the changes aim to improve performance and efficiency in these distributed serving scenarios.
Highlights
- Communication Optimization: Introduced `dp_reduce_scatter_tensor` to enable more efficient communication for Data Parallel (DP) operations, specifically for padded inputs.
- Conditional Communication Strategy: Implemented logic within the layer communicator to conditionally use reduce-scatter instead of scatter on hidden states when DP with padding is active, optimizing data movement.
- MLP/MoE All-Reduce Skip: Modified the Deepseek model's MLP and MoE layers to skip the all-reduce operation after their forward pass when reduce-scatter is being utilized, preventing redundant communication.
- Parameter Renaming: Renamed the `can_fuse_mlp_allreduce` parameter to `skip_all_reduce` in linear layers for broader applicability and clarity regarding when all-reduce should be bypassed.
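The equivalence these highlights rely on can be checked without a distributed backend: an all-reduce followed by an even scatter yields the same per-rank shard as a single reduce-scatter, while the fused collective avoids materializing the full reduced tensor on every rank. The sketch below simulates both paths in plain Python; the collective functions here are illustrative stand-ins, not the `torch.distributed` API.

```python
# Simulate the collective equivalence this PR exploits: all_reduce followed
# by scatter produces the same per-rank shard as a single reduce_scatter,
# but first copies the full reduced tensor to every rank.

def all_reduce(tensors):
    """Elementwise-sum the tensors held by each rank; every rank gets the sum."""
    total = [sum(vals) for vals in zip(*tensors)]
    return [total[:] for _ in tensors]

def scatter(full, world_size, rank):
    """Give each rank its contiguous 1/world_size slice (assumes even padding)."""
    chunk = len(full) // world_size
    return full[rank * chunk:(rank + 1) * chunk]

def reduce_scatter(tensors, world_size):
    """Sum across ranks, then each rank keeps only its own slice."""
    total = [sum(vals) for vals in zip(*tensors)]
    chunk = len(total) // world_size
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]

world_size = 2
# Padded hidden states: each rank holds a full partial-sum tensor.
rank_tensors = [[1, 2, 3, 4], [10, 20, 30, 40]]

# Old path: all-reduce in the MLP/MoE, then scatter in the communicator.
reduced = all_reduce(rank_tensors)
old = [scatter(reduced[r], world_size, r) for r in range(world_size)]

# New path: one fused reduce-scatter in the communicator.
new = reduce_scatter(rank_tensors, world_size)

assert old == new  # both give [[11, 22], [33, 44]]
```

Note the precondition: the fusion is only valid when every rank's tensor has the same padded shape, which is why the PR gates it on DP with padding.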
Code Review
This pull request introduces an optimization to use reduce_scatter instead of all_reduce for data parallelism (DP) when padding is enabled, which is a good performance enhancement. The changes are logical and well-contained. My main feedback focuses on improving code clarity and maintainability by reducing coupling between components and clarifying function names.
Force-pushed from 135e2a7 to e919208
LGTM.
Force-pushed from bd50760 to eb82cf3
ch-wan left a comment

Nice work. I have approved the PR. Could you clean up merge conflicts?
Oh, I have another question. This communicator is also used in llama and qwen. I believe your optimization can be applied to these models as well.
Force-pushed from eb82cf3 to 50e24ec
@ch-wan Thank you, I have fixed the conflicts. Yes, I believe this can be applied to those models too - can this be done in a follow-up PR?

@merrymercy Do you mind reviewing this PR since you are a code owner?
Force-pushed from 7fd0823 to 810b2d7
Motivation
Similar to how #8280 enables all-gather for DP when using padding, this PR enables reduce-scatter instead of all-reduce following MoE/MLP layers in Deepseek.
Modifications
If DP with padding is used, all-reduce is skipped after MoE and MLP. The layer communicator will perform reduce-scatter on hidden states instead of just scatter.
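A minimal sketch of that conditional, under stated assumptions: `skip_all_reduce` is the renamed parameter from this PR, but `CommContext`, `should_use_reduce_scatter`, and the surrounding structure are hypothetical simplifications, not the actual SGLang implementation.

```python
# Hedged sketch (not the real SGLang code) of how the decision described
# above could be wired: when DP with padded inputs is active, the layer
# communicator reduce-scatters the hidden states, so the MLP/MoE layer
# skips its usual post-forward all-reduce.

from dataclasses import dataclass

@dataclass
class CommContext:
    dp_size: int   # data-parallel world size
    padded: bool   # inputs padded to equal length on every rank

def should_use_reduce_scatter(ctx: CommContext) -> bool:
    # Reduce-scatter is only valid when every rank holds a same-shaped
    # (padded) tensor that will be sharded anyway.
    return ctx.dp_size > 1 and ctx.padded

def mlp_forward(ctx: CommContext) -> list[str]:
    """Return the sequence of ops the MLP/MoE layer would run."""
    skip_all_reduce = should_use_reduce_scatter(ctx)
    ops = ["matmul"]
    if not skip_all_reduce:
        ops.append("all_reduce")  # old path: reduce inside the layer
    return ops

def communicator_postprocess(ctx: CommContext) -> str:
    """Collective the layer communicator applies to hidden states."""
    # Old path scatters already-reduced states; new path fuses both steps.
    return "reduce_scatter" if should_use_reduce_scatter(ctx) else "scatter"

assert mlp_forward(CommContext(dp_size=2, padded=True)) == ["matmul"]
assert communicator_postprocess(CommContext(dp_size=2, padded=True)) == "reduce_scatter"
assert communicator_postprocess(CommContext(dp_size=1, padded=False)) == "scatter"
```

Keeping the predicate in one place is what makes the follow-up extension to other `LayerCommunicator` users (llama, qwen) straightforward.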
Currently this is implemented for Deepseek, but we can easily extend it later to other models that use `LayerCommunicator`.

Accuracy Test
Benchmark & Profiling
with PR:
without PR:
Checklist