Adjusting the communication method in graph mode #1194
ganyi1996ppo merged 1 commit into vllm-project:main
Conversation
Force-pushed from b1227b4 to 9294612
I have cancelled the e2e test for a quick fix of the DeepSeek problem. Please recheck the CI later. Sorry about this.
Could you add performance test logs and benchmark result details to the PR description?
Force-pushed from 983b820 to 6f640c7
Force-pushed from 8e79e3d to 7c32c3f
Force-pushed from 235d1dc to 981a6df
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from d635d22 to 16d9dce
Force-pushed from d6d17d2 to f3affb8
Force-pushed from 3d09b8e to 0b315f8
Force-pushed from 5e67447 to cc59238
Codecov Report ❌ Patch coverage is [value not captured].
Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #1194     +/-  ##
==========================================
- Coverage   27.39%   27.04%   -0.36%
==========================================
  Files          56       56
  Lines        6191     6275      +84
==========================================
+ Hits         1696     1697       +1
- Misses       4495     4578      +83
```
45d2385 to
0169aed
Compare
Signed-off-by: sharonyunyun <zhangying134@huawei.com>
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with reduce_scatter + all_gather in the MLA layer's TP group, which removes the strided slice and all_gather in the MoE layer.
When tp > 1, the optimization is enabled during the decode phase of graph mode, provided enable_multistream_moe, MLA, use_v1, and MC2 are all in use.
According to end-to-end RL inference test results, this PR brings about a 3% gain in the decode stage.
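The core idea above is that an allreduce over the TP group can be decomposed into a reduce_scatter followed by an all_gather, which lets intermediate work operate on 1/tp-sized shards and makes the later MoE all_gather redundant. The following is a minimal single-process sketch (NumPy, simulating tp ranks as list entries, not the PR's actual HCCL code) showing that the two collective patterns produce identical results:

```python
import numpy as np

def allreduce(shards):
    """Baseline pattern: every rank ends up with the full elementwise sum."""
    total = np.sum(shards, axis=0)
    return [total.copy() for _ in shards]

def reduce_scatter_then_all_gather(shards):
    """Decomposed pattern: reduce_scatter, then all_gather of the shards."""
    tp = len(shards)
    # reduce_scatter: each rank r reduces (sums) only slice r of the tensor,
    # so between the two collectives every rank holds a 1/tp-sized result.
    full = np.sum(shards, axis=0)
    slices = np.split(full, tp)  # slices[r] is what rank r would hold
    # all_gather: ranks exchange their reduced slices and concatenate them,
    # reconstructing the same full tensor allreduce would have produced.
    gathered = np.concatenate(slices)
    return [gathered.copy() for _ in range(tp)]
```

With real collectives (e.g. torch.distributed.reduce_scatter_tensor then all_gather_into_tensor) the benefit is that per-rank compute between the two steps runs on the smaller shard; the sketch only demonstrates the numerical equivalence.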
**Before Improvement**
- Profiling kernel_details: (screenshots not recoverable)
- Evaluation: (screenshots not recoverable)
**After Improvement**
- Profiling kernel_details: (screenshots not recoverable)
- Evaluation: (screenshots not recoverable)
### Does this PR introduce any user-facing change?
Users need to set enable_multistream_moe=True to enable this optimization.
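As a hypothetical illustration of how that flag might be passed (the exact nesting of additional_config keys is an assumption and may differ between vllm-ascend versions; check the project docs for the authoritative schema):

```python
import json

# Assumed config shape: enable_multistream_moe nested under the
# torchair graph-mode config inside vllm-ascend's additional_config.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,                 # graph mode (the optimization applies in decode)
        "enable_multistream_moe": True,  # required to enable this PR's path
    },
}

# e.g. passed on the CLI as: vllm serve ... --additional-config '<json>'
print(json.dumps(additional_config))
```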
### How was this patch tested?
Added e2e test cases to cover the code logic.