【main】SP For Qwen3 MoE #2209
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
vllm_ascend/ops/sequence_parallel.py (outdated)
```python
    get_tp_group, tensor_model_parallel_all_gather,
    tensor_model_parallel_reduce_scatter)
from vllm.forward_context import get_forward_context
from vllm.platforms import current_platform
```
Do not use current_platform in vllm-ascend; import vllm_ascend.platform directly.
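For illustration, the substitution being requested might look like the sketch below; the exact symbol exported by vllm_ascend.platform is an assumption here, not verified against the repo.
```python
# Discouraged in vllm-ascend (per the review above):
# from vllm.platforms import current_platform

# Preferred: import the Ascend platform module directly.
# NPUPlatform is an assumed symbol name, for illustration only.
from vllm_ascend.platform import NPUPlatform
```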
```python
        )
        self.mc2_mask[:lengths_sum_unpadding] = True

    def padding_aligned_reduce_scatter(self,
```
These functions duplicate the pad and unpad functions in flashcommv1; can we consolidate them?
Thank you for your suggestion. We chose not to adopt the flashcomm1 implementation from the 091 branch for two reasons:
- The existing flashcomm1 implementations for Qwen2 and Qwen3 in the repository are inconsistent. We've created a model-level interface here to minimize the SP migration effort for sparse models.
- flashcomm1's graph-mode support for sparse models like Qwen2 and Qwen3 isn't currently available in the main branch. Merging it would impact graph-mode performance, so we're keeping it separate for now. Note that merging Qwen3 MoE's SP implementation won't affect the current status.
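For readers following this thread, here is a minimal sketch of the padding-aligned reduce-scatter/unpad pattern under discussion, assuming a flat (num_tokens, hidden) activation layout; the function names and signatures are illustrative, not the PR's actual helpers.
```python
import torch
import torch.distributed as dist


def pad_to_tp_multiple(x: torch.Tensor, tp_size: int):
    """Pad the token dim so it divides evenly across TP ranks."""
    pad_len = (-x.shape[0]) % tp_size
    if pad_len:
        # Pad rows at the end: (last-dim pads, then token-dim pads).
        x = torch.nn.functional.pad(x, (0, 0, 0, pad_len))
    return x, pad_len


def padded_reduce_scatter(x: torch.Tensor, tp_group):
    """Reduce-scatter a padded activation; return this rank's shard plus
    the pad length needed to unpad after the matching all-gather."""
    tp_size = dist.get_world_size(group=tp_group)
    x, pad_len = pad_to_tp_multiple(x, tp_size)
    shard = torch.empty((x.shape[0] // tp_size, *x.shape[1:]),
                        dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, x, group=tp_group)
    return shard, pad_len


def gather_and_unpad(shard: torch.Tensor, pad_len: int, tp_group):
    """Inverse step: all-gather the shards, then drop the padding rows."""
    tp_size = dist.get_world_size(group=tp_group)
    full = torch.empty((shard.shape[0] * tp_size, *shard.shape[1:]),
                       dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(full, shard, group=tp_group)
    return full[:full.shape[0] - pad_len] if pad_len else full
```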
This pull request has conflicts, please resolve those before we can evaluate the pull request.
@lbk-sys can you list out your testing CLI (e.g., vllm serve) and your testing datasets? Thanks.
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main    #2209      +/-   ##
==========================================
- Coverage   76.65%   76.09%   -0.56%
==========================================
  Files         113      114       +1
  Lines       12763    13103     +340
==========================================
+ Hits         9783     9971     +188
- Misses       2980     3132     +152
```
Thank you for your attention. For accuracy testing, we ran the AIME dataset. For performance testing, we ran benchmarks using vLLM and offline scripts. The model's input follows the t*h format, so as long as the number of input tokens in the P-phase meets this requirement, the gains hold regardless of the dataset.
We also tested with the DeepScaler dataset.
```python
with VllmRunner(
        snapshot_download("Qwen/Qwen3-30B-A3B"),
        dtype="auto",
        tensor_parallel_size=4,
```
There are just 2 cards on the CI machine; let's reduce the TP size to 2.
Done, thanks.
### What this PR does / why we need it?
This PR adds sequence parallelism (SP) support for Qwen3 MoE. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather yields computational savings in the norm operations while eliminating one AllGather communication. The feature is enabled during the P-phase (prefill) and delivers notable gains in long-sequence scenarios (e.g., 16k–25k tokens), with performance improvements of 5%–10%.
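To make the claim concrete: an AllReduce is mathematically a Reduce-Scatter followed by an AllGather, and RMSNorm is row-wise (per token), so it can run on the scattered shards; each rank then normalizes only 1/TP of the tokens. Below is a minimal single-process sketch, with illustrative shapes and a hand-rolled RMSNorm (not the PR's code).
```python
import torch

tp, tokens, hidden = 4, 8, 16
# Per-rank partial results that an AllReduce would normally sum.
partials = [torch.randn(tokens, hidden) for _ in range(tp)]


def rms_norm(x, eps=1e-6):  # row-wise (per token), so it shards cleanly
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


# Baseline: AllReduce, then every rank norms all `tokens` rows.
baseline = rms_norm(sum(partials))

# SP: Reduce-Scatter leaves each rank tokens/tp reduced rows; the norm
# runs on that shard only, and an AllGather reassembles the sequence.
rows = tokens // tp
shards = [sum(p[r * rows:(r + 1) * rows] for p in partials)
          for r in range(tp)]
sp_result = torch.cat([rms_norm(s) for s in shards])

assert torch.allclose(baseline, sp_result, atol=1e-5)
```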
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```
compilation_config={
"pass_config":{
"enable_sequence_parallelism": True
}
},
enable_expert_parallel=True,
```
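For reference, a minimal offline sketch of how these settings can be passed to vLLM; the model name and TP size are taken from the PR's test above, while the prompt and sampling parameters are illustrative only.
```python
from vllm import LLM, SamplingParams

# Assumes a 4-NPU setup matching the PR's test configuration.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    compilation_config={
        "pass_config": {
            "enable_sequence_parallelism": True,
        },
    },
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```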
- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@9edd1db
---------
Signed-off-by: libaokui <[email protected]>
Co-authored-by: libaokui <[email protected]>