[BugFix] add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant #819
Merged
ganyi1996ppo merged 15 commits into vllm-project:main on May 15, 2025
Conversation
added 2 commits on May 12, 2025
Signed-off-by: angazenn <zengyanjia@huawei.com>
force-pushed from af2215a to d6cf7a1, then from b2ca756 to e0ab8e0
added 4 commits on May 13, 2025
Signed-off-by: angazenn <zengyanjia@huawei.com>
```python
dist.all_to_all_single(gather_sizes,
                       scatter_sizes,
                       group=ep_group.device_group)
scatter_size_list = scatter_sizes.cpu().tolist()
```
Collaborator
This may introduce a serious performance regression; please note that we will change this in the future.
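The concern here is likely that `scatter_sizes.cpu().tolist()` forces a device-to-host synchronization right after the collective, stalling the compute stream. The size exchange itself is simple: each rank tells every peer how many tokens it will send, and the result is the transpose of that send-count matrix. A pure-Python sketch of those semantics (function name and shapes are illustrative, not the PR's API):

```python
def exchange_sizes(scatter_sizes_per_rank):
    """Simulate all_to_all_single over per-rank send counts.

    scatter_sizes_per_rank[r][p] = number of tokens rank r sends to rank p.
    Returns gather_sizes_per_rank[r][p] = tokens rank r receives from rank p,
    i.e. the transpose of the send-count matrix.
    """
    world = len(scatter_sizes_per_rank)
    return [[scatter_sizes_per_rank[p][r] for p in range(world)]
            for r in range(world)]

# 2 ranks: rank 0 sends 2 tokens to itself and 3 to rank 1, and so on.
scatter = [[2, 3], [1, 4]]
gather = exchange_sizes(scatter)
print(gather)  # [[2, 1], [3, 4]]
```

On device, the same exchange runs as one collective, but turning the result into Python lists still requires copying it back to the host, which is where the synchronization cost comes from.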
gather_dim, scatter_sizes,
gather_sizes)

def reduce_scatter(self,
Collaborator
Remove this if you do not use it.
attn_metadata = get_forward_context().attn_metadata
if attn_metadata is None:
    # During profile runs, force expert load balancing to avoid high memory
    # consumption on a single rank.
Collaborator
Please add more comments explaining this.
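To unpack that review request: during the profile run there is no real routing decision to respect, so the PR forces every expert to receive an equal share of tokens, preventing one rank's experts from absorbing the whole batch and spiking memory. A minimal sketch of such forced balancing (pure Python; the helper name is hypothetical, not the PR's actual code):

```python
from collections import Counter

def balanced_topk_ids(num_tokens, topk, num_experts):
    """Assign expert ids round-robin over the flattened (token, topk) slots
    so every expert sees a near-equal number of tokens during profiling."""
    flat = [i % num_experts for i in range(num_tokens * topk)]
    return [flat[t * topk:(t + 1) * topk] for t in range(num_tokens)]

ids = balanced_topk_ids(num_tokens=8, topk=2, num_experts=4)
counts = Counter(e for row in ids for e in row)
print(counts)  # every expert gets exactly 4 slots here
```

Since peak memory during profiling determines how much the runtime reserves, a worst-case-skewed (or fully balanced, depending on what you want to measure) assignment is deliberate rather than accidental, which is presumably why the reviewer asked for more comments.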
added 2 commits on May 13, 2025
Signed-off-by: angazenn <zengyanjia@huawei.com>
force-pushed from 3349fbd to 30eafb9
Signed-off-by: angazenn <zengyanjia@huawei.com>
ganyi1996ppo approved these changes on May 15, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request on May 17, 2025
…_swiglu_quant (vllm-project#819)" This reverts commit 1e67089.
Contributor
This is the best solution for A3 performance. Is there a best solution for A2 performance as well?
ganyi1996ppo pushed a commit that referenced this pull request on May 24, 2025

### What this PR does / why we need it?
This PR fixes two accuracy bugs introduced by PR #819 when running deepseekv3 series models:
1. #819 adds `all_to_all` communication in quantized cases, but `all_gather` && `reduce_scatter` were removed in both the quantized and unquantized cases. When running unquantized deepseekv3 models with `ep_size == world_size`, the MoE modules fail to communicate. This PR therefore adds `all_to_all` communication in the unquantized case as well to fix the accuracy issue.
2. Use `ep_size` rather than `dp_size` to decide whether to use `all_to_all` in MoE.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
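The second fix above, keying the dispatch on `ep_size` instead of `dp_size`, can be sketched as a tiny selector (a hypothetical helper, not the actual vllm-ascend code path):

```python
def select_moe_comm(ep_size: int) -> str:
    """Pick the MoE token-dispatch strategy.

    The bug fixed here: the choice must follow expert parallelism (ep_size),
    not data parallelism (dp_size). With ep_size > 1 the experts are sharded
    across ranks, so tokens must be exchanged between expert-parallel ranks
    via all_to_all; otherwise each rank holds all experts locally and no
    token exchange is needed.
    """
    return "all_to_all" if ep_size > 1 else "no_dispatch"

print(select_moe_comm(8))  # all_to_all
print(select_moe_comm(1))  # no_dispatch
```

Keying on `dp_size` breaks exactly the `ep_size == world_size` configuration mentioned above, where data parallelism may be 1 even though the experts are fully sharded.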
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request on May 30, 2025
…oject#897)
zxdukki pushed a commit to zxdukki/vllm-ascend that referenced this pull request on Jun 3, 2025
…oject#897)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request on Oct 16, 2025
…quant (vllm-project#819)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request on Oct 16, 2025
…oject#897)
Angazenn added a commit to Angazenn/vllm-ascend that referenced this pull request on Oct 21, 2025
…quant (vllm-project#819)
Angazenn added a commit to Angazenn/vllm-ascend that referenced this pull request on Oct 21, 2025
…oject#897)
What this PR does / why we need it?
1. This PR introduces a native `all_to_all` communication operator to fix `allgather` bugs when `dp_size > 1`. Besides, it adds a naive implementation of forced expert load balancing during profile runs.
2. The operator `npu_dequant_swiglu_quant` only supports input hidden_states with dtype `torch.int32`. This tensor occupies space of `global_bs * seq_len * topk * hidden_size`, which can become very large as `ep_size` grows. Therefore we disable this operator and use the original `swiglu` && `quantize` instead.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
By performing offline inference:
