[BugFix] add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant #819
Merged
ganyi1996ppo merged 15 commits into vllm-project:main on May 15, 2025
Conversation
added 2 commits on May 12, 2025
Signed-off-by: angazenn <zengyanjia@huawei.com>
force-pushed from af2215a to d6cf7a1, then from b2ca756 to e0ab8e0
added 4 commits on May 13, 2025
Signed-off-by: angazenn <zengyanjia@huawei.com>
```python
dist.all_to_all_single(gather_sizes,
                       scatter_sizes,
                       group=ep_group.device_group)
scatter_size_list = scatter_sizes.cpu().tolist()
```
Collaborator
This may introduce a serious performance regression; please note that we will change this in the future.
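The concern here is likely that `scatter_sizes.cpu().tolist()` forces a device-to-host synchronization right after the collective, stalling the compute stream. The size exchange itself is simple: each rank tells every peer how many tokens it will send, and the result is the transpose of that send-count matrix. A pure-Python sketch of those semantics (function name and shapes are illustrative, not the PR's API):

```python
def exchange_sizes(scatter_sizes_per_rank):
    """Simulate all_to_all_single over per-rank send counts.

    scatter_sizes_per_rank[r][p] = number of tokens rank r sends to rank p.
    Returns gather_sizes_per_rank[r][p] = tokens rank r receives from rank p,
    i.e. the transpose of the send-count matrix.
    """
    world = len(scatter_sizes_per_rank)
    return [[scatter_sizes_per_rank[p][r] for p in range(world)]
            for r in range(world)]

# 2 ranks: rank 0 sends 2 tokens to itself and 3 to rank 1, and so on.
scatter = [[2, 3], [1, 4]]
gather = exchange_sizes(scatter)
print(gather)  # [[2, 1], [3, 4]]
```

On device, the same exchange runs as one collective, but turning the result into Python lists still requires copying it back to the host, which is where the synchronization cost comes from.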
gather_dim, scatter_sizes,
gather_sizes)

def reduce_scatter(self,
Collaborator
Remove this if you do not use it.
attn_metadata = get_forward_context().attn_metadata
if attn_metadata is None:
    # During profile runs, force expert load balancing to avoid high memory
    # consumption on a single rank.
Collaborator
Please add more comments explaining this.
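To unpack that review request: during the profile run there is no real routing decision to respect, so the PR forces every expert to receive an equal share of tokens, preventing one rank's experts from absorbing the whole batch and spiking memory. A minimal sketch of such forced balancing (pure Python; the helper name is hypothetical, not the PR's actual code):

```python
from collections import Counter

def balanced_topk_ids(num_tokens, topk, num_experts):
    """Assign expert ids round-robin over the flattened (token, topk) slots
    so every expert sees a near-equal number of tokens during profiling."""
    flat = [i % num_experts for i in range(num_tokens * topk)]
    return [flat[t * topk:(t + 1) * topk] for t in range(num_tokens)]

ids = balanced_topk_ids(num_tokens=8, topk=2, num_experts=4)
counts = Counter(e for row in ids for e in row)
print(counts)  # every expert gets exactly 4 slots here
```

Since peak memory during profiling determines how much the runtime reserves, a worst-case-skewed (or fully balanced, depending on what you want to measure) assignment is deliberate rather than accidental, which is presumably why the reviewer asked for more comments.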
added 2 commits on May 13, 2025
Signed-off-by: angazenn <zengyanjia@huawei.com>
force-pushed from 3349fbd to 30eafb9
Signed-off-by: angazenn <zengyanjia@huawei.com>
ganyi1996ppo approved these changes on May 15, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request on May 17, 2025
…_swiglu_quant (vllm-project#819)" This reverts commit 1e67089.
Contributor
This is the best solution for A3 performance. Is there a best solution for A2 performance as well?
ganyi1996ppo pushed a commit that referenced this pull request on May 24, 2025

### What this PR does / why we need it?
This PR fixes two accuracy bugs introduced by PR #819 when running deepseekv3 series models:
1. #819 adds `all_to_all` communication in quantized cases, but `all_gather` && `reduce_scatter` were removed in both the quantized and unquantized cases. When running unquantized deepseekv3 models with `ep_size == world_size`, the MoE modules fail to communicate. This PR therefore adds `all_to_all` communication in the unquantized case as well to fix the accuracy issue.
2. Use `ep_size` rather than `dp_size` to decide whether to use `all_to_all` in MoE.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
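The second fix above, keying the dispatch on `ep_size` instead of `dp_size`, can be sketched as a tiny selector (a hypothetical helper, not the actual vllm-ascend code path):

```python
def select_moe_comm(ep_size: int) -> str:
    """Pick the MoE token-dispatch strategy.

    The bug fixed here: the choice must follow expert parallelism (ep_size),
    not data parallelism (dp_size). With ep_size > 1 the experts are sharded
    across ranks, so tokens must be exchanged between expert-parallel ranks
    via all_to_all; otherwise each rank holds all experts locally and no
    token exchange is needed.
    """
    return "all_to_all" if ep_size > 1 else "no_dispatch"

print(select_moe_comm(8))  # all_to_all
print(select_moe_comm(1))  # no_dispatch
```

Keying on `dp_size` breaks exactly the `ep_size == world_size` configuration mentioned above, where data parallelism may be 1 even though the experts are fully sharded.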
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request on May 30, 2025
…oject#897)
zxdukki pushed a commit to zxdukki/vllm-ascend that referenced this pull request on Jun 3, 2025
…oject#897)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request on Oct 16, 2025
…quant (vllm-project#819)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request on Oct 16, 2025
…oject#897)
Angazenn added a commit to Angazenn/vllm-ascend that referenced this pull request on Oct 21, 2025
…quant (vllm-project#819)
Angazenn added a commit to Angazenn/vllm-ascend that referenced this pull request on Oct 21, 2025
…oject#897)
What this PR does / why we need it?
1. This PR introduces a native `all_to_all` communication operator to fix `allgather` bugs when `dp_size > 1`. Besides, it adds a naive implementation of forced expert load balancing during profile runs.
2. The operator `npu_dequant_swiglu_quant` only supports input hidden_states with dtype `torch.int32`. This tensor occupies space of `global_bs * seq_len * topk * hidden_size`, which can become very large as `ep_size` grows. Therefore we disable this operator and use the original `swiglu` && `quantize` instead.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
By performing offline inference:
