[main] Support `GroupedMatmulSwigluQuant` in `W8A8_DYNAMIC` quantized MoE layers by zhoux77899 · Pull Request #3 · zhoux77899/vllm-ascend

zhoux77899 · 2025-08-13T07:49:23Z

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

github-actions · 2025-08-13T07:49:32Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- A PR should do only one thing; smaller PRs enable faster reviews.
- Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
- Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

…m-project#2347)

### What this PR does / why we need it?
Add Docker export/import guide for air-gapped environments

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@d16aa3d

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

…lm-project#2319)

The attention mask was declared in `mla.py`. We don't need the splitfuse mask for MLA chunked prefill, and this mask causes memory problems with long contexts such as 64k or 128k.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@14a5d90

---------

Signed-off-by: haojiangzheng <justineric096@gmail.com>

… init (vllm-project#2348)

### What this PR does / why we need it?
Add the missing `apply_router_weight_on_input` in FusedMoE init.

Quick fix on vllm-project#2268 (comment)

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@6807af8

Signed-off-by: MengqingCao <cmq0113@163.com>

…project#1995)

### What this PR does / why we need it?
Refactor V1 Attention for better extensibility (prepared for torchair attention refactor).

**Main changes:**
- Move the different kinds of forward pass into their own methods, e.g., `_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`, `_forward_decode_only()`, `_forward_v1_style()`.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@14a5d90

Signed-off-by: shen-shanshan <467638484@qq.com>

…llm-project#2193)

### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@71683ca

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

github-actions · 2025-08-14T03:06:40Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

…oject#5978)

### What this PR does / why we need it?
**Scope of Changes**:

| File Path |
| :--- |
| `vllm_ascend/attention/mla_v1.py` |
| `vllm_ascend/attention/sfa_v1.py` |
| `vllm_ascend/core/recompute_scheduler.py` |
| `vllm_ascend/core/scheduler_dynamic_batch.py` |
| `vllm_ascend/distributed/device_communicators/npu_communicator.py` |
| `vllm_ascend/distributed/device_communicators/pyhccl.py` |
| `vllm_ascend/distributed/device_communicators/pyhccl_wrapper.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2c24bc6

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: Soren <user@SorendeMac-mini.local>

zhoux77899 and others added 20 commits August 8, 2025 11:51

feat(performance): support GroupedMatmulSwigluQuant in `W8A8_DYNAMIC` quantized MoE layers

a5aefdf

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

0f688cd

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(bug): fix bug

840c03f

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

feat(ops): enable grouped_matmul_swiglu_quant by default

cdf5e1e

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

c3c0913

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(test): fix broken test

f05687f

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

4f3afe6

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(test): temporally skip broken test due to oom

3b32dc8

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(test): change bias1 to tensor

a3c9b44

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

Merge branch 'main' into main_gmmswigluquant

67e9872

fix(bug): update group_list handling and weight scale in dynamic methods

68e31db

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

a3715ec

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

58d6371

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

feat(ops): replace all splited gmm and swiglu

a46315d

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

Merge branch 'main_gmmswigluquant' of https://github.com/zhoux77899/vllm-ascend into main_gmmswigluquant

5ee5a83

fix(lint): fix lint

0ea5246

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

feat(quantization): split w4a8 and w8a8 apply

d9b16fc

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(test): replace w8a8 function in apply

9ade98e

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

feat(cumsum): add cumsum_group_list function for group list processing

6af87be

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

aed8264

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
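
Commit 6af87be above adds a `cumsum_group_list` function, but its body is not shown on this page. As a rough illustration only: grouped matmul kernels commonly accept the group list either as per-expert token counts or as cumulative end offsets, so a helper is often needed to convert between the two. The sketch below assumes that is the kind of processing involved. The names `counts_to_cumsum` and `cumsum_to_counts` are hypothetical and are not taken from the PR.

```python
from itertools import accumulate
from typing import List


def counts_to_cumsum(counts: List[int]) -> List[int]:
    """Convert per-expert token counts into cumulative end offsets."""
    return list(accumulate(counts))


def cumsum_to_counts(cumsum: List[int]) -> List[int]:
    """Convert cumulative end offsets back into per-expert token counts."""
    previous = [0] + cumsum[:-1]
    return [end - start for start, end in zip(previous, cumsum)]


if __name__ == "__main__":
    tokens_per_expert = [2, 0, 3]  # hypothetical routing result for 3 experts
    offsets = counts_to_cumsum(tokens_per_expert)
    print(offsets)
    print(cumsum_to_counts(offsets))
```

Keeping the two conversions as a matched pair makes it easy to check that whichever form a kernel expects can be round-tripped without losing empty groups.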

zhoux77899 and others added 9 commits August 13, 2025 15:59

fix(lint): fix lint

c40f61d

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

feat(torchair): consider not using gmmswigluquant when torchair enabled

c437d39

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

3309579

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

Merge branch 'main' into main_gmmswigluquant

04523e6

github-actions bot added the merge-conflicts label Aug 14, 2025

zhoux77899 and others added 3 commits August 14, 2025 11:23

fix(dtype): unify w1_scale dtype

3d2b849

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

fix(lint): fix lint

51ec3d8

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

Merge branch 'main_qwen3_moe_optim' into main_gmmswigluquant

af72a56

github-actions bot removed the merge-conflicts label Aug 14, 2025

fix(lint): fix lint

7e83993

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

zhoux77899 merged commit 68dc825 into main_qwen3_moe_optim Aug 14, 2025
3 of 4 checks passed
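
The PR description is empty, so the following is only a conceptual sketch of what an operator named `GroupedMatmulSwigluQuant` suggests: a per-expert grouped matmul, a SwiGLU activation, and a per-token dynamic int8 quantization computed in one step instead of three. It is a pure-Python reference, not the NPU kernel and not its signature. The function name `grouped_matmul_swiglu_quant_ref`, the gate-first ordering of the two halves, the float inputs (a real `W8A8_DYNAMIC` path would take int8 tensors plus scales), and the symmetric scaling to 127 are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matmul_row(row: Sequence[float], weight: Matrix) -> List[float]:
    """Multiply one token row [k] by a weight matrix [k][n], giving [n]."""
    return [sum(r * w for r, w in zip(row, col)) for col in zip(*weight)]


def swiglu(row: Sequence[float]) -> List[float]:
    """Assumed layout: first half is the gate, second half is the up projection."""
    half = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:half], row[half:])]


def quantize_row(row: Sequence[float]) -> Tuple[List[int], float]:
    """Per-token symmetric int8 quantization with a dynamically computed scale."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in row]
    return quantized, scale


def grouped_matmul_swiglu_quant_ref(
    x: Matrix, weights: List[Matrix], group_sizes: List[int]
) -> Tuple[List[List[int]], List[float]]:
    """Tokens in x are assumed sorted by expert; group_sizes[i] rows use weights[i]."""
    assert sum(group_sizes) == len(x), "group sizes must cover every token"
    out_q: List[List[int]] = []
    out_scale: List[float] = []
    start = 0
    for expert, size in enumerate(group_sizes):
        for row in x[start:start + size]:
            hidden = matmul_row(row, weights[expert])  # grouped matmul
            q, s = quantize_row(swiglu(hidden))        # activation + dynamic quant
            out_q.append(q)
            out_scale.append(s)
        start += size
    return out_q, out_scale
```

A reference like this is mainly useful for checking a fused kernel's output against the unfused sequence on small inputs, within a quantization tolerance.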

zhoux77899 merged 33 commits into main_qwen3_moe_optim from main_gmmswigluquant

5 participants
