[Quantization][Feature] Add AWQ quantization in vllm-ascend #4316

menogrey wants to merge 15 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for AWQ quantization in vllm-ascend. The changes are well-structured, adding the necessary configurations and Ascend-specific implementations for AWQ, including linear and MoE layers. A key improvement is the added robustness in AscendRMSNorm to handle different quantization configurations without crashing. My review has identified a couple of redundant function calls within the new npu_fused_experts function that should be removed to improve performance.
```diff
  # gmm1: gate_up_proj
- hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
+ if not use_wna16:
+     hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
  hidden_states = torch_npu.npu_swiglu(hidden_states)
- hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
+ if not use_wna16:
+     hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from d1be882 to 9fe852c
@paulyu12 this PR implements AWQ quantization and is now under testing; just pinging you to take a look.

Validation for this change is tracked in issue #4378.
Force-pushed from 4e14b12 to 4faf918
Force-pushed from 44fdc96 to 5953663
Please rebase to main now.
Force-pushed from 7251f58 to ce953d2
```python
def __init__(self, quant_config: AWQQuantConfig):
    self.quant_config = quant_config

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
```
This is a solid solution for unifying weight formats.
By preprocessing the weights during the loading phase, you effectively resolve the packing layout discrepancy between GPU and NPU, which is critical for compatibility.
Optimization Suggestions:
However, for larger models, the current for loop might introduce noticeable latency due to repeated kernel launches. Additionally, bitwise shift operations (>>, <<) can be less efficient than arithmetic operations on NPU hardware.
I suggest refactoring this implementation to vectorize the logic (eliminating the loop) and replacing the bitwise shifts with equivalent multiplication and integer division. This should significantly improve loading performance.
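To make the suggestion concrete, here is a pure-Python sketch assuming eight unsigned 4-bit values packed into one 32-bit word (the real AWQ GPU packing additionally interleaves the nibble order, which is omitted here for clarity):

```python
# Pure-Python sketch of int4 weight unpacking. Assumes eight unsigned
# 4-bit values per 32-bit word; real AWQ packing also interleaves the
# nibble order, omitted here.

def unpack_bitwise(word, i):
    """Extract nibble i using shifts, as in the current loop."""
    return (word >> (4 * i)) & 0xF

def unpack_arith(word, i):
    """Arithmetic equivalent: for non-negative x, x >> n == x // 2**n
    and x & 0xF == x % 16, as suggested in the review."""
    return (word // 16 ** i) % 16

nibbles = [3, 7, 1, 15, 0, 9, 4, 2]
packed = 0
for i, v in enumerate(nibbles):  # pack eight nibbles into one word
    packed |= v << (4 * i)

assert [unpack_bitwise(packed, i) for i in range(8)] == nibbles
assert all(unpack_bitwise(packed, i) == unpack_arith(packed, i) for i in range(8))
```

The two extraction forms are exactly equivalent for non-negative inputs, so the choice between them is purely a question of which the NPU executes faster.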
Thanks for your suggestions. I tried to remove the for loop, but an OOM occurred. The fact is that executing the unpack operation requires allocating a large tensor (8x the memory of the packed tensor), plus more temporary variables if you avoid in-place operations (like the bitwise ones). It is a trade-off between memory usage and execution efficiency, but for quantization, memory usage matters more, so I won't change it unless you have a better suggestion. @SlightwindSec
Fair point. I agree that avoiding OOM is the top priority here.
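One possible middle ground for this trade-off is chunked unpacking: process the packed tensor in fixed-size blocks so the temporary buffer stays bounded instead of materializing the full 8x-larger unpacked tensor at once. The sketch below is pure Python with illustrative names, not the PR's actual code.

```python
# Pure-Python sketch of chunked unpacking: bound peak memory by
# unpacking fixed-size blocks of packed 32-bit words instead of the
# whole tensor at once. All names are illustrative.

def unpack_words_chunked(words, chunk=1024):
    """Yield unpacked 4-bit values block by block, capping peak memory."""
    for start in range(0, len(words), chunk):
        out = []
        for w in words[start:start + chunk]:  # small temporary per block
            out.extend((w >> (4 * i)) & 0xF for i in range(8))
        yield out

chunks = list(unpack_words_chunked([0x12345678] * 10, chunk=4))
# 10 words × 8 nibbles each → 80 values across 3 chunks
```

Peak extra memory is proportional to `chunk`, not to the full tensor, at the cost of some kernel-launch overhead per block.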
Force-pushed from fa8b76b to f8e7ea5
Signed-off-by: menogrey <1299267905@qq.com>
DeepSeek-V3.1-AWQ. Signed-off-by: menogrey <1299267905@qq.com>
Force-pushed from ccd956d to 17067ab
Any progress on this?

@ZhongsJie I will resume work on this PR following the completion of the quantization module refactoring.
@menogrey I'm reworking this part in #7672; if you have time, could you take a look?
That's great! In addition, there are two reasons this work is pending:
What this PR does / why we need it?
Add AWQ quantization in vllm-ascend. Most of the code refers to the existing implementation, and the new quantization adaptation follows the compressed-tensors PR: #4036.
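For readers unfamiliar with AWQ: it stores 4-bit weights together with per-group scales and zero points, and dequantization is roughly `w[i] = (q[i] - zero[g]) * scale[g]` with `g = i // group_size`. A minimal sketch follows; the function name, field layout, and group size are assumptions for illustration, not vllm-ascend's actual code.

```python
# Illustrative sketch of AWQ-style group-wise dequantization; names and
# group size are assumptions, not vllm-ascend's actual code.
#   w[i] = (q[i] - zero[g]) * scale[g], where g = i // group_size

def dequant_group(q, scales, zeros, group_size):
    return [(qi - zeros[i // group_size]) * scales[i // group_size]
            for i, qi in enumerate(q)]

w = dequant_group(q=[8, 10, 7, 9], scales=[0.5, 0.25], zeros=[8, 8],
                  group_size=2)
# → [0.0, 1.0, -0.25, 0.25]
```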
Known issue:
Does this PR introduce any user-facing change?
How was this patch tested?