
[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight#5718

Merged
wangxiyuan merged 10 commits into vllm-project:main from LHXuuu:compressed_tensors_moe_w8a8_dynamic on Jan 14, 2026

Conversation

@LHXuuu
Contributor

@LHXuuu LHXuuu commented Jan 8, 2026

What this PR does / why we need it?

While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format. This PR:

  1. Supports W8A8 int8 dynamic weights for MoE models.
  2. Specifies the W4A16 quantization configuration.
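For readers unfamiliar with the scheme, a minimal sketch of what W8A8 int8 *dynamic* quantization means (illustrative only, not the actual vllm-ascend kernels): weights are quantized to int8 offline with a per-row scale, while each activation's scale is computed per token at runtime, and the int8 matmul result is dequantized with both scales.

```python
# Hypothetical sketch of W8A8 int8 dynamic quantization; all names here are
# illustrative, not the vllm-ascend implementation.

def quantize_int8(values):
    """Symmetric int8 quantization: returns (int8 values, scale)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

# Weight row is quantized once, offline ("W8").
w_row = [0.5, -1.0, 0.25, 0.75]
qw, w_scale = quantize_int8(w_row)

# Activation is quantized per token, at runtime ("A8", dynamic).
x = [1.2, -0.4, 0.9, 0.1]
qx, x_scale = quantize_int8(x)

# Int8 dot product accumulates exactly, then dequantizes with both scales.
acc = sum(a * b for a, b in zip(qw, qx))
y = acc * w_scale * x_scale

ref = sum(a * b for a, b in zip(w_row, x))
print(abs(y - ref) < 0.05)  # quantized result tracks the fp reference
```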

Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: LHXuuu <scut_xlh@163.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request extends quantization support in the vLLM Ascend engine for compressed tensors, specifically adding handling for Mixture-of-Experts (MoE) models with W8A8 int8 dynamic weight quantization and specifying W4A16 quantization. The changes involve refactoring how quantization schemes are identified and applied, with new logic for FusedMoE layers.

My review focuses on the correctness of these changes. I've identified a high-severity issue where the new logic for FusedMoE layers is hardcoded to only consider the first expert, which could lead to incorrect quantization for models with multiple experts. The rest of the changes, including refactoring and extending quantization checks, appear to be well-implemented.

Comment on lines +175 to +178
```python
unfused_names = [
    prefix + proj_name
    for proj_name in [".0.gate_proj", ".0.up_proj", ".0.down_proj"]
]
```
Contributor


Severity: high

The logic to determine the quantization scheme for FusedMoE layers is hardcoded to only check the projections of the first expert (expert 0). This is a significant limitation as noted in the TODO on line 179. It can lead to incorrect behavior for models with multiple experts, especially if they use different quantization schemes or if the naming convention for experts differs. The implementation should be generalized to iterate over all experts in the MoE layer to ensure consistent quantization is applied.
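One way the suggested generalization could look, as a sketch only (the function name, the `num_experts` parameter, and the projection list are illustrative, not the actual vllm-ascend code): build the unfused names for every expert index rather than hardcoding `.0.`.

```python
# Hypothetical generalization of the quoted snippet: enumerate the unfused
# projection names for all experts, not just expert 0.

def unfused_expert_names(prefix, num_experts):
    proj_names = ["gate_proj", "up_proj", "down_proj"]
    return [
        f"{prefix}.{expert_id}.{proj}"
        for expert_id in range(num_experts)
        for proj in proj_names
    ]

names = unfused_expert_names("model.layers.0.mlp.experts", 2)
print(names[0])  # model.layers.0.mlp.experts.0.gate_proj
print(names[3])  # model.layers.0.mlp.experts.1.gate_proj
```

The scheme lookup could then verify that all experts resolve to the same quantization scheme before applying it to the fused layer.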

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

LHXuuu and others added 2 commits January 8, 2026 15:57
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
@menogrey
Collaborator

menogrey commented Jan 13, 2026

gsm8k accuracy:

`ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds`

Qwen3-235B-A22B

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.44 |

llmcompressor Qwen3-235B-A22B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.29 |

ceval accuracy:

`ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds`

Qwen3-235B-A22B

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 90.86 |

llmcompressor Qwen3-235B-A22B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 90.27 |

mmlu accuracy:

`ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds`

Qwen3-235B-A22B

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 89.67 |

llmcompressor Qwen3-235B-A22B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 88.83 |

@kunpengW-code
Contributor

kunpengW-code commented Jan 13, 2026

mmlu accuracy:

`ais_bench --models vllm_api_general --datasets mmlu_gen --merge-ds --debug`

Qwen3-30B-A3B

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 79.58 |

llmcompressor Qwen3-30B-A3B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 79.42 |

gsm8k accuracy:

`ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt --dump-eval-details --debug`

Qwen3-30B-A3B

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.44 |

llmcompressor Qwen3-30B-A3B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.21 |

ceval accuracy:

`ais_bench --models vllm_api_general --datasets ceval_gen --merge-ds --debug --dump-eval-details`

Qwen3-30B-A3B

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 81.58 |

llmcompressor Qwen3-30B-A3B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 81.50 |

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@menogrey menogrey added the `ready` (ready for review) and `ready-for-test` (start test by label for PR) labels Jan 13, 2026
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@menogrey menogrey force-pushed the compressed_tensors_moe_w8a8_dynamic branch from a818286 to abbec29 on January 13, 2026 06:31
Signed-off-by: menogrey <1299267905@qq.com>
@menogrey menogrey force-pushed the compressed_tensors_moe_w8a8_dynamic branch from abbec29 to 0eabb0b on January 13, 2026 06:32
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@wangxiyuan wangxiyuan merged commit 0415e69 into vllm-project:main Jan 14, 2026
17 checks passed
@LHXuuu LHXuuu deleted the compressed_tensors_moe_w8a8_dynamic branch January 14, 2026 02:25
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 14, 2026
…to eplb_refactor

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [CI] Fix lint CI (vllm-project#5880)
  [Feature] implement eagle spec decoding for model runner v2 (vllm-project#5840)
  [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (vllm-project#5718)
  [EPLB][Bugfix] Get expert map from layers (vllm-project#5817)
  [Bugfix] Fixed an accuracy problem of sp with eagle3 (vllm-project#5816)
  [P/D] bugfix for p node force free requset (vllm-project#5431)
  [Lint]Style: Convert `example` to `ruff format` (vllm-project#5863)
  [Main2Main] Upgrade vllm commit to 0109 (vllm-project#5752)
  [Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (vllm-project#5846)
  [Test][e2e][LoRA] Add more e2e tests to cover scenarios of LoRA (vllm-project#4075)
  [CustomOp][Perf] Merge Q/K split to simplify AscendApplyRotaryEmb for better performance (vllm-project#5799)
  [Lint]Style: Convert `root`, `benchmarks`, `tools` and `docs` to `ruff format` (vllm-project#5843)
  enable ep32 for dispatch_ffn_combine (vllm-project#5787)
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

`model-download`, `module:quantization`, `ready` (ready for review), `ready-for-test` (start test by label for PR)
