
[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight#5718

Merged
wangxiyuan merged 10 commits into vllm-project:main from LHXuuu:compressed_tensors_moe_w8a8_dynamic on Jan 14, 2026

Conversation

@LHXuuu
Contributor

@LHXuuu LHXuuu commented Jan 8, 2026

What this PR does / why we need it?

While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format. This PR:

  1. Supports W8A8 int8 dynamic weights for MoE models.
  2. Specifies the W4A16 quantization configuration.
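For readers unfamiliar with the scheme, a minimal sketch of what W8A8 int8 *dynamic* quantization means (illustrative only, not the actual vllm-ascend kernels): weights are quantized to int8 offline with a per-row scale, while each activation's scale is computed per token at runtime, and the int8 matmul result is dequantized with both scales.

```python
# Hypothetical sketch of W8A8 int8 dynamic quantization; all names here are
# illustrative, not the vllm-ascend implementation.

def quantize_int8(values):
    """Symmetric int8 quantization: returns (int8 values, scale)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

# Weight row is quantized once, offline ("W8").
w_row = [0.5, -1.0, 0.25, 0.75]
qw, w_scale = quantize_int8(w_row)

# Activation is quantized per token, at runtime ("A8", dynamic).
x = [1.2, -0.4, 0.9, 0.1]
qx, x_scale = quantize_int8(x)

# Int8 dot product accumulates exactly, then dequantizes with both scales.
acc = sum(a * b for a, b in zip(qw, qx))
y = acc * w_scale * x_scale

ref = sum(a * b for a, b in zip(w_row, x))
print(abs(y - ref) < 0.05)  # quantized result tracks the fp reference
```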

Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: LHXuuu <scut_xlh@163.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request extends quantization support in the vLLM Ascend engine for compressed tensors, specifically adding handling for Mixture-of-Experts (MoE) models with W8A8 int8 dynamic weight quantization and specifying W4A16 quantization. The changes involve refactoring how quantization schemes are identified and applied, with new logic for FusedMoE layers.

My review focuses on the correctness of these changes. I've identified a high-severity issue where the new logic for FusedMoE layers is hardcoded to only consider the first expert, which could lead to incorrect quantization for models with multiple experts. The rest of the changes, including refactoring and extending quantization checks, appear to be well-implemented.

Comment on lines +175 to +178
```python
unfused_names = [
    prefix + proj_name
    for proj_name in [".0.gate_proj", ".0.up_proj", ".0.down_proj"]
]
```
Contributor


Severity: high

The logic to determine the quantization scheme for FusedMoE layers is hardcoded to only check the projections of the first expert (expert 0). This is a significant limitation as noted in the TODO on line 179. It can lead to incorrect behavior for models with multiple experts, especially if they use different quantization schemes or if the naming convention for experts differs. The implementation should be generalized to iterate over all experts in the MoE layer to ensure consistent quantization is applied.
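One way the suggested generalization could look, as a sketch only (the function name, the `num_experts` parameter, and the projection list are illustrative, not the actual vllm-ascend code): build the unfused names for every expert index rather than hardcoding `.0.`.

```python
# Hypothetical generalization of the quoted snippet: enumerate the unfused
# projection names for all experts, not just expert 0.

def unfused_expert_names(prefix, num_experts):
    proj_names = ["gate_proj", "up_proj", "down_proj"]
    return [
        f"{prefix}.{expert_id}.{proj}"
        for expert_id in range(num_experts)
        for proj in proj_names
    ]

names = unfused_expert_names("model.layers.0.mlp.experts", 2)
print(names[0])  # model.layers.0.mlp.experts.0.gate_proj
print(names[3])  # model.layers.0.mlp.experts.1.gate_proj
```

The scheme lookup could then verify that all experts resolve to the same quantization scheme before applying it to the fused layer.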

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

LHXuuu and others added 2 commits January 8, 2026 15:57
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
@menogrey
Collaborator

menogrey commented Jan 13, 2026

gsm8k accuracy:

`ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds`

Qwen3-235B-A22B

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.44 |

llmcompressor Qwen3-235B-A22B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.29 |

ceval accuracy:

`ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds`

Qwen3-235B-A22B

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 90.86 |

llmcompressor Qwen3-235B-A22B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 90.27 |

mmlu accuracy:

`ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds`

Qwen3-235B-A22B

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 89.67 |

llmcompressor Qwen3-235B-A22B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 88.83 |

@kunpengW-code
Contributor

kunpengW-code commented Jan 13, 2026

mmlu accuracy:

`ais_bench --models vllm_api_general --datasets mmlu_gen --merge-ds --debug`

Qwen3-30B-A3B

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 79.58 |

llmcompressor Qwen3-30B-A3B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| mmludataset | - | accuracy | gen | 79.42 |

gsm8k accuracy:

`ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt --dump-eval-details --debug`

Qwen3-30B-A3B

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.44 |

llmcompressor Qwen3-30B-A3B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| gsm8kdataset | - | accuracy | gen | 96.21 |

ceval accuracy:

`ais_bench --models vllm_api_general --datasets ceval_gen --merge-ds --debug --dump-eval-details`

Qwen3-30B-A3B

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 81.58 |

llmcompressor Qwen3-30B-A3B w8a8 int8 dynamic

| dataset | version | metric | mode | vllm_api_general |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 81.50 |

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@menogrey menogrey added the `ready` (ready for review) and `ready-for-test` (start test by label for PR) labels Jan 13, 2026
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@menogrey menogrey force-pushed the compressed_tensors_moe_w8a8_dynamic branch from a818286 to abbec29 on January 13, 2026 06:31
Signed-off-by: menogrey <1299267905@qq.com>
@menogrey menogrey force-pushed the compressed_tensors_moe_w8a8_dynamic branch from abbec29 to 0eabb0b on January 13, 2026 06:32
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@wangxiyuan wangxiyuan merged commit 0415e69 into vllm-project:main Jan 14, 2026
17 checks passed
@LHXuuu LHXuuu deleted the compressed_tensors_moe_w8a8_dynamic branch January 14, 2026 02:25
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 14, 2026
…to eplb_refactor

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [CI] Fix lint CI (vllm-project#5880)
  [Feature] implement eagle spec decoding for model runner v2 (vllm-project#5840)
  [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (vllm-project#5718)
  [EPLB][Bugfix] Get expert map from layers (vllm-project#5817)
  [Bugfix] Fixed an accuracy problem of sp with eagle3 (vllm-project#5816)
  [P/D] bugfix for p node force free requset (vllm-project#5431)
  [Lint]Style: Convert `example` to `ruff format` (vllm-project#5863)
  [Main2Main] Upgrade vllm commit to 0109 (vllm-project#5752)
  [Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (vllm-project#5846)
  [Test][e2e][LoRA] Add more e2e tests to cover scenarios of LoRA (vllm-project#4075)
  [CustomOp][Perf] Merge Q/K split to simplify AscendApplyRotaryEmb for better performance (vllm-project#5799)
  [Lint]Style: Convert `root`, `benchmarks`, `tools` and `docs` to `ruff format` (vllm-project#5843)
  enable ep32 for dispatch_ffn_combine (vllm-project#5787)
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

`model-download`, `module:quantization`, `ready` (ready for review), `ready-for-test` (start test by label for PR)
