[Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight #4036
wangxiyuan merged 24 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for w8a8 static and dynamic quantization using the compressed tensors format on Ascend hardware. The changes include a new AscendCompressedTensorsConfig, corresponding quantization schemes, and integration into the vLLM-Ascend platform and worker.
The implementation looks good overall, but I've found a few issues:
- A critical bug in AscendCompressedTensorsConfig that could lead to a runtime crash due to a missing None check.
- Some robustness issues, such as an unsafe list removal and the use of assert for configuration validation, which could cause crashes.
- A performance issue in the w8a8 static quantization scheme where a transpose operation is inefficiently performed on every forward pass.
I've provided detailed comments and suggestions to address these points.
    if is_310p():
        # On 300I Duo platform, we need transpose again if
        # using nz. This transpose can be skipped in torchair.
        output = torch_npu.npu_quant_matmul(
            x,
            layer.weight.data.transpose(1, 0),
            layer.deq_scale,
            bias=bias,
            output_dtype=layer.params_dtype,
        )
The transpose operation on layer.weight.data is performed on every forward pass for the is_310p() case, which is inefficient. The transposed weight should be computed once and cached to improve performance. A good place for this one-time operation would be in process_weights_after_loading.
if is_310p():
# On 300I Duo platform, we need transpose again if
# using nz. This transpose can be skipped in torchair.
# The transpose is cached to avoid re-computation on every forward pass.
if not hasattr(layer, "_weight_transposed_for_310p"):
layer._weight_transposed_for_310p = layer.weight.data.transpose(1, 0).contiguous()
output = torch_npu.npu_quant_matmul(
x,
layer._weight_transposed_for_310p,
layer.deq_scale,
bias=bias,
output_dtype=layer.params_dtype,
        )
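As an alternative to the hasattr guard shown above, here is a minimal pure-Python sketch of moving the one-time transpose into process_weights_after_loading, as the review suggests. The class and helper names are illustrative stand-ins, not the actual vLLM-Ascend API, and plain lists of lists stand in for NPU tensors:

```python
def matmul(a, b):
    """Naive matmul of a [m x k] by b [k x n], as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

class QuantLinearSketch:
    """Illustrative stand-in for the quantized linear layer."""

    def __init__(self, weight):
        # weight is stored [out_features x in_features], as in the PR
        self.weight = weight
        self.weight_t = None  # cached transpose, filled once

    def process_weights_after_loading(self):
        # One-time transpose, analogous to caching
        # layer.weight.data.transpose(1, 0).contiguous() for the 310P case.
        self.weight_t = [list(col) for col in zip(*self.weight)]

    def forward(self, x):
        # No per-call transpose: reuse the cached weight_t.
        assert self.weight_t is not None, "weights not processed yet"
        return matmul(x, self.weight_t)
```

With this shape, the forward path stays branch-free and the cost of the transpose is paid once at load time rather than on every call.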
Thanks for this great work! Could you please add an e2e test of w8a8 static and dynamic quant? A unit test is also expected, but we could add unit tests in follow-up PRs. Are there any accuracy and performance metrics for your PR? Also cc @wangxiyuan @22dimensions
You can solve the DCO and lint issues by referring to the contributing doc at https://vllm-ascend.readthedocs.io/
Thanks for your reply. I'm currently running accuracy and performance tests. Once they're complete, I'll post them in the comment.
@MengqingCao @wangxiyuan Hi! The precision results are shown in the table below. With W8A8 static weights, all down_proj linear layers fall back to unquantized, while W8A8 dynamic weights are fully quantized. Qwen3-32B precision test
Signed-off-by: LHXuuu <scut_xlh@163.com>
…o reuse ModelSlim code for quantization Signed-off-by: LHXuuu <scut_xlh@163.com>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: chenxi-hh <chen464822955@163.com>
…cend into compressor_tensor
@MengqingCao @wangxiyuan Hello, this PR is ready to merge.
…c weight (vllm-project#4036)

### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format.
1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm.
2. Support CompressedTensorsW8A8 static weight.
   - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
3. Support CompressedTensorsW8A8Dynamic weight.
   - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
4. Modify the override_quantization_method in AscendQuantConfig.

- vLLM version: v0.11.2

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: taoqun110 <taoqun@huawei.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
                    "Falling back to UnquantizedLinearMethod")
                return None

        else:
Redundant else; you could remove it.
    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.int8, torch.float16, torch.bfloat16]
It would be better to return a tuple: (torch.int8, torch.float16, torch.bfloat16).
    @classmethod
    def get_config_filenames(cls) -> list[str]:
        return []
        # Only symmetric weight quantization supported.
        return is_8_bits and is_tensor and is_symmetric and is_static

    def _is_dynamic_token_w8a8(self, weight_quant: QuantizationArgs,
_is_dynamic_token_w8a8 and _is_static_tensor_w8a8 are duplicated; please extract a shared helper method.
What this PR does / why we need it?
To use quantized weights generated by the LLM Compressor quantization tool from the vLLM community, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format.
Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com
Does this PR introduce any user-facing change?
No
How was this patch tested?