[Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight #4036
wangxiyuan merged 24 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for w8a8 static and dynamic quantization using the compressed tensors format on Ascend hardware. The changes include a new AscendCompressedTensorsConfig, corresponding quantization schemes, and integration into the vLLM-Ascend platform and worker.
The implementation looks good overall, but I've found a few issues:
- A critical bug in AscendCompressedTensorsConfig that could lead to a runtime crash due to a missing None check.
- Some robustness issues, such as an unsafe list removal and the use of assert for configuration validation, which could cause crashes.
- A performance issue in the w8a8 static quantization scheme where a transpose operation is inefficiently performed on every forward pass.
I've provided detailed comments and suggestions to address these points.
    if is_310p():
        # On 300I Duo platform, we need transpose again if
        # using nz. This transpose can be skipped in torchair.
        output = torch_npu.npu_quant_matmul(
            x,
            layer.weight.data.transpose(1, 0),
            layer.deq_scale,
            bias=bias,
            output_dtype=layer.params_dtype,
        )
The transpose operation on layer.weight.data is performed on every forward pass for the is_310p() case, which is inefficient. The transposed weight should be computed once and cached to improve performance. A good place for this one-time operation would be in process_weights_after_loading.
if is_310p():
# On 300I Duo platform, we need transpose again if
# using nz. This transpose can be skipped in torchair.
# The transpose is cached to avoid re-computation on every forward pass.
if not hasattr(layer, "_weight_transposed_for_310p"):
layer._weight_transposed_for_310p = layer.weight.data.transpose(1, 0).contiguous()
output = torch_npu.npu_quant_matmul(
x,
layer._weight_transposed_for_310p,
layer.deq_scale,
bias=bias,
output_dtype=layer.params_dtype,
        )
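As an alternative to the hasattr guard shown above, here is a minimal pure-Python sketch of moving the one-time transpose into process_weights_after_loading, as the review suggests. The class and helper names are illustrative stand-ins, not the actual vLLM-Ascend API, and plain lists of lists stand in for NPU tensors:

```python
def matmul(a, b):
    """Naive matmul of a [m x k] by b [k x n], as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

class QuantLinearSketch:
    """Illustrative stand-in for the quantized linear layer."""

    def __init__(self, weight):
        # weight is stored [out_features x in_features], as in the PR
        self.weight = weight
        self.weight_t = None  # cached transpose, filled once

    def process_weights_after_loading(self):
        # One-time transpose, analogous to caching
        # layer.weight.data.transpose(1, 0).contiguous() for the 310P case.
        self.weight_t = [list(col) for col in zip(*self.weight)]

    def forward(self, x):
        # No per-call transpose: reuse the cached weight_t.
        assert self.weight_t is not None, "weights not processed yet"
        return matmul(x, self.weight_t)
```

With this shape, the forward path stays branch-free and the cost of the transpose is paid once at load time rather than on every call.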
Thanks for this great work! Could you please add an e2e test of w8a8 static and dynamic quant? A unit test is also expected, but we could add unit tests in follow-up PRs. Are there any accuracy and performance metrics for your PR? Also cc @wangxiyuan @22dimensions
You can solve the DCO and lint issues by referring to the contributing doc at https://vllm-ascend.readthedocs.io/
Thanks for your reply. I'm currently running accuracy and performance tests. Once they're complete, I'll post them in the comment.
@MengqingCao @wangxiyuan Hi! The precision results are shown in the table below. With W8A8 static weights, all down_proj linear layers fall back to unquantized, while W8A8 dynamic weights are fully quantized. Qwen3-32B precision test
Signed-off-by: LHXuuu <scut_xlh@163.com>
…o reuse ModelSlim code for quantization Signed-off-by: LHXuuu <scut_xlh@163.com>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: chenxi-hh <chen464822955@163.com>
…cend into compressor_tensor
@MengqingCao @wangxiyuan Hello, this PR is ready to merge.
…c weight (vllm-project#4036)

### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format.
1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm.
2. Support CompressedTensorsW8A8 static weight.
   - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
3. Support CompressedTensorsW8A8Dynamic weight.
   - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
4. Modify the override_quantization_method in AscendQuantConfig.

- vLLM version: v0.11.2

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: taoqun110 <taoqun@huawei.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
                    "Falling back to UnquantizedLinearMethod")
                return None

        else:
Redundant else; you could remove it.
    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.int8, torch.float16, torch.bfloat16]
It would be better to return a tuple: (torch.int8, torch.float16, torch.bfloat16).
    @classmethod
    def get_config_filenames(cls) -> list[str]:
        return []
        # Only symmetric weight quantization supported.
        return is_8_bits and is_tensor and is_symmetric and is_static

    def _is_dynamic_token_w8a8(self, weight_quant: QuantizationArgs,
_is_dynamic_token_w8a8 and _is_static_tensor_w8a8 are duplicated; please extract a shared helper method.
What this PR does / why we need it?
To use quantized weights generated by the LLM Compressor quantization tool from the vLLM community, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format.
Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com
Does this PR introduce any user-facing change?
No
How was this patch tested?