[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight#5718
Signed-off-by: LHXuuu <scut_xlh@163.com>
Code Review
This pull request extends quantization support in the vLLM Ascend engine for the compressed-tensors format, adding handling for Mixture-of-Experts (MoE) models with W8A8 int8 dynamic weight quantization and an explicit W4A16 quantization configuration. The changes refactor how quantization schemes are identified and applied, with new logic for FusedMoE layers.
My review focuses on the correctness of these changes. I identified a high-severity issue: the new logic for FusedMoE layers is hardcoded to consider only the first expert, which can lead to incorrect quantization for models with multiple experts. The remaining changes, including the refactoring and the extended quantization checks, appear well implemented.
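For context, "W8A8 int8 dynamic" means both weights and activations are int8, with activation scales computed at runtime (per token) rather than calibrated offline. A minimal NumPy sketch of dynamic per-token symmetric int8 quantization, purely illustrative and not the vllm-ascend implementation:

```python
import numpy as np

def quantize_per_token_int8(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-token int8 quantization with runtime (dynamic) scales."""
    # One scale per token (row), derived from that token's max magnitude.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original activations."""
    return q.astype(np.float32) * scale
```

Because the scales are derived from each incoming batch, no calibration dataset is needed, at the cost of computing the per-token max at inference time.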
```python
unfused_names = [
    prefix + proj_name
    for proj_name in [".0.gate_proj", ".0.up_proj", ".0.down_proj"]
]
```
The logic to determine the quantization scheme for FusedMoE layers is hardcoded to only check the projections of the first expert (expert 0). This is a significant limitation as noted in the TODO on line 179. It can lead to incorrect behavior for models with multiple experts, especially if they use different quantization schemes or if the naming convention for experts differs. The implementation should be generalized to iterate over all experts in the MoE layer to ensure consistent quantization is applied.
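A sketch of the suggested generalization, iterating over every expert rather than only expert 0. The names `num_experts` and `get_scheme_for` are illustrative assumptions, not the actual vllm-ascend API:

```python
def moe_unfused_names(prefix: str, num_experts: int) -> list[str]:
    """Build the unfused projection names for every expert, not just expert 0."""
    projections = ("gate_proj", "up_proj", "down_proj")
    return [
        f"{prefix}.{expert_id}.{proj}"
        for expert_id in range(num_experts)
        for proj in projections
    ]

def resolve_moe_scheme(prefix: str, num_experts: int, get_scheme_for):
    """Resolve one quantization scheme for a FusedMoE layer, verifying that
    all experts agree instead of silently trusting expert 0."""
    schemes = {get_scheme_for(name) for name in moe_unfused_names(prefix, num_experts)}
    if len(schemes) != 1:
        raise ValueError(
            f"Inconsistent quantization schemes across experts of {prefix}: {schemes}"
        )
    return schemes.pop()
```

Checking every expert also surfaces mixed-scheme checkpoints early with a clear error instead of silently applying expert 0's scheme to the whole layer.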
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Accuracy results (screenshots omitted), all with llmcompressor w8a8 int8 dynamic weights:
- gsm8k accuracy: Qwen3-235B-A22B
- ceval accuracy: Qwen3-235B-A22B
- mmlu accuracy: Qwen3-235B-A22B
- mmlu accuracy: Qwen3-30B-A3B
- gsm8k accuracy: Qwen3-30B-A3B
- ceval accuracy: Qwen3-30B-A3B
…to eplb_refactor * 'main' of https://github.com/vllm-project/vllm-ascend:
- [CI] Fix lint CI (vllm-project#5880)
- [Feature] implement eagle spec decoding for model runner v2 (vllm-project#5840)
- [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (vllm-project#5718)
- [EPLB][Bugfix] Get expert map from layers (vllm-project#5817)
- [Bugfix] Fixed an accuracy problem of sp with eagle3 (vllm-project#5816)
- [P/D] bugfix for p node force free requset (vllm-project#5431)
- [Lint]Style: Convert `example` to `ruff format` (vllm-project#5863)
- [Main2Main] Upgrade vllm commit to 0109 (vllm-project#5752)
- [Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (vllm-project#5846)
- [Test][e2e][LoRA] Add more e2e tests to cover scenarios of LoRA (vllm-project#4075)
- [CustomOp][Perf] Merge Q/K split to simplify AscendApplyRotaryEmb for better performance (vllm-project#5799)
- [Lint]Style: Convert `root`, `benchmarks`, `tools` and `docs` to `ruff format` (vllm-project#5843)
- enable ep32 for dispatch_ffn_combine (vllm-project#5787)
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format.
1. Support MoE model W8A8 int8 dynamic weight.
2. Specify the W4A16 quantization configuration.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
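The adaptation described above amounts to mapping a compressed-tensors weight/activation configuration onto a concrete scheme. A hypothetical sketch of such dispatch logic; the field names, class, and scheme strings are assumptions for illustration, not the actual vllm-ascend code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantArgs:
    """Simplified stand-in for compressed-tensors quantization arguments."""
    num_bits: int
    dynamic: bool = False  # dynamic = activation scales computed at runtime

def select_scheme(weights: QuantArgs, activations: Optional[QuantArgs]) -> str:
    """Pick a quantization scheme from compressed-tensors style arguments."""
    if activations is None and weights.num_bits == 4:
        return "W4A16"  # weight-only 4-bit; activations stay in fp16
    if (activations is not None and weights.num_bits == 8
            and activations.num_bits == 8 and activations.dynamic):
        return "W8A8_INT8_DYNAMIC"  # the MoE path this PR adds
    raise NotImplementedError("unsupported compressed-tensors scheme")
```

The key design point is that W4A16 is recognized by the absence of activation quantization, while the W8A8 int8 dynamic path additionally requires the runtime-scales flag.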
How was this patch tested?
gsm8k, ceval, and mmlu accuracy evaluations of llmcompressor w8a8 int8 dynamic weights on Qwen3-235B-A22B and Qwen3-30B-A3B; see the results above.