[Feat] Support native Kimi-K2-Thinking W4A16 quantized experts weights #4516
wangxiyuan merged 23 commits into vllm-project:main from
Conversation
…mi-K2-Thinking quantized experts weights Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for W4A16 quantization for MoE layers, specifically for Kimi-K2 models. The changes include a new quantization method AscendW4A16FusedMoEMethod, modifications to the MoE MLP logic to handle the new format, and updates to configuration files. Additionally, a bug fix in the rotary embedding implementation is included, which prevents a potential crash. The implementation for W4A16 seems consistent with existing quantization methods for Ascend NPUs. The bug fix is a welcome improvement to robustness.
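To make the W4A16 format concrete, here is a hedged pure-Python sketch of the underlying idea: weights are stored as packed unsigned int4 pairs and dequantized at matmul time, while activations stay in 16-bit. This is only an illustration under stated assumptions; the actual PR uses NPU kernels, and the scale/offset naming only loosely mirrors the PR's `w1_offset`/`w2_offset` parameters.

```python
def unpack_int4(packed: list[int]) -> list[int]:
    """Each byte holds two unsigned 4-bit values (low nibble first).

    The low-nibble-first layout is an assumption for illustration.
    """
    out = []
    for byte in packed:
        out.append(byte & 0x0F)          # low nibble
        out.append((byte >> 4) & 0x0F)   # high nibble
    return out


def dequantize_w4a16(packed: list[int], scale: float, offset: float) -> list[float]:
    # value = (q - offset) * scale; real kernels apply this per quantization
    # group rather than with a single scalar scale/offset pair.
    return [(q - offset) * scale for q in unpack_int4(packed)]
```

For example, the packed byte `0x21` unpacks to the two int4 values 1 and 2 before the scale and offset are applied.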
if hasattr(self, "cos") and hasattr(self, "sin") and \
        self.cos is not None and self.sin is not None:
This change correctly prevents a potential AttributeError. In the previous implementation, if _rope_forward_oot was called from AscendRotaryEmbedding.forward_oot with is_first_layer set to False on its first execution, self.cos and self.sin would not yet have been initialized, leading to a crash. The addition of hasattr checks ensures the attributes exist before they are accessed, making the code more robust.
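The failure mode can be reproduced with a minimal stand-in class. This sketch is not the actual AscendRotaryEmbedding implementation; it only mirrors the lazily-built cos/sin caches and the guard from the diff above.

```python
import math


class RotaryEmbeddingSketch:
    """Simplified stand-in for a rotary embedding with lazy cos/sin caches."""

    def _build_cache(self, positions):
        # In the real code these caches are derived from inverse frequencies.
        self.cos = [math.cos(p) for p in positions]
        self.sin = [math.sin(p) for p in positions]

    def forward(self, positions, is_first_layer):
        if is_first_layer:
            self._build_cache(positions)
        # The guard from the PR: check existence before dereferencing, since
        # a non-first-layer call may arrive before the cache is ever built.
        # Without hasattr, that call would raise AttributeError.
        if hasattr(self, "cos") and hasattr(self, "sin") and \
                self.cos is not None and self.sin is not None:
            return True
        return False
```

Calling forward with is_first_layer=False before any first-layer call now falls through safely instead of crashing.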
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
…ze` attr Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
def _is_w4a16(self, weight_quant: QuantizationArgs) -> bool:
    is_4_bits = weight_quant.num_bits == 4
    return is_4_bits
The W4A16 detection is incomplete; checking num_bits == 4 alone is not sufficient.
- Verify the weight QuantizationArgs strategy.
- Confirm that the activation QuantizationArgs is empty.
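A hedged sketch of what the reviewer is asking for: besides num_bits == 4, also check the weight strategy and that no activation quantization is configured. QuantizationArgs below is a minimal stand-in, not the actual compressed-tensors class, and the "group" strategy check is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizationArgs:
    """Minimal stand-in mirroring the fields used in the check."""
    num_bits: int
    strategy: str = "group"


def is_w4a16(weight_quant: Optional[QuantizationArgs],
             input_quant: Optional[QuantizationArgs]) -> bool:
    # W4A16 means 4-bit (group-quantized) weights with *no* activation
    # quantization, so both conditions must be verified, not just num_bits.
    if weight_quant is None or input_quant is not None:
        return False
    return weight_quant.num_bits == 4 and weight_quant.strategy == "group"
```

This rules out, for example, a W4A8 config (4-bit weights plus 8-bit activations) that the num_bits-only check would wrongly accept.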
if isinstance(layer, FusedMoE):
    layer.ascend_quant_method = COMPRESSED_TENSORS_METHOD
    # collect schemes
    quant_scheme = self.get_scheme(layer=layer, layer_name=prefix)
The target_scheme_map only contains the "Linear" key; how can a scheme specific to FusedMoE be obtained?
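The concern can be illustrated with a hedged sketch of the lookup: if the scheme map only has a "Linear" entry, a FusedMoE layer can only ever be matched by falling back to that generic entry. The function and map below are illustrative stand-ins, not the actual vLLM code.

```python
def get_scheme_sketch(target_scheme_map: dict, layer_type: str):
    """Resolve a quantization scheme for a layer type (illustrative only)."""
    # A dedicated entry (e.g. "FusedMoE") would be matched first...
    if layer_type in target_scheme_map:
        return target_scheme_map[layer_type]
    # ...but with only a "Linear" key present, every other layer type is
    # forced onto the generic Linear scheme, which is the reviewer's point.
    return target_scheme_map.get("Linear")


# A map containing only "Linear", as described in the review comment.
scheme_map = {"Linear": {"weights": "w4a16"}}
```

With this map, looking up "FusedMoE" can only return the "Linear" scheme, so any MoE-specific configuration would be lost.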
What this PR does / why we need it?
Adds W4A16 quantization method for the Kimi-K2-Thinking model and updates relevant modules to support the new quantization method.
- Adds use_int4_w4a16, w1_offset and w2_offset, and adjusts the with_quant conditional logic to support W4A16 matrix multiplication.
- Adds packed_modules_model_mapping for the Kimi-K2-Thinking model and processing logic for the weight_packed field.
Does this PR introduce any user-facing change?
None.
How was this patch tested?