[Quantization] Add field to skip unquantized modules for GPTQ config #25455
Isotr0py merged 9 commits into vllm-project:main
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request introduces a more robust and cleaner way to handle unquantized modules in GPTQ models by adding the modules_in_block_to_quantize field to the configuration. This replaces the previous hacky approach and resolves existing issues, which is a great improvement for maintainability. The implementation correctly infers the modules to quantize from the model's checkpoint metadata when not explicitly provided. I have one suggestion regarding code duplication between GPTQConfig and GPTQMarlinConfig that would further enhance the maintainability of this new logic.
```python
def apply_vllm_mapper(self, hf_to_vllm_mapper):
    if self.modules_in_block_to_quantize is not None:
        self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
            self.modules_in_block_to_quantize)

def maybe_update_config(self,
                        model_name: str,
                        revision: Optional[str] = None):
    if self.modules_in_block_to_quantize:
        return

    unquant_dtypes = [torch.float16, torch.bfloat16, torch.float32]
    metadata = get_safetensors_params_metadata(model_name,
                                               revision=revision)
    quant_layers: set[str] = {
        param_name.rsplit(".", 1)[0]
        for param_name, info in metadata.items()
        if (dtype := info.get('dtype', None))
        and _SAFETENSORS_TO_TORCH_DTYPE[dtype] not in unquant_dtypes
    }
    self.modules_in_block_to_quantize = list(quant_layers)
```
The methods apply_vllm_mapper and maybe_update_config are identical to the ones in vllm/model_executor/layers/quantization/gptq.py. This code duplication can lead to maintenance issues, where a change in one file might not be propagated to the other, potentially causing bugs.
To improve maintainability, I recommend refactoring this shared logic into a common base class or a mixin. For example, you could create a GPTQConfigMixin class to house these methods, and then have both GPTQConfig and GPTQMarlinConfig inherit from it. This would ensure that the logic for handling modules_in_block_to_quantize is defined in a single place.
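A minimal sketch of what that refactor could look like. The mixin name and the inheritance layout follow the reviewer's suggestion; the stand-in base class and the docstrings are illustrative, not the actual vLLM code:

```python
from typing import Optional


class QuantizationConfig:  # stand-in for vLLM's real quantization base class
    pass


class GPTQConfigMixin:
    """Shared handling of `modules_in_block_to_quantize` for GPTQ configs."""

    modules_in_block_to_quantize: Optional[list[str]] = None

    def apply_vllm_mapper(self, hf_to_vllm_mapper):
        # Remap HF module names to their vLLM counterparts.
        if self.modules_in_block_to_quantize is not None:
            self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
                self.modules_in_block_to_quantize)

    def maybe_update_config(self,
                            model_name: str,
                            revision: Optional[str] = None) -> None:
        # Infer the quantized modules from the checkpoint's safetensors
        # metadata when the field was not set explicitly (see the snippet
        # above for the actual dtype-based inference).
        if self.modules_in_block_to_quantize:
            return
        ...


class GPTQConfig(GPTQConfigMixin, QuantizationConfig):
    """GPTQ config; inherits the shared logic instead of duplicating it."""


class GPTQMarlinConfig(GPTQConfigMixin, QuantizationConfig):
    """GPTQ-Marlin config; same shared logic via the mixin."""
```

With this layout, a fix to the inference logic lands in one place and both configs pick it up automatically.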
Can you check and clean up other models that contain `_maybe_ignore_quant_config`?
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Removed `_maybe_ignore_quant_config` from the other models.
jeejeelee left a comment
Thank you for the improvement. Perhaps we can apply this logic to other quantization methods.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Hi @Isotr0py, this PR breaks model loading for jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ. The model was generated with AutoRound-GPTQ, like the previous ones.
Oops, sorry for breaking this. Let me fix it. 😅
@Isotr0py Don't worry, that's development.
@JartX I double checked the checkpoint's `quantization_config`. The `modules_in_block_to_quantize` list is missing the `self_attn` projection layers:

```
"modules_in_block_to_quantize": [
    [
+++     "self_attn.q_proj",
+++     "self_attn.k_proj",
+++     "self_attn.v_proj",
+++     "self_attn.o_proj",
        "linear_attn.in_proj_qkvz",
        "linear_attn.in_proj_ba",
        "linear_attn.out_proj",
        ...
```

With the above changes in the checkpoint's config, the model loads and generates correctly.
@Isotr0py You're right, it works, and Autoround doesn't declare them. Does that mean it used to work "by chance"? I'm going to open an issue in Autoround.
That's right. The `modules_in_block_to_quantize` field was simply ignored before this PR, so the incomplete list in the checkpoint never caused an issue.
Purpose
- `Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4` and `Qwen/Qwen3-30B-A3B-GPTQ-Int4` have unquantized layers, but there is no config to skip creating unquantized weights, and we're using `_maybe_ignore_quant_config` to patch model implementations in a hacky way.
- Add `modules_in_block_to_quantize` from Transformers `GPTQConfig` (https://huggingface.co/docs/transformers/v4.56.2/en/main_classes/quantization#transformers.GPTQConfig), and automatically infer it from checkpoint metadata, so that we no longer need hacky patching for GPTQ models.

Test Plan
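A minimal sketch of how the change could be exercised with vLLM's offline API (model name taken from the description above; the prompt and sampling settings are arbitrary):

```python
from vllm import LLM, SamplingParams

# Load a GPTQ checkpoint that contains unquantized layers (e.g. the MoE gate);
# with this PR the skipped modules come from `modules_in_block_to_quantize`
# or are inferred from checkpoint metadata instead of the old hacky patch.
llm = LLM(model="Qwen/Qwen3-30B-A3B-GPTQ-Int4")

outputs = llm.generate(
    ["Explain GPTQ quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```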
Test Result
Qwen3-MoE models with and without the MoE gate quantized can all generate normal outputs without the GPTQ patch now.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.