[Feature][Quantization] auto_round format add support for regex #24024
mgoin merged 14 commits into vllm-project:main
Conversation
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Code Review
This pull request introduces support for regular expressions in AutoRound's extra_config, which is a valuable feature for defining quantization settings for groups of layers. However, the current implementation has a critical correctness issue where literal layer names can be misinterpreted as regex patterns, potentially leading to incorrect quantization. My review provides a comment with a suggested code change to address this by using a heuristic to differentiate between literal names and regex patterns, which also improves performance.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Heng Guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
@mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 could you please help review this PR?
yewentao256
left a comment
Could you provide more context for this PR?
E.g., which models use it, and what behavior looks like without this PR versus with it.
Also, lm_eval results for accuracy and vllm bench results for performance would be helpful.
@yewentao256 Currently, auto_round handles mixed-precision quantization models by saving the full name of every layer in the config. With the support of this PR, those settings can instead be expressed as regular expressions; this PR primarily adds reading of such regex configurations. All models quantized by auto_round will use this in the future. For example, if I use this script to generate a quantized Qwen model:

```python
from auto_round import AutoRound

model_path = "Qwen/Qwen3-15B-A2B-Base/"
layer_config = {
    "self_attn.[koqv]_proj$": {"bits": 8},
}
ar = AutoRound(model=model_path, scheme="W4A16", layer_config=layer_config, iters=1)
ar.quantize_and_save("Qwen3-15B-A2B-Base-vllm-regex-test")
```

The resulting config.json includes a parameter showing that all non-expert (attention projection) linears fall back to 8 bits. With the old format, it looks like this:

```json
"quantization_config": {
    "autoround_version": "0.8.0.dev",
    "bits": 4,
    "data_type": "int",
    "extra_config": {
        "model.layers.0.self_attn.k_proj": {
            "bits": 8
        },
        "model.layers.0.self_attn.o_proj": {
            "bits": 8
        },
        "model.layers.0.self_attn.q_proj": {
            "bits": 8
        },
        "model.layers.0.self_attn.v_proj": {
            "bits": 8
        },
        "model.layers.1.self_attn.k_proj": {
            "bits": 8
        },
        "model.layers.1.self_attn.o_proj": {
            "bits": 8
        },
        "model.layers.1.self_attn.q_proj": {
            "bits": 8
        },
```

And with the support of this PR, it can be simplified to:

```json
"quantization_config": {
    "autoround_version": "0.8.0.dev",
    "bits": 4,
    "data_type": "int",
    "extra_config": {
        "self_attn.[koqv]_proj$": {"bits": 8}
    }
```
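As a quick illustration (not part of the PR), a short snippet can confirm that the single regex entry in the new-style config covers the same layers the old-style config listed explicitly:

```python
import re

# Layer names the old-style config enumerated one by one
# (only layers 0 and 1 are shown here, matching the excerpt above).
layer_names = [
    f"model.layers.{i}.self_attn.{p}_proj"
    for i in range(2)
    for p in "koqv"
]

# The single regex entry from the new-style extra_config.
pattern = r"self_attn.[koqv]_proj$"

# Every explicitly listed layer matches the regex...
assert all(re.search(pattern, name) for name in layer_names)
# ...while expert/MLP layers do not, so they keep the default 4-bit scheme.
assert re.search(pattern, "model.layers.0.mlp.experts.0.down_proj") is None
print("regex covers all", len(layer_names), "listed layers")
```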
This PR does not affect model accuracy; here are our test results:
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…-project#24024) Signed-off-by: n1ck-guo <heng.guo@intel.com> Signed-off-by: Heng Guo <heng.guo@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 1994 <1994@users.noreply.github.com>
Purpose
Add support for regular expressions in the auto_round format's extra_config.
Test Plan
Load an auto_round-quantized model whose extra_config contains both regular expressions and full layer names.
Test Result
With this change, every linear layer whose name matches a regex in extra_config (for example, ".*mlp.down_proj": {"bits": 16}) receives the correct bit width.
Mixed-bit quantized models load successfully with the auto-round quant_method.
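For reference, a minimal sketch of how such a regex fallback might resolve a layer's bit width. This is an illustration, not the PR's actual code: `bits_for` is a hypothetical helper, and the use of `re.fullmatch` for the matching semantics is an assumption.

```python
import re

# Hypothetical extra_config mirroring the example above: down_proj layers
# fall back to 16 bits, everything else keeps the model-wide default.
extra_config = {".*mlp.down_proj": {"bits": 16}}
DEFAULT_BITS = 4  # the model-wide scheme from quantization_config["bits"]

def bits_for(layer_name: str) -> int:
    """Return the bit width for a layer: regex overrides win, else default."""
    for pattern, cfg in extra_config.items():
        if re.fullmatch(pattern, layer_name):
            return cfg["bits"]
    return DEFAULT_BITS

assert bits_for("model.layers.3.mlp.down_proj") == 16
assert bits_for("model.layers.3.self_attn.q_proj") == 4
```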