Shared emb_tokens/lm_head on fp16 & uint4 weights #1854
Conversation
…im for reshape of packed weights
@microsoft-github-policy-service agree company="Microsoft"
src/python/py/models/builder.py
Outdated
```python
)

# Allow extra_options to override use_packed_matmul
if "unpack_matmul" in extra_options:
```
This is an optimization opportunity that should be auto-detected by the model builder. We should not need to give the responsibility to the user. You can see the review comments on this PR for more details.
Thanks for the details. We aim to deliver fine-tuned weights for open-source models like Llama 3.2 and Qwen3. These models do not originally use a packed qkv_proj; we fine-tuned them with unpacked projections, so it may be better to deliver the weights in a form consistent with their originals. The unpack option defaults to False unless it is set in extra_options. It would be great to have this option for development.
```diff
-        elif quant_method in {"k_quant_mixed", "k_quant_last"}:
+        elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
             from onnxruntime.quantization.matmul_nbits_quantizer import KQuantWeightOnlyQuantConfig
```
Let's move this import up. It was previously here because it was not part of a stable release.
onnxruntime-genai/src/python/py/models/builder.py, lines 24 to 28 in d4eabac:

```python
from onnxruntime.quantization.matmul_nbits_quantizer import (
    MatMulNBitsQuantizer,
    QuantFormat,
    RTNWeightOnlyQuantConfig,
)
```
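For reference, the consolidated import the reviewer describes could look like this once the minimum ORT version allows it (a sketch; it assumes `KQuantWeightOnlyQuantConfig` is exported by the installed onnxruntime release):

```python
from onnxruntime.quantization.matmul_nbits_quantizer import (
    KQuantWeightOnlyQuantConfig,
    MatMulNBitsQuantizer,
    QuantFormat,
    RTNWeightOnlyQuantConfig,
)
```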
The current build checks still use ORT 1.22, which would block this PR if I moved the import to the top. It would be better to change it after updating the check tests.
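One possible workaround, not part of this PR, would be to guard the top-level import so builds against older ORT releases still pass (a sketch under that assumption):

```python
try:
    from onnxruntime.quantization.matmul_nbits_quantizer import KQuantWeightOnlyQuantConfig
except ImportError:
    # Assumed behavior: older onnxruntime releases (e.g. 1.22) do not ship KQuantWeightOnlyQuantConfig.
    KQuantWeightOnlyQuantConfig = None
```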
```diff
-        if quant_method == "rtn":
-            int4_algo_config = RTNWeightOnlyQuantConfig()
+        if quant_method in {"rtn", "rtn_last"}:
```
I think this can be simplified to the following.
```python
if quant_method in {"rtn", "rtn_last"}:
    if quant_method == "rtn_last":
        customized_weight_config["/lm_head/MatMul"] = {"bits": 8}
    int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)
```
Done.
```diff
             int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)

-        elif quant_method in {"k_quant_mixed", "k_quant_last"}:
+        elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
```
I think this can be simplified to the following.
```python
elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
    if quant_method != "k_quant":
        customized_weight_config["/lm_head/MatMul"] = {"bits": 8}
    if quant_method == "k_quant_mixed":
        # k_quant_mixed is from llama.cpp.
        # Reference: https://github.com/ggml-org/llama.cpp/blob/36667c8edcded08063ed51c7d57e9e086bbfc903/src/llama-quant.cpp#L136
        # We also consider that some MatMuls are more sensitive to quantization than others.
        layers_to_exclude = [
            i
            for i in range(self.num_layers)
            if i < self.num_layers / 8 or i >= 7 * self.num_layers / 8 or (i - round(self.num_layers / 8)) % 3 == 2
        ]
        for i in layers_to_exclude:
            customized_weight_config["/model/layers." + str(i) + "/attn/qkv_proj/MatMul"] = {"bits": 8}
            customized_weight_config["/model/layers." + str(i) + "/attn/v_proj/MatMul"] = {"bits": 8}
            customized_weight_config["/model/layers." + str(i) + "/mlp/down_proj/MatMul"] = {"bits": 8}
    int4_algo_config = KQuantWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)
```
Done.
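To illustrate the exclusion pattern, here is a standalone check of the same formula for a hypothetical 32-layer model (the layer count is an assumption for the example); the first and last eighth of the layers, plus every third layer in between, keep 8-bit attention and down-projection weights:

```python
num_layers = 32  # hypothetical layer count for illustration
layers_to_exclude = [
    i
    for i in range(num_layers)
    if i < num_layers / 8 or i >= 7 * num_layers / 8 or (i - round(num_layers / 8)) % 3 == 2
]
print(layers_to_exclude)
# -> [0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24, 27, 28, 29, 30, 31]
```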
src/python/py/models/builder.py
Outdated
```diff
-        self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last"}
-        if not self.int8_lm_head:
+        self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
+        if not self.int8_lm_head and extra_options.get("int4_algo_config", "default") not in {"rtn", "k_quant"}:
```
Can we rewrite the above section and the if condition to just match on the conditions needed for tied embeddings to be true and otherwise set it to false?
Something like this:
```python
self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
self.int4_tied_embeddings = extra_options.get("int4_tied_embeddings", config.tie_word_embeddings if hasattr(config, "tie_word_embeddings") and config.tie_word_embeddings is not None else False)
# matmul_nbits_quantizer.py has a different naming for default quantization, so lm_head.MatMul.weight_Q{}G{} does not match.
# tied_embeddings lm_head.MatMul.weight_Q{}G{} only works with rtn & k_quant on 4-bit.
self.int4_tied_embeddings = <boolean expression>
```
Done.
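For context, one possible concrete form of the `<boolean expression>` placeholder, inferred from the comments in this thread (an assumption, not necessarily the merged code):

```python
self.int4_tied_embeddings = (
    self.int4_tied_embeddings                                  # requested via extra_options or tie_word_embeddings
    and not self.int8_lm_head                                  # an 8-bit lm_head changes the weight naming
    and extra_options.get("int4_algo_config", "default") in {"rtn", "k_quant"}  # tied weights only for 4-bit rtn/k_quant
)
```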
Can you update the options for onnxruntime-genai/src/python/py/models/builder.py (lines 4685 to 4688 in d4eabac)?
@kunal-vaishnavi Thanks for the review! I edited the code and addressed most of your comments.
Added an option to create a shared emb_tokens/lm_head for the float MatMul node.

Added. Done.
```python
lm_head_excluded = "/lm_head/MatMul" in self.quant_attrs["int4"]["nodes_to_exclude"]

self.int4_tied_embeddings = extra_options.get("int4_tied_embeddings", config.tie_word_embeddings if hasattr(config, "tie_word_embeddings") and config.tie_word_embeddings is not None else False)
self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
```
Why is int4_algo_config used to set int8_lm_head here? Our models support different bit widths for different weights. I think a more straightforward setting, such as a mapping from weight name to n_bits or a dedicated configuration for lm_head, would be better.
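A sketch of the kind of explicit configuration described here, built on the `customized_weight_config` dict already used in this thread (the exact keys and defaults are assumptions, not part of this PR):

```python
# Hypothetical: map node names directly to bit widths instead of inferring
# the lm_head bit width from the int4_algo_config name.
customized_weight_config = {
    "/lm_head/MatMul": {"bits": 8},  # keep the unembedding at 8 bits
    # every other MatMul falls back to the default 4-bit settings
}
int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)
```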
```
k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
int4_tied_embeddings = Enable weight sharing for quantization. Default is false.
                       Use this option when you want to share the weights in the embedding and unembedding.
int4_tied_embeddings = Enable weight sharing for quantized models (INT4/UINT4/INT8/UINT8). Default is false.
```
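For illustration, a hypothetical invocation combining this option with the other int4 settings used elsewhere in this PR (the model name and output directory are placeholders):

```
python src/python/py/models/builder.py -m <model_name> -p int4 -e cpu -o <output_dir> --extra_options int4_is_symmetric=false int4_algo_config=k_quant int4_tied_embeddings=true
```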
Do we need this option?
If we share the embeddings (lm_head), that also means the quantized weights should be shared.
Knowing the quantization method and the number of bits should be enough.
Closed this PR due to the refactor of the model builder.
## Problem

This is a refactored PR from the [previous shared emb PR](#1854). The current model builder doesn't support shared embedding layers with 4-bit qweights or 16-bit float weights, which occupies more room on disk (unnecessary for models with originally tied embeddings) and hurts the compression rate of quantized models. builder.py doesn't provide flexible options to toggle the graph construction and quantization config, like rtn, k_quant, etc.

## Solution

Calculated flat_dim in a more generic way on the Reshape node before `GatherBlockQuantized` (supports 4-bit and 8-bit). Added CUDA kernel support in ORT [#26484](microsoft/onnxruntime#26484). Added more extra_options to enable different quant configs, pack options, and shared embeddings.

Running examples:

**shared 4-bit k_quant on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant
```

**shared 16-bit float emb on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```

## Changes

### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Computed `flat_dim` in a generic manner before feeding in `GatherBlockQuantized`.
2. Explicitly defined gather_axis and quantize_axis for clarity.
3. Added `shared_embeddings` option to tie embed_tokens/lm_head.
4. Added `rtn_last`, like `k_quant_last`, as a new mixed-precision option.
5. Added `k_quant`, like `rtn`, as a new 4-bit quantizer option.
6. Removed `int4_tied_embeddings` and merged it into `shared_embeddings`.
7. Added documentation.
## Problem

The current model builder doesn't support shared embedding layers with 4-bit qweights or 16-bit float weights, which occupies more room on disk (unnecessary for models with originally tied embeddings) and hurts the compression rate of quantized models. builder.py doesn't provide flexible options to toggle the graph construction and quantization config, like unpacked/packed MatMul, rtn, k_quant, etc.

## Solution

Calculated flat_dim in a more generic way on the Reshape node before `GatherBlockQuantized` (supports 4-bit and 8-bit). Added CUDA kernel support in ORT #26484. Added more extra_options to enable different quant configs, pack options, and shared embeddings.

Running examples:

**unpacked qkv_projs and shared 4-bit RTN on Llama3.2 1B Instruct**

**shared 4-bit k_quant on Phi-4-Mini Instruct**

**shared 16-bit float emb on Phi-4-Mini Instruct**

## Changes

### Modified Files
- `src/python/py/models/builder.py`

### Key Modifications
1. Computed `flat_dim` in a generic manner before feeding in `GatherBlockQuantized`.
2. Added `unpack_matmul` option to separate qkv_proj if needed.
3. Added `shared_embeddings` option to tie embed_tokens/lm_head.
4. Added `rtn_last`, like `k_quant_last`, as a new mixed-precision option.
5. Added `k_quant`, like `rtn`, as a new 4-bit quantizer option.
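As a rough illustration of the generic `flat_dim` idea described above (the shapes, block size, and variable names here are assumptions for the sketch, not the exact builder.py code): a blockwise-packed weight of shape `[vocab_size, k_blocks, blob_size]` is flattened over every non-vocab dimension, so the same Reshape works for 4-bit and 8-bit packing and the resulting table can be gathered by token id.

```python
import numpy as np

# Hypothetical dimensions for the sketch (4-bit weights, block size 32)
bits, block_size = 4, 32
vocab_size, hidden_size = 32064, 3072
k_blocks = hidden_size // block_size
blob_size = block_size * bits // 8              # packed bytes per block

packed_weight = np.zeros((vocab_size, k_blocks, blob_size), dtype=np.uint8)

# Generic flat_dim: product of all dimensions after the vocab (gather) axis.
flat_dim = int(np.prod(packed_weight.shape[1:]))
shared_table = packed_weight.reshape(vocab_size, flat_dim)
assert flat_dim == hidden_size * bits // 8      # 1536 packed bytes per row in this example
```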