
Conversation

@jixiongdeng (Contributor) commented Nov 3, 2025

Problem

The current model builder doesn't support shared embedding layers with 4-bit qweights or 16-bit float weights, which takes up more disk space than necessary for models with originally tied embeddings and hurts the compression rate of quantized models. builder.py also doesn't provide flexible options to toggle graph construction and quantization configs, such as unpacked/packed MatMul, RTN, k_quant, etc.

Solution

Calculated flat_dim in a more generic way on the Reshape node before GatherBlockQuantized, supporting both 4-bit and 8-bit (see the sketch below).
Added CUDA kernel support in ORT #26484.
Added more extra_options to enable different quant configs, pack options, and shared embeddings.
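
A minimal sketch of the flat_dim idea from the first point above (a hypothetical helper, not the code in builder.py):

```python
# Hypothetical sketch: the packed qweight stores 2 values per byte for 4-bit and 1 per byte
# for 8-bit, so the flattened row length fed to the Reshape before GatherBlockQuantized
# can be derived from hidden_size and the bit width instead of being hard-coded for 4-bit.
def packed_flat_dim(hidden_size: int, bits: int) -> int:
    elems_per_byte = 8 // bits        # 2 for 4-bit, 1 for 8-bit
    return hidden_size // elems_per_byte

print(packed_flat_dim(3072, 4))  # 1536
print(packed_flat_dim(3072, 8))  # 3072
```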

Running examples:

**unpacked qkv_projs and shared 4 bit RTN on Llama3.2 1B Instruct**:
```
python src/python/py/models/builder.py -m meta-llama/Llama-3.2-1B-Instruct -p int4 -e cuda -o export_model/llama32_1bi_rtn_4_4_unpacked_tied --extra_options int4_is_symmetric=false unpack_matmul=true int4_algo_config=rtn
```

**shared 4 bit k_quant on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false unpack_matmul=true int4_algo_config=k_quant
```

**shared 16 bit float emb on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```

Changes

Modified Files

  • src/python/py/models/builder.py

Key Modifications

  1. Computed flat_dim in a generic manner before feeding it into GatherBlockQuantized.
  2. Explicitly defined gather_axis and quantize_axis for clarity.
  3. Added unpack_matmul option to separate qkv_proj if needed.
  4. Added shared_embeddings option to tie embed_tokens/lm_head.
  5. Added rtn_last, like k_quant_last, as a new mixed-precision option.
  6. Added k_quant, like rtn, as a new 4-bit quantizer option.

@jixiongdeng (Contributor Author):

@jixiongdeng please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"

@jixiongdeng jixiongdeng requested a review from jambayk November 3, 2025 23:27
```python
)

# Allow extra_options to override use_packed_matmul
if "unpack_matmul" in extra_options:
```
Contributor:

This is an optimization opportunity that should be auto-detected by the model builder. We should not need to give the responsibility to the user. You can see the review comments on this PR for more details.

Contributor Author (@jixiongdeng):

Thanks for the details. We aim to deliver fine-tuned weights for open-source models like Llama 3.2 and Qwen3. These models don't originally come with a packed qkv_proj. Since we fine-tuned them with unpacked projections, it may be better to deliver the weights in a form consistent with their original layout. The unpack option defaults to False unless it is set in extra_options. It would be great to have this option for development.
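
For context, a minimal sketch of how the override could behave (the default value and string parsing here are assumptions for illustration, not the PR's exact code):

```python
# Hypothetical sketch: use_packed_matmul is assumed to be set earlier by the builder;
# the extra_option (passed as a string on the command line) then overrides it.
use_packed_matmul = True  # placeholder for the builder's own default decision
if "unpack_matmul" in extra_options:
    use_packed_matmul = extra_options["unpack_matmul"].lower() != "true"
```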


elif quant_method in {"k_quant_mixed", "k_quant_last"}:
elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
from onnxruntime.quantization.matmul_nbits_quantizer import KQuantWeightOnlyQuantConfig
Contributor:

Let's move this import up. It was previously here because it was not part of a stable release.

```python
from onnxruntime.quantization.matmul_nbits_quantizer import (
    MatMulNBitsQuantizer,
    QuantFormat,
    RTNWeightOnlyQuantConfig,
)
```

Contributor Author (@jixiongdeng):

The current build checks still use ORT 1.22, which would block this PR if I moved the import to the top. It would be better to change this after updating the check tests.


if quant_method == "rtn":
int4_algo_config = RTNWeightOnlyQuantConfig()
if quant_method in {"rtn", "rtn_last"}:
Contributor:

I think this can be simplified to the following.

if quant_method in {"rtn", "rtn_last"}:
    if quant_method == "rtn_last":
        customized_weight_config["/lm_head/MatMul"] = {"bits": 8}
    int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)

Contributor Author (@jixiongdeng):

Done.

```diff
     int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)

-elif quant_method in {"k_quant_mixed", "k_quant_last"}:
+elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
```
Contributor:

I think this can be simplified to the following.

elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
    if quant_method != "k_quant":
        customized_weight_config["/lm_head/MatMul"] = {"bits": 8}

    if quant_method == "k_quant_mixed":
        # k_quant_mixed is from llama.cpp.
        # Reference: https://github.com/ggml-org/llama.cpp/blob/36667c8edcded08063ed51c7d57e9e086bbfc903/src/llama-quant.cpp#L136
        # We also consider some MatMuls are more senstive to quantization than other MatMuls.
        layers_to_exclude = [
            i
            for i in range(self.num_layers)
            if i < self.num_layers / 8 or i >= 7 * self.num_layers / 8 or (i - (round)(self.num_layers / 8)) % 3 == 2
        ]
        for i in layers_to_exclude:
            customized_weight_config["/model/layers." + str(i) + "/attn/qkv_proj/MatMul"] = {"bits": 8}
            customized_weight_config["/model/layers." + str(i) + "/attn/v_proj/MatMul"] = {"bits": 8}
            customized_weight_config["/model/layers." + str(i) + "/mlp/down_proj/MatMul"] = {"bits": 8}

    int4_algo_config = KQuantWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)

Contributor Author (@jixiongdeng):

Done.
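
As a quick, hypothetical illustration of the exclusion rule (not part of the PR): for a 32-layer model it keeps the first and last eighth of the layers, plus every third layer in between, at 8 bits for the listed MatMuls.

```python
# Hypothetical check of the k_quant_mixed exclusion rule for num_layers = 32.
num_layers = 32
layers_to_exclude = [
    i
    for i in range(num_layers)
    if i < num_layers / 8 or i >= 7 * num_layers / 8 or (i - round(num_layers / 8)) % 3 == 2
]
print(layers_to_exclude)
# [0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24, 27, 28, 29, 30, 31]
```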

self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last"}
if not self.int8_lm_head:
self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
if not self.int8_lm_head and extra_options.get("int4_algo_config", "default") not in {"rtn", "k_quant"}:
Contributor:

Can we rewrite the above section and the if condition to just match on the conditions needed for tied embeddings to be true and otherwise set it to false?

Something like this:

```python
self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
self.int4_tied_embeddings = extra_options.get("int4_tied_embeddings", config.tie_word_embeddings if hasattr(config, "tie_word_embeddings") and config.tie_word_embeddings is not None else False)

# matmul_nbits_quantizer.py has a different naming for default quantization, so lm_head.MatMul.weight_Q{}G{} does not match.
# tied_embeddings lm_head.MatMul.weight_Q{}G{} only works with rtn&k_quant on 4bit
self.int4_tied_embeddings = <boolean expression>
```

Contributor Author (@jixiongdeng):

Done.
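
One way that boolean expression could be filled in, consistent with the comments above (a sketch only, not necessarily the final code in builder.py):

```python
# Hypothetical sketch: keep tied embeddings only when the option/config asked for them,
# lm_head stays int4, and the quantizer is rtn or k_quant (so the
# lm_head.MatMul.weight_Q{}G{} naming matches matmul_nbits_quantizer.py's output).
algo = extra_options.get("int4_algo_config", "default")
self.int4_tied_embeddings = (
    self.int4_tied_embeddings
    and not self.int8_lm_head
    and algo in {"rtn", "k_quant"}
)
```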

@kunal-vaishnavi (Contributor):

Can you update the options for int4_algo_config here and add their descriptions?

```
int4_algo_config = Method for int4 quantization. Default is 'default'.
    Currently supported options are: 'default', 'rtn', 'k_quant_mixed', 'k_quant_last'.
    k_quant_mixed = k_quant algorithm with mixed precision (int4 + int8).
    k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
```

@jixiongdeng (Contributor Author):

@kunal-vaishnavi Thanks for the review! I edited the code and adopted most of your comments.
The unpack_matmul option relates to my other project, which involves weight replacement.
i.e., the Llama 3.2 series models originally have q_proj, k_proj, and v_proj separately. It would be great if builder.py could provide an easy option to stay consistent with the torch models' projections.

@jixiongdeng (Contributor Author) commented Nov 12, 2025

Added an option to create a shared embed_tokens/lm_head for the float MatMul node.
This benefits small models with a smaller file size, especially low-bit quantized models.
i.e. phi4mini fp16: (8.3G -> 7.2G) ~ 13.2% smaller
phi4mini int4 + fp16 embed_tokens/lm_head: (4.1G -> 2.9G) ~ 29.3% smaller
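
One way such sharing can be expressed in ONNX (a hypothetical fragment; the PR's actual graph construction may differ) is to keep a single [vocab_size, hidden_size] initializer and transpose it for the final MatMul:

```python
# Hypothetical ONNX fragment: one fp16 initializer feeds both the token-embedding Gather
# and (via a Transpose) the lm_head MatMul, so the weight is stored once on disk.
import numpy as np
from onnx import helper, numpy_helper

vocab_size, hidden_size = 8, 4  # toy sizes for illustration
shared_weight = numpy_helper.from_array(
    np.zeros((vocab_size, hidden_size), dtype=np.float16), name="model.embed_tokens.weight"
)

embed = helper.make_node("Gather", ["model.embed_tokens.weight", "input_ids"], ["inputs_embeds"])
transpose = helper.make_node("Transpose", ["model.embed_tokens.weight"], ["lm_head_weight_T"], perm=[1, 0])
lm_head = helper.make_node("MatMul", ["hidden_states", "lm_head_weight_T"], ["logits"])
```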

@jixiongdeng changed the title from "Shared emb_tokens/lm_head on nibbled 4bit qweights" to "Shared emb_tokens/lm_head on fp16 & uint4 weights" on Nov 13, 2025
@jixiongdeng (Contributor Author):

Added. Done.
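
For reference, the expanded description would presumably read along these lines (the exact wording in builder.py may differ):

```
int4_algo_config = Method for int4 quantization. Default is 'default'.
    Currently supported options are: 'default', 'rtn', 'rtn_last', 'k_quant', 'k_quant_mixed', 'k_quant_last'.
    rtn = round-to-nearest quantization where all MatMuls are quantized as int4.
    rtn_last = round-to-nearest quantization where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
    k_quant = k_quant algorithm where all MatMuls are quantized as int4.
    k_quant_mixed = k_quant algorithm with mixed precision (int4 + int8).
    k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
```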

```python
lm_head_excluded = "/lm_head/MatMul" in self.quant_attrs["int4"]["nodes_to_exclude"]

self.int4_tied_embeddings = extra_options.get("int4_tied_embeddings", config.tie_word_embeddings if hasattr(config, "tie_word_embeddings") and config.tie_word_embeddings is not None else False)
self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
```
@tianleiwu (Contributor) commented Nov 17, 2025:

Why is int4_algo_config used to set int8_lm_head here? Our models support different bits for different weights. I think it would be more straightforward to have a setting like a weight-name-to-n_bits mapping, or a dedicated configuration for lm_head.

```
k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
int4_tied_embeddings = Enable weight sharing for quantization. Default is false.
    Use this option when you want to share the weights in the embedding and unembedding.
int4_tied_embeddings = Enable weight sharing for quantized models (INT4/UINT4/INT8/UINT8). Default is false.
```
Contributor:

Do we need this option?

If we share the embeddings (lm_head), that also means the quantized weights will be shared.

As long as we know quantization method and number of bits, that will be enough.

@jixiongdeng (Contributor Author):

Closed this PR due to a refactor of the model builder, to avoid a massive unnecessary rebase.
New PR at: #1885

@jixiongdeng jixiongdeng deleted the jdeng/shared_4bit_emb branch November 20, 2025 02:42
kunal-vaishnavi pushed a commit that referenced this pull request Nov 26, 2025
## Problem
This is a refactored PR from the [previous shared emb PR](#1854).

The current model builder doesn't support shared embedding layers with 4-bit qweights or 16-bit float weights, which takes up more disk space than necessary for models with originally tied embeddings and hurts the compression rate of quantized models. builder.py doesn't provide flexible options to toggle graph construction and quantization configs, like rtn, kquant, etc.

## Solution

Calculated flat_dim in a more generic way on the Reshape node before `GatherBlockQuantized` (supports 4-bit and 8-bit).
Added CUDA kernel support in ORT [#26484](microsoft/onnxruntime#26484).
Added more extra_options to enable different quant configs, pack options, and shared embeddings.

Running examples:

**shared 4 bit k_quant on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant
```

**shared 16 bit float emb on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```

## Changes

### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Computed `flat_dim` in a generic manner before feeding it into `GatherBlockQuantized`.
2. Explicitly defined gather_axis and quantize_axis for clarity.
3. Added `shared_embeddings` option to tie embed_tokens/lm_head.
4. Added `rtn_last`, like `k_quant_last`, as a new mixed-precision option.
5. Added `k_quant`, like `rtn`, as a new 4-bit quantizer option.
6. Removed `int4_tied_embeddings` and merged it into `shared_embeddings`.
7. Added documentation.
kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025