Generic shared emb_tokens/lm_head implementation #1885
Merged
Conversation
Contributor (Author)
@kunal-vaishnavi Thanks for the review! I updated this PR per your suggestions. All checks passed. PTAL.
kunal-vaishnavi approved these changes on Nov 26, 2025
tianleiwu pushed a commit that referenced this pull request on Nov 26, 2025:
…hoice (#1893)

## Problem
As we discussed in [this PR](#1885), I separated the `disable_qkv_fusion` option into a new PR. The current model builder ties q_proj, k_proj, and v_proj together as qkv_proj by default, which is not controllable by an upstream quantization choice.

## Solution
Added `disable_qkv_fusion` in extra_options to override `attention_attrs["use_packed_matmul"]`.

Running example:

**untied qkv_projs for 4-bit rtn on Llama-3.2-3B-Instruct**:
```
python src/python/py/models/builder.py -m meta-llama/Llama-3.2-3B-Instruct -p int4 -e cuda -o export_model/llama32_3bi_rtn_u4_untied_qkv --extra_options int4_algo_config=rtn disable_qkv_fusion=true
```

## Changes
### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Added `disable_qkv_fusion` to the assignment logic of `attention_attrs["use_packed_matmul"]`.
2. Added documentation.
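A minimal sketch of the option wiring described in that commit, using assumed helper names rather than the verbatim builder.py code:

```python
# Hypothetical sketch (illustrative names, not builder.py's actual logic):
# a disable_qkv_fusion extra_option gating the packed-QKV MatMul decision.
def resolve_use_packed_matmul(extra_options: dict, packable_by_default: bool) -> bool:
    # "disable_qkv_fusion=true" keeps q_proj/k_proj/v_proj as separate MatMuls,
    # so an upstream quantizer can treat each projection independently.
    disabled = str(extra_options.get("disable_qkv_fusion", "false")).lower() == "true"
    return packable_by_default and not disabled

print(resolve_use_packed_matmul({"disable_qkv_fusion": "true"}, True))  # False -> untied q/k/v
print(resolve_use_packed_matmul({}, True))                              # True  -> fused qkv_proj
```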
kunal-vaishnavi pushed a commit that referenced this pull request on Dec 5, 2025:
## Problem
This is a refactored PR from the [previous shared emb PR](#1854). The current model builder doesn't support shared embedding layers with 4-bit qweights or 16-bit float weights, which takes up more disk space (unnecessary for models whose embeddings were originally tied) and hurts the compression rate of quantized models. builder.py also doesn't provide flexible options to toggle the graph construction and quantization config, such as rtn, k_quant, etc.

## Solution
Calculated flat_dim in a more generic way on the reshape node before `GatherBlockQuantized` (supports 4-bit and 8-bit). Added CUDA kernel support in ORT [#26484](microsoft/onnxruntime#26484). Added more extra_options to enable different quant configs, pack options, and shared embeddings.

Running examples:

**shared 4-bit k_quant on Phi-4-Mini-Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant
```

**shared 16-bit float emb on Phi-4-Mini-Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```

## Changes
### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Computed `flat_dim` in a generic manner before feeding it into `GatherBlockQuantized`.
2. Explicitly defined gather_axis and quantize_axis for clarity.
3. Added the `shared_embeddings` option to tie embed_tokens/lm_head.
4. Added `rtn_last`, like `k_quant_last`, as a new mixed-precision option.
5. Added `k_quant`, like `rtn`, as a new 4-bit quantizer option.
6. Removed `int4_tied_embeddings` and merged it into `shared_embeddings`.
7. Added documentation.
kunal-vaishnavi pushed a commit that referenced this pull request on Dec 5, 2025: …hoice (#1893)
Problem
This is a refactored PR from the previous shared emb PR.
The current model builder doesn't support shared embedding layers with 4-bit qweights or 16-bit float weights, which takes up more disk space (unnecessary for models whose embeddings were originally tied) and hurts the compression rate of quantized models. builder.py also doesn't provide flexible options to toggle the graph construction and quantization config, such as rtn, k_quant, etc.
Solution
Calculated flat_dim in a more generic way on the reshape node before GatherBlockQuantized (supports 4-bit and 8-bit). Added CUDA kernel support in ORT #26484.
Added more extra_options to enable different quant configs, pack options, and shared embeddings.
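As a rough illustration of what a bit-width-generic flat_dim can look like, here is a minimal sketch with assumed names and sizes; it is not the builder.py implementation:

```python
# Minimal sketch (assumed names, not builder.py code): the flattened row length
# of a block-quantized embedding weight, generic over 4-bit and 8-bit packing.
def packed_flat_dim(hidden_size: int, bits: int) -> int:
    assert bits in (4, 8), "this sketch assumes 4-bit or 8-bit qweights"
    elems_per_byte = 8 // bits        # 2 values per byte for int4, 1 for int8
    return hidden_size // elems_per_byte

# Example with a hypothetical hidden size of 3072:
print(packed_flat_dim(3072, 4))  # 1536 packed bytes per embedding row
print(packed_flat_dim(3072, 8))  # 3072
```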
Running examples:
shared 4-bit k_quant on Phi-4-Mini-Instruct:
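```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant
```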
shared 16-bit float emb on Phi-4-Mini-Instruct:
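```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```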
Changes
Modified Files
- src/python/py/models/builder.py
- src/python/py/models/builders/base.py
- src/python/py/models/README.MD

Key Modifications
1. Computed flat_dim in a generic manner before feeding it into GatherBlockQuantized.
2. Explicitly defined gather_axis and quantize_axis for clarity.
3. Added the shared_embeddings option to tie embed_tokens/lm_head (see the sketch after this list).
4. Added rtn_last, like k_quant_last, as a new mixed-precision option.
5. Added k_quant, like rtn, as a new 4-bit quantizer option.
6. Removed int4_tied_embeddings and merged it into shared_embeddings.
7. Added documentation.
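For context on what tying embed_tokens/lm_head means, here is a minimal conceptual sketch (toy names and sizes, not the actual graph-building code): one weight matrix serves both the token-lookup Gather and the final logits MatMul, so it only needs to be stored once in the exported model.

```python
# Conceptual sketch (assumed names, not builder.py graph code): with shared
# embeddings, a single weight backs both the embedding lookup and lm_head.
import numpy as np

vocab_size, hidden_size = 8, 4                       # toy sizes for illustration
embedding = np.random.randn(vocab_size, hidden_size).astype(np.float16)

def embed(input_ids):
    return embedding[input_ids]                      # Gather over the shared weight

def lm_head(hidden_states):
    return hidden_states @ embedding.T               # MatMul reusing the same weight

tokens = np.array([1, 3, 5])
logits = lm_head(embed(tokens))
print(logits.shape)                                  # (3, 8): one row of logits per token
```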