Conversation

@jixiongdeng (Contributor) commented Nov 20, 2025

Problem

This PR is a refactored version of the previous shared embeddings PR (#1854).

The current model builder doesn't support sharing the embedding layer between embed_tokens and lm_head for 4-bit quantized weights or 16-bit float weights, so the duplicated weights take extra disk space (unnecessarily, for models whose embeddings are tied in the original checkpoint) and hurt the compression rate of quantized models. builder.py also doesn't provide flexible options to toggle the graph construction and quantization config (e.g. rtn, k_quant).

Solution

Calculated flat_dim in a more generic way on the reshape node before GatherBlockQuantized (supports 4-bit and 8-bit packing); see the sketch below.
Added CUDA kernel support in ORT #26484.
Added more extra_options to enable different quantization configs, packing options, and shared embeddings.
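
A minimal sketch of the generic packing arithmetic (my own illustration, not the actual builder.py code): for b-bit quantization packed into uint8, each byte holds 8 // b values, so the same formula covers 4-bit and 8-bit.

```python
# Hypothetical sketch: per-row flat dimension of a packed qweight for 4-bit or 8-bit.
import math

def packed_flat_dim(hidden_size: int, bits: int) -> int:
    """Number of uint8 elements per row after packing `bits`-bit values."""
    values_per_byte = 8 // bits          # 2 for 4-bit, 1 for 8-bit
    return math.ceil(hidden_size / values_per_byte)

print(packed_flat_dim(3072, 4))  # 1536 packed bytes per row
print(packed_flat_dim(3072, 8))  # 3072 packed bytes per row
```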

Running examples:

shared 4-bit k_quant on Phi-4-Mini-Instruct:

python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant

shared 16-bit float embeddings on Phi-4-Mini-Instruct:

python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
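
To illustrate what shared_embeddings means at the graph level, here is a hedged sketch with onnx.helper (toy shapes, invented tensor names, and not the builder's actual construction code): the lm_head MatMul consumes the same initializer as the embed_tokens Gather through a Transpose, so the vocabulary-sized weight is stored only once.

```python
# Hypothetical sketch: one shared weight initializer feeds both embed_tokens and lm_head.
import numpy as np
from onnx import helper, numpy_helper

# Toy shape; a real model would use [vocab_size, hidden_size].
shared_w = numpy_helper.from_array(
    np.zeros((8, 4), dtype=np.float16), name="model.embed_tokens.weight"
)
# embed_tokens: look up token rows from the shared table.
embed = helper.make_node(
    "Gather", ["model.embed_tokens.weight", "input_ids"], ["inputs_embeds"]
)
# lm_head: logits = hidden_states @ W^T, reusing the same initializer via a Transpose.
transpose = helper.make_node(
    "Transpose", ["model.embed_tokens.weight"], ["lm_head.weight_T"], perm=[1, 0]
)
lm_head = helper.make_node("MatMul", ["hidden_states", "lm_head.weight_T"], ["logits"])
```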

Changes

Modified Files

  • src/python/py/models/builder.py
  • src/python/py/models/builders/base.py
  • src/python/py/models/README.MD

Key Modifications

  1. Computed flat_dim in a generic manner before feeding into GatherBlockQuantized.
  2. Explicitly defined gather_axis and quantize_axis for clarity (see the node sketch after this list).
  3. Added the shared_embeddings option to tie embed_tokens/lm_head.
  4. Added rtn_last, alongside k_quant_last, as a new mixed-precision option.
  5. Added k_quant, alongside rtn, as a new 4-bit quantizer option.
  6. Removed int4_tied_embeddings and merged it into shared_embeddings.
  7. Added documentation.
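
For reference, here is roughly what the explicit axes look like on the node. This is a hedged sketch built with onnx.helper: the attribute names follow the com.microsoft GatherBlockQuantized contrib op as I understand it, and the tensor names are made up.

```python
# Hypothetical sketch of a GatherBlockQuantized node with explicit axes.
from onnx import helper

node = helper.make_node(
    "GatherBlockQuantized",
    inputs=["embed_tokens.qweight", "input_ids", "embed_tokens.scales"],
    outputs=["inputs_embeds"],
    domain="com.microsoft",
    gather_axis=0,      # gather vocabulary rows by token id
    quantize_axis=1,    # quantization blocks run along the hidden dimension
    block_size=32,
    bits=4,             # 4-bit here; 8-bit is also supported per this PR
)
```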

@jixiongdeng (Contributor, Author) commented:

@kunal-vaishnavi Thanks for the review! I updated this PR per your suggestions. All checks passed. PTAL.

@kunal-vaishnavi kunal-vaishnavi enabled auto-merge (squash) November 26, 2025 02:04
@kunal-vaishnavi kunal-vaishnavi merged commit f748f24 into main Nov 26, 2025
15 checks passed
@kunal-vaishnavi kunal-vaishnavi deleted the jd/shared_emb branch November 26, 2025 02:04
tianleiwu pushed a commit that referenced this pull request Nov 26, 2025
…hoice (#1893)

## Problem
As we discussed in [this PR](#1885), I split the `disable_qkv_fusion` option out into its own PR.

The current model builder fuses q_proj, k_proj, and v_proj into a single qkv_proj by default, and this behavior cannot be controlled by an upstream quantization choice.

## Solution

Added `disable_qkv_fusion` to `extra_options` to override `attention_attrs["use_packed_matmul"]`; a minimal sketch of the toggle follows below.
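
A minimal sketch of the intended toggle, assuming string-valued `extra_options` and a boolean default heuristic; the helper name and the parsing are my assumptions, not the builder's actual code:

```python
# Hypothetical sketch: let the disable_qkv_fusion extra_option win over the
# default packed-QKV heuristic when setting use_packed_matmul.
def resolve_use_packed_matmul(extra_options: dict, default_packed: bool) -> bool:
    disable = str(extra_options.get("disable_qkv_fusion", "false")).lower() == "true"
    return default_packed and not disable

attention_attrs = {
    "use_packed_matmul": resolve_use_packed_matmul(
        {"disable_qkv_fusion": "true"}, default_packed=True
    )
}
print(attention_attrs["use_packed_matmul"])  # False
```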

Running examples:

**untied qkv_projs for 4-bit rtn on Llama-3.2-3B-Instruct**:
```
python src/python/py/models/builder.py -m meta-llama/Llama-3.2-3B-Instruct -p int4 -e cuda -o export_model/llama32_3bi_rtn_u4_untied_qkv --extra_options int4_algo_config=rtn disable_qkv_fusion=true
```

## Changes

### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Added `disable_qkv_fusion` to the assignment logic for `attention_attrs["use_packed_matmul"]`.
2. Added documentation.

kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025
kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025