[CUDA] Correct after_gather_dim for nibbled uint4 index #26484
Conversation
Pull Request Overview
This PR fixes the kernel indexing calculation for packed uint8_t data with bits < 8 in the GatherBlockQuantized operation. When sub-byte quantization is used (e.g., 4-bit values packed into uint8_t), the output dimensions are expanded to account for unpacking, but the after_gather_dim parameter passed to the kernel was not adjusted accordingly, leading to incorrect indexing.
- Introduced a calculation for `after_gather_dim_unpacked` that accounts for packed data expansion when using sub-8-bit quantization with uint8_t (see the sketch below)
- Updated the kernel parameter to use the unpacked dimension value for correct indexing in the CUDA kernel
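For illustration, the host-side adjustment described above boils down to arithmetic along these lines. This is a minimal sketch assuming an unpack factor of `8 / bits` for uint8_t storage; the function and variable names are not taken from the actual onnxruntime source.

```cpp
#include <cstdint>
#include <iostream>

// Minimal sketch (assumed names, not the onnxruntime implementation).
// When the stored type is uint8_t and bits < 8, each byte holds 8 / bits
// quantized values, so the dimensions after the gather axis are larger in
// the unpacked output than in the packed input.
int64_t AfterGatherDimUnpacked(int64_t after_gather_dim, int64_t bits, bool is_uint8) {
  if (is_uint8 && bits < 8) {
    return after_gather_dim * (8 / bits);  // e.g. x2 for 4-bit nibbles
  }
  return after_gather_dim;  // no expansion for 8-bit or non-packed types
}

int main() {
  // Example: 512 packed bytes per gathered row with 4-bit values
  // correspond to 1024 unpacked output elements.
  std::cout << AfterGatherDimUnpacked(512, 4, true) << "\n";  // prints 1024
}
```

Passing the unpacked value to the kernel keeps the output-side index decomposition consistent with the expanded output shape.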
Comments suppressed due to low confidence (1)
onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cc:1
- The kernel uses `after_gather_dim` (unpacked value) for indexing into the output, but constructs `in_idx` for the input data, which is still packed. When T1 is uint8_t with bits < 8, the input data is packed, so `in_idx` should be computed using the original packed `after_gather_dim` value, not the unpacked one. This mismatch could cause incorrect memory access when reading from the packed input data.
// Copyright (c) Microsoft Corporation. All rights reserved.
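To make the concern above concrete, here is a simplified, standalone sketch of the index math involved (the layout and all names are assumptions, not the actual kernel): the output index is decomposed with the unpacked inner dimension, while the read from the packed input has to fold the packing factor back in.

```cpp
#include <cstdint>
#include <cstdio>

// Simplified sketch of gather indexing over 4-bit values packed into uint8_t
// (assumed layout; not the onnxruntime kernel). 'components' is 8 / bits.
void GatherIndexSketch(int64_t gathered_row, int64_t out_idx,
                       int64_t after_gather_dim_unpacked, int64_t components) {
  // Output side: decompose with the unpacked inner dimension.
  int64_t inner = out_idx % after_gather_dim_unpacked;

  // Input side: the data is still packed, so the row stride is the packed
  // dimension and the inner coordinate maps to a byte plus a nibble offset.
  int64_t after_gather_dim_packed = after_gather_dim_unpacked / components;
  int64_t in_byte = gathered_row * after_gather_dim_packed + inner / components;
  int64_t nibble = inner % components;  // which half of the byte for 4-bit

  std::printf("input byte %lld, nibble %lld\n",
              static_cast<long long>(in_byte), static_cast<long long>(nibble));
}

int main() {
  // Row 3, output element 5, with 8 unpacked values (4 packed bytes) per row.
  GatherIndexSketch(3, 5, 8, 2);  // prints "input byte 14, nibble 1"
}
```

Using the unpacked dimension for the output index while deriving a packed byte offset for the input read is exactly the distinction the suppressed comment calls out.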
Please add a test case for this. I noticed that CUDA has disabled a test case for 4 bits.
For reference, here is AI's analysis of this code change: Looks good. This change is a necessary fix for correct indexing when using `GatherBlockQuantized` with sub-byte packed uint8_t data. Here's a breakdown of the review:

**Summary of the Change**

This PR modifies the `GatherBlockQuantized` CUDA implementation so that the dimension passed to the kernel reflects the unpacked output. Specifically, if the input data type is uint8_t with bits < 8, the output dimensions are expanded during unpacking, and the kernel parameter must be expanded by the same factor.

**Analysis**
Conclusion: This is a correct and necessary fix. The change is clear, well-commented, and aligns with the existing logic for handling packed data.
@tianleiwu @kunal-vaishnavi Added test cases. Please check.
Related tests were built successfully and passed.
## Problem

This is a refactored PR from [previous shared emb PR](#1854). The current model builder doesn't support shared embedding layers with 4-bit qweights and 16-bit float weights, which occupies more room on disk (unnecessary for models with originally tied embeddings) and hurts the compression rate of quantized models. builder.py also doesn't provide flexible options to toggle the graph construction and quantization config, like rtn, kquant, etc.

## Solution

Calculated flat_dim in a more generic way on the reshape node before `GatherBlockQuantized` (supporting 4-bit and 8-bit). Added CUDA kernel support in ORT [#26484](microsoft/onnxruntime#26484). Added more extra_options to enable different quant configs, pack options, and shared embeddings.

Running examples:

**shared 4-bit k_quant on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant
```

**shared 16-bit float emb on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```

## Changes

### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Computed `flat_dim` in a generic manner before feeding into `GatherBlockQuantized` (see the sketch after this list).
2. Explicitly defined gather_axis and quantize_axis for clarity.
3. Added a `shared_embeddings` option to tie embed_tokens/lm_head.
4. Added `rtn_last`, like `k_quant_last`, as a new mixed-precision option.
5. Added `k_quant`, like `rtn`, as a new 4-bit quantizer option.
6. Removed `int4_tied_embeddings` and merged it into `shared_embeddings`.
7. Added documentation.
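As a rough illustration of the generic flat_dim packing arithmetic mentioned in item 1 above: the flattened packed size is the value count scaled by bits / 8, which covers both 4-bit and 8-bit storage. The real builder code is Python; the helper below is a hypothetical C++ rendering and its name is an assumption.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper (name assumed): bytes needed to store num_values
// quantized values at the given bit width, rounded up to whole bytes.
int64_t PackedFlatDim(int64_t num_values, int64_t bits) {
  return (num_values * bits + 7) / 8;
}

int main() {
  std::cout << PackedFlatDim(3072, 4) << "\n";  // 1536 bytes for 4-bit
  std::cout << PackedFlatDim(3072, 8) << "\n";  // 3072 bytes for 8-bit
}
```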
Description
Currently, after_gather_dim in the CUDA backend only handles the plain uint8 dtype and does not account for sub-byte packing.
This PR ensures the indexing is computed correctly in gather_block_quantized with nibbled 4-bit weights.
Motivation and Context
This allows token_embeddings and lm_head to be tied with 4-bit weights, which saves disk space and compresses models further.