[CUDA] Correct after_gather_dim for nibbled uint4 index #26484
Conversation
Pull Request Overview
This PR fixes the kernel indexing calculation for packed uint8_t data with bits < 8 in the GatherBlockQuantized operation. When sub-byte quantization is used (e.g., 4-bit values packed into uint8_t), the output dimensions are expanded to account for unpacking, but the after_gather_dim parameter passed to the kernel was not adjusted accordingly, leading to incorrect indexing.
- Introduced a calculation for `after_gather_dim_unpacked` that accounts for packed data expansion when using sub-8-bit quantization with uint8_t (see the sketch below)
- Updated the kernel parameter to use the unpacked dimension value for correct indexing in the CUDA kernel
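For illustration, the host-side adjustment described above boils down to arithmetic along these lines. This is a minimal sketch assuming an unpack factor of `8 / bits` for uint8_t storage; the function and variable names are not taken from the actual onnxruntime source.

```cpp
#include <cstdint>
#include <iostream>

// Minimal sketch (assumed names, not the onnxruntime implementation).
// When the stored type is uint8_t and bits < 8, each byte holds 8 / bits
// quantized values, so the dimensions after the gather axis are larger in
// the unpacked output than in the packed input.
int64_t AfterGatherDimUnpacked(int64_t after_gather_dim, int64_t bits, bool is_uint8) {
  if (is_uint8 && bits < 8) {
    return after_gather_dim * (8 / bits);  // e.g. x2 for 4-bit nibbles
  }
  return after_gather_dim;  // no expansion for 8-bit or non-packed types
}

int main() {
  // Example: 512 packed bytes per gathered row with 4-bit values
  // correspond to 1024 unpacked output elements.
  std::cout << AfterGatherDimUnpacked(512, 4, true) << "\n";  // prints 1024
}
```

Passing the unpacked value to the kernel keeps the output-side index decomposition consistent with the expanded output shape.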
Comments suppressed due to low confidence (1)
onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cc:1
- The kernel uses `after_gather_dim` (unpacked value) for indexing into the output, but constructs `in_idx` for the input data, which is still packed. When T1 is uint8_t with bits < 8, the input data is packed, so `in_idx` should be computed using the original packed `after_gather_dim` value, not the unpacked one. This mismatch could cause incorrect memory access when reading from the packed input data.
// Copyright (c) Microsoft Corporation. All rights reserved.
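To make the concern above concrete, here is a simplified, standalone sketch of the index math involved (the layout and all names are assumptions, not the actual kernel): the output index is decomposed with the unpacked inner dimension, while the read from the packed input has to fold the packing factor back in.

```cpp
#include <cstdint>
#include <cstdio>

// Simplified sketch of gather indexing over 4-bit values packed into uint8_t
// (assumed layout; not the onnxruntime kernel). 'components' is 8 / bits.
void GatherIndexSketch(int64_t gathered_row, int64_t out_idx,
                       int64_t after_gather_dim_unpacked, int64_t components) {
  // Output side: decompose with the unpacked inner dimension.
  int64_t inner = out_idx % after_gather_dim_unpacked;

  // Input side: the data is still packed, so the row stride is the packed
  // dimension and the inner coordinate maps to a byte plus a nibble offset.
  int64_t after_gather_dim_packed = after_gather_dim_unpacked / components;
  int64_t in_byte = gathered_row * after_gather_dim_packed + inner / components;
  int64_t nibble = inner % components;  // which half of the byte for 4-bit

  std::printf("input byte %lld, nibble %lld\n",
              static_cast<long long>(in_byte), static_cast<long long>(nibble));
}

int main() {
  // Row 3, output element 5, with 8 unpacked values (4 packed bytes) per row.
  GatherIndexSketch(3, 5, 8, 2);  // prints "input byte 14, nibble 1"
}
```

Using the unpacked dimension for the output index while deriving a packed byte offset for the input read is exactly the distinction the suppressed comment calls out.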
Please add a test case for this. I noticed that CUDA has disabled a test case for 4 bits.
For reference, here is AI's analysis of this code change: Looks good. This change is a necessary fix for correct indexing when using `GatherBlockQuantized` with sub-byte packed uint8_t data. Here's a breakdown of the review:

**Summary of the Change**

This PR modifies the `GatherBlockQuantized` CUDA implementation so that the dimension passed to the kernel reflects the unpacked output. Specifically, if the input data type is uint8_t with bits < 8, the output dimensions are expanded during unpacking, and the kernel parameter must be expanded by the same factor.

**Analysis**
Conclusion: This is a correct and necessary fix. The change is clear, well-commented, and aligns with the existing logic for handling packed data.
@tianleiwu @kunal-vaishnavi Added test cases. Please check.
Related tests were built successfully and passed.
## Problem

This is a refactored PR from [previous shared emb PR](#1854). The current model builder doesn't support shared embedding layers with 4-bit qweights and 16-bit float weights, which occupies more room on disk (unnecessary for models with originally tied embeddings) and hurts the compression rate of quantized models. builder.py also doesn't provide flexible options to toggle the graph construction and quantization config, like rtn, kquant, etc.

## Solution

Calculated flat_dim in a more generic way on the reshape node before `GatherBlockQuantized` (supporting 4-bit and 8-bit). Added CUDA kernel support in ORT [#26484](microsoft/onnxruntime#26484). Added more extra_options to enable different quant configs, pack options, and shared embeddings.

Running examples:

**shared 4-bit k_quant on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false int4_algo_config=k_quant
```

**shared 16-bit float emb on Phi-4-Mini Instruct**:
```
python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p fp16 -e cuda -o export_model/phi4mini_i_fp16_tied --extra_options shared_embeddings=true
```

## Changes

### Modified Files
- `src/python/py/models/builder.py`
- `src/python/py/models/builders/base.py`
- `src/python/py/models/README.MD`

### Key Modifications
1. Computed `flat_dim` in a generic manner before feeding into `GatherBlockQuantized` (see the sketch after this list).
2. Explicitly defined gather_axis and quantize_axis for clarity.
3. Added a `shared_embeddings` option to tie embed_tokens/lm_head.
4. Added `rtn_last`, like `k_quant_last`, as a new mixed-precision option.
5. Added `k_quant`, like `rtn`, as a new 4-bit quantizer option.
6. Removed `int4_tied_embeddings` and merged it into `shared_embeddings`.
7. Added documentation.
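As a rough illustration of the generic flat_dim packing arithmetic mentioned in item 1 above: the flattened packed size is the value count scaled by bits / 8, which covers both 4-bit and 8-bit storage. The real builder code is Python; the helper below is a hypothetical C++ rendering and its name is an assumption.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper (name assumed): bytes needed to store num_values
// quantized values at the given bit width, rounded up to whole bytes.
int64_t PackedFlatDim(int64_t num_values, int64_t bits) {
  return (num_values * bits + 7) / 8;
}

int main() {
  std::cout << PackedFlatDim(3072, 4) << "\n";  // 1536 bytes for 4-bit
  std::cout << PackedFlatDim(3072, 8) << "\n";  // 3072 bytes for 8-bit
}
```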
Description
Currently, after_gather_dim in the CUDA backend only handles the plain uint8 dtype and does not account for sub-byte packing.
This PR ensures the indexing is computed correctly in gather_block_quantized with nibbled 4-bit weights.
Motivation and Context
This allows token_embeddings and lm_head to be tied with 4-bit weights, which saves disk space and compresses models further.