Adds int4 Quantization Support #21435
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21435 +/- ##
==========================================
+ Coverage 74.94% 82.80% +7.86%
==========================================
Files 565 565
Lines 55224 55505 +281
Branches 8610 8662 +52
==========================================
+ Hits 41386 45962 +4576
+ Misses 11880 7429 -4451
- Partials 1958 2114 +156
Thanks for the PR! The code generally looks good to me. What is the performance profile? How did you benchmark the change?
I hadn't benchmarked the code yet. I've now created two micro-benchmarks and linked them in the PR description; please take a look!
There was an issue with the original benchmarking script. I've fixed it, and we're now seeing significantly better results for GPU memory usage.
We can expect further speedups once we support quantization in more layers.
Thanks for the PR! The code looks good. The new results look reasonable.
/gemini review
Code Review
This pull request introduces support for int4 quantization in the Dense layer, including packing/unpacking utilities and LoRA compatibility. The changes are well-structured, and the addition of int4 quantization is a valuable enhancement.
Original PR #21435 by JyotinderSingh: keras-team/keras#21435


Summary
This PR introduces support for int4 weight-only quantization for the Dense layer. The implementation includes the necessary logic for packing and unpacking int4 values, performing the quantized matrix multiplication, and ensuring compatibility with features like LoRA. The code currently implements a W4A8 quantization scheme (int4 weights, int8 activations).
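For illustration, here is a minimal usage sketch. It assumes the new mode plugs into the existing quantization entry points the same way "int8" does (the model-level quantize call and the random data are illustrative, not part of this PR's diff):

```python
import numpy as np
import keras

# A small model whose Dense layers will be quantized post-training.
model = keras.Sequential([
    keras.Input(shape=(64,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10),
])

# Weight-only int4 quantization (W4A8): float kernels are quantized to int4
# and stored packed, two 4-bit values per int8 byte.
model.quantize("int4")

preds = model(np.random.rand(2, 64).astype("float32"))
print(preds.shape)  # (2, 10)
```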
Description
The core changes include:
- Support for an int4 quantization mode.
- Packing and unpacking utilities (a NumPy sketch of the scheme follows this list):
  - pack_int4 takes an int8 tensor (representing int4 values) and packs two 4-bit values into a single int8 byte.
  - unpack_int4 performs the reverse operation, unpacking the int8 tensor back into an int8 tensor of int4 values.
- Dense layer modifications:
  - _int4_build: builds a packed kernel of int8 dtype and a kernel_scale variable. The original input dimension is saved in _orig_input_dim to handle unpacking correctly.
  - _int4_call: defines the forward pass for the int4-quantized layer. It uses a custom_gradient to perform the matrix multiplication with the unpacked kernel and correctly computes the gradients with respect to the original inputs.
  - The quantize method now handles mode="int4": it quantizes the float weights to int4 values and then packs them using pack_int4.
  - The enable_lora method determines the input dimension for the LoRA matrices of an int4-quantized layer by using the saved _orig_input_dim.
  - The _get_kernel_with_merged_lora method unpacks the int4 kernel before merging the LoRA weights, followed by re-quantization and re-packing.
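To make the packing scheme concrete, here is a small NumPy sketch of the idea behind pack_int4/unpack_int4, followed by the quantize-then-pack step the quantize method performs. The bit layout (even positions in the low nibble), the packing axis, and the per-output-channel abs-max scaling are illustrative assumptions; the actual utilities take an axis argument and may differ in detail:

```python
import numpy as np

def pack_int4_sketch(x):
    """Pack int4 values (held in an int8 array, range [-8, 7]) along the last axis."""
    if x.shape[-1] % 2:  # pad to an even length so values pair up
        x = np.concatenate([x, np.zeros_like(x[..., :1])], axis=-1)
    low = (x[..., 0::2] & 0x0F).astype(np.uint8)   # even positions -> low nibble
    high = (x[..., 1::2] & 0x0F).astype(np.uint8)  # odd positions  -> high nibble
    return ((high << 4) | low).view(np.int8)       # two values per int8 byte

def unpack_int4_sketch(packed, orig_len):
    """Recover the int4 values (as an int8 array) from their packed form."""
    u = packed.view(np.uint8)
    low = (u & 0x0F).astype(np.int8)
    high = (u >> 4).astype(np.int8)
    # Sign-extend: nibble values 8..15 encode the negative int4 range.
    low = np.where(low >= 8, low - 16, low)
    high = np.where(high >= 8, high - 16, high)
    out = np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)
    return out[..., :orig_len].astype(np.int8)

# Round-trip check.
vals = np.array([[-8, 7, 3, -1]], dtype=np.int8)
assert np.array_equal(unpack_int4_sketch(pack_int4_sketch(vals), 4), vals)

# Quantize-then-pack, conceptually what Dense.quantize("int4") does:
# per-output-column abs-max scale, round to int4, then pack along the input axis.
kernel = np.random.randn(6, 4).astype("float32")        # (input_dim, units)
kernel_scale = np.max(np.abs(kernel), axis=0) / 7.0     # (units,)
kernel_int4 = np.clip(np.round(kernel / kernel_scale), -8, 7).astype(np.int8)
packed_kernel = pack_int4_sketch(kernel_int4.T).T       # (ceil(input_dim / 2), units)
```

Packing along the input axis halves the stored kernel's footprint; saving _orig_input_dim is presumably what lets the layer drop the padding value again when the original input dimension is odd.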
Testing
- Tests for int4 quantization in dense_test.py cover basic correctness, serialization (saving/loading models), behavior with LoRA enabled, and various edge cases.
- Tests for the pack_int4 and unpack_int4 functions in quantizers_test.py ensure they work correctly for various tensor shapes and axes.
Benchmarking
Note: Results collected with warmed-up GPUs and pre-loaded models and kernels.
Limitation
The current implementation performs a kernel unpack on every forward pass (to expand the int4 kernel from its packed int8 representation, where each byte stores two nibbles). This means we give up some of the memory savings at runtime and incur a performance penalty.
We may be able to work around this in the future by writing custom kernels that operate directly on the packed int4 representation.
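For reference, a rough sketch of what the W4A8 forward pass amounts to, showing where the per-call unpack cost comes from. The names, the axis-0 packing layout, and the per-row abs-max activation quantization are illustrative assumptions, not the PR's exact code (which lives in Dense._int4_call and wraps the matmul in a custom_gradient for training):

```python
import numpy as np

def unpack_rows_sketch(packed, orig_rows):
    """Unpack an int8 kernel packed two int4 values per byte along axis 0."""
    u = packed.view(np.uint8)
    low = (u & 0x0F).astype(np.int8)
    high = (u >> 4).astype(np.int8)
    # Sign-extend nibbles 8..15 back into the negative int4 range.
    low = np.where(low >= 8, low - 16, low).astype(np.int8)
    high = np.where(high >= 8, high - 16, high).astype(np.int8)
    full = np.empty((2 * packed.shape[0], packed.shape[1]), dtype=np.int8)
    full[0::2], full[1::2] = low, high
    return full[:orig_rows]

def int4_dense_forward_sketch(inputs, packed_kernel, kernel_scale, orig_input_dim):
    # 1. The packed kernel is expanded to a full-width int8 kernel on every
    #    call -- this is the runtime overhead described above.
    kernel = unpack_rows_sketch(packed_kernel, orig_input_dim)
    # 2. W4A8: dynamically quantize the activations to int8 (per-row abs-max).
    inputs_scale = np.maximum(np.max(np.abs(inputs), axis=-1, keepdims=True), 1e-7) / 127.0
    inputs_q = np.round(inputs / inputs_scale).astype(np.int8)
    # 3. Integer matmul, then de-quantize the accumulator back to float.
    acc = inputs_q.astype(np.int32) @ kernel.astype(np.int32)
    return acc.astype(np.float32) * inputs_scale * kernel_scale

# Shapes only, with random packed weights: packed (ceil(6 / 2), 4), scale (4,).
packed = np.random.randint(-128, 128, size=(3, 4), dtype=np.int8)
scale = np.full((4,), 0.05, dtype=np.float32)
x = np.random.randn(2, 6).astype("float32")
print(int4_dense_forward_sketch(x, packed, scale, orig_input_dim=6).shape)  # (2, 4)
```

Custom kernels that consume the packed representation directly, as suggested above, would remove step 1 and keep the memory footprint at the packed size for the whole call.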
Further work