@JyotinderSingh JyotinderSingh commented Jun 29, 2025

Summary

This PR introduces support for int4 weight-only quantization for the Dense layer. The implementation includes the necessary logic for packing and unpacking int4 values, performing the quantized matrix multiplication, and ensuring compatibility with features like LoRA.

The code currently implements the W4A8 quantization scheme (4-bit weights, 8-bit activations).

Description

The core changes include:

  • Support for int4 quantization mode.

  • Packing and Unpacking Utilities:

    • pack_int4 takes an int8 tensor (representing int4 values) and packs two 4-bit values into a single int8 byte.
    • unpack_int4 performs the reverse operation, recovering an int8 tensor of int4 values from the packed representation.
  • Dense Layer Modifications:

    • _int4_build: Builds a packed kernel of int8 dtype and a kernel_scale variable. The original input dimension is saved in _orig_input_dim to handle unpacking correctly.
    • _int4_call: Defines the forward pass for the int4 quantized layer. It uses a custom_gradient to perform the matrix multiplication with the unpacked kernel and correctly computes the gradients with respect to the original inputs.
    • The quantize method now handles mode="int4". It quantizes the float weights to int4 values and then packs them using pack_int4.
    • LoRA Compatibility:
      • The enable_lora method correctly determines the input dimension for the LoRA matrices when the layer is int4 quantized by using the saved _orig_input_dim.
      • The _get_kernel_with_merged_lora method handles the unpacking of the int4 kernel before merging the LoRA weights, followed by re-quantization and re-packing.
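The quantize-then-pack flow described above can be sketched in NumPy. This is an illustrative sketch, not the actual Keras implementation: the function names mirror the PR's utilities, but the abs-max quantization choice and the even-sized last axis are simplifying assumptions.

```python
import numpy as np

def quantize_to_int4(weights):
    # Abs-max per-output-column quantization to the signed int4 range
    # [-8, 7]; illustrative only, not the exact Keras routine.
    amax = np.maximum(np.max(np.abs(weights), axis=0, keepdims=True), 1e-8)
    scale = amax / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_int4(values):
    # Pack pairs of int4 values (stored in int8) into single bytes:
    # even indices fill the low nibble, odd indices the high nibble.
    # Assumes an even-sized last axis for brevity.
    v = values.astype(np.int32)
    low = v[..., 0::2] & 0x0F
    high = (v[..., 1::2] & 0x0F) << 4
    return (high | low).astype(np.uint8).view(np.int8)

def unpack_int4(packed):
    # Reverse of pack_int4: split each byte into two nibbles and
    # sign-extend them back to signed int4 values stored in int8.
    p = packed.view(np.uint8).astype(np.int32)
    low = ((p & 0x0F) ^ 8) - 8
    high = ((p >> 4) ^ 8) - 8
    out = np.empty(packed.shape[:-1] + (2 * packed.shape[-1],), dtype=np.int8)
    out[..., 0::2] = low
    out[..., 1::2] = high
    return out
```

The `(x ^ 8) - 8` trick sign-extends a 4-bit two's-complement nibble, so the unpack round-trips exactly with the pack.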

Testing

  • Added tests for int4 quantization in dense_test.py. These tests cover basic correctness, serialization (saving/loading models), behavior with LoRA enabled, and various edge cases.
  • Added unit tests for the pack_int4 and unpack_int4 functions in quantizers_test.py to ensure they work correctly for various tensor shapes and axes.

Benchmarking

Note: Results collected with warmed-up GPUs and pre-loaded models and kernels.

  1. Text Generation Micro-Benchmark with OPT 125M using KerasHub: colab notebook
     (results shown as a screenshot in the original PR)
  2. Classifier Micro-Benchmark with DistilBERT (fine-tuned on SST2) using KerasHub: colab notebook
     (results shown as a screenshot in the original PR)

Limitation

The current implementation performs a kernel unpack on every forward pass (to unpack the int4 kernel from its packed int8 representation, where each byte stores two nibbles). This means we lose some of the memory savings at runtime and incur a performance penalty.

We may be able to work around this in the future by writing custom kernels which operate directly on the packed int4 representation.

Further work

  1. Exploring calibration methods discussed in the AWQ (Activation-aware Weight Quantization) and GPTQ papers, which could potentially be used to expose new APIs that allow better inference performance.

@JyotinderSingh JyotinderSingh changed the title [DRAFT] int4 quantization support [DRAFT] Add int4 Quantization Support to Dense Layers and DType Policies Jun 29, 2025

codecov-commenter commented Jun 29, 2025

Codecov Report

Attention: Patch coverage is 90.57971% with 13 lines in your changes missing coverage. Please review.

Project coverage is 82.80%. Comparing base (744b8be) to head (98fa1ed).
Report is 19 commits behind head on master.

Files with missing lines                            Patch %   Lines
keras/src/layers/core/dense.py                      89.55%    2 Missing and 5 partials ⚠️
keras/src/quantizers/quantizers.py                  93.22%    2 Missing and 2 partials ⚠️
keras/api/_tf_keras/keras/quantizers/__init__.py    0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21435      +/-   ##
==========================================
+ Coverage   74.94%   82.80%   +7.86%     
==========================================
  Files         565      565              
  Lines       55224    55505     +281     
  Branches     8610     8662      +52     
==========================================
+ Hits        41386    45962    +4576     
+ Misses      11880     7429    -4451     
- Partials     1958     2114     +156     
Flag                Coverage Δ
keras               82.61% <90.57%> (+7.84%) ⬆️
keras-jax           63.39% <87.68%> (+0.06%) ⬆️
keras-numpy         58.60% <73.18%> (?)
keras-openvino      33.71% <8.69%> (?)
keras-tensorflow    63.84% <90.57%> (+0.10%) ⬆️
keras-torch         63.51% <87.68%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.


@JyotinderSingh JyotinderSingh changed the title [DRAFT] Add int4 Quantization Support to Dense Layers and DType Policies [DRAFT] Add int4 Quantization Support to Dense Layer Jun 29, 2025
@gbaned gbaned requested a review from mattdangerw June 30, 2025 08:18
@gbaned gbaned added this to PR Queue Jun 30, 2025
@github-project-automation github-project-automation bot moved this to Assigned Reviewer in PR Queue Jun 30, 2025
@JyotinderSingh JyotinderSingh changed the title [DRAFT] Add int4 Quantization Support to Dense Layer [DRAFT - DO NOT REVIEW] Add int4 Quantization Support to Dense Layer Jun 30, 2025
@JyotinderSingh JyotinderSingh changed the title [DRAFT - DO NOT REVIEW] Add int4 Quantization Support to Dense Layer [DRAFT] Add int4 Quantization Support to Dense Layer Jun 30, 2025
@JyotinderSingh JyotinderSingh changed the title [DRAFT] Add int4 Quantization Support to Dense Layer Add int4 Quantization Support to Dense Layer Jul 1, 2025
@JyotinderSingh JyotinderSingh changed the title Add int4 Quantization Support to Dense Layer Add int4 Quantization Support Jul 1, 2025
@fchollet (Collaborator) left a comment:
Thanks for the PR! The code generally looks good to me. What is the performance profile? How did you benchmark the change?

@JyotinderSingh (Collaborator, Author) replied:

> Thanks for the PR! The code generally looks good to me. What is the performance profile? How did you benchmark the change?

I hadn't yet benchmarked the code. I've now created two micro-benchmarks and have linked them in the PR description, please take a look!


JyotinderSingh commented Jul 10, 2025

There was some issue with the original benchmarking script. I've fixed it, and now we're seeing significantly better results for GPU memory usage.

  1. Text Generation Micro-Benchmark with OPT 125M using KerasHub: colab notebook
     (chart: opt_tf_gpu)
  2. Classifier Micro-Benchmark with DistilBERT (fine-tuned on SST2) using KerasHub: colab notebook
     (chart: distilbert_tf_gpu)

We can expect further speedups once we support quantization in more layers.

@fchollet (Collaborator) left a comment:
Thanks for the PR! The code looks good. The new results look reasonable.

@JyotinderSingh JyotinderSingh requested a review from fchollet July 10, 2025 23:37
@divyashreepathihalli (Collaborator) commented:
/gemini review

@gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request introduces support for int4 quantization in the Dense layer, including packing/unpacking utilities and LoRA compatibility. The changes are well-structured, and the addition of int4 quantization is a valuable enhancement.

@fchollet fchollet merged commit 89d953e into keras-team:master Jul 11, 2025
10 checks passed
@JyotinderSingh JyotinderSingh deleted the int4_quantization branch July 11, 2025 17:40