Add codebook (look up table based) quantization flow in torchao #1195

Open
jerryzh168 opened this issue Oct 29, 2024 · 14 comments
Labels: good first issue (Good for newcomers)

Comments

@jerryzh168
Contributor

jerryzh168 commented Oct 29, 2024

Similar to affine quantization, we can implement codebook (look-up table based) quantization, which is another popular type of quantization, especially at lower bit widths like 4 bits and below (used in https://github.com/Vahe1994/AQLM, https://arxiv.org/abs/2402.04396, etc.). We can start with post-training quantization and use k-means clustering to find the codebook / lookup table. You can check out #391 for the overall structure of the torchao stack. Reference code for k-means can be found here.
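
As a rough illustration of that flow, here is a minimal k-means sketch that builds a single scalar codebook per tensor (AQLM-style codebooks use vector entries instead); the helper name and defaults are illustrative assumptions, not a proposed torchao API:

```python
import torch

def kmeans_codebook(weight: torch.Tensor, num_centroids: int = 16, iters: int = 10):
    """Toy k-means: returns (codebook of shape [num_centroids], indices with weight's shape)."""
    flat = weight.reshape(-1).float()
    # initialize centroids from quantiles so they cover the value range
    codebook = torch.quantile(flat, torch.linspace(0, 1, num_centroids, device=flat.device))
    for _ in range(iters):
        # assignment step: nearest centroid for every weight value
        indices = (flat[:, None] - codebook[None, :]).abs().argmin(dim=1)
        # update step: each centroid becomes the mean of its assigned values
        for k in range(num_centroids):
            members = flat[indices == k]
            if members.numel() > 0:
                codebook[k] = members.mean()
    # final assignment against the refined codebook
    indices = (flat[:, None] - codebook[None, :]).abs().argmin(dim=1)
    return codebook, indices.reshape(weight.shape)

# usage: codebook, indices = kmeans_codebook(linear.weight); w_hat = codebook[indices]
```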

After this, we can add support for the more advanced algorithms mentioned above.

API

quantize_(model, codebook_weight_only(dtype=torch.uint4))

Implementation details:

  • [PR1] Ops
    • quantize_codebook(tensor, codebook)
    • dequantize_codebook(tensor, codebook)
  • [PR2] Tensor Subclass
    • CodebookQuantizedTensor (similar to AffineQuantizedTensor)
      • clustering algorithm can be implemented in from_float function

We still need to flesh out the details of the args etc., but that can be done in the PR. I'd suggest adding things gradually and gathering feedback.
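
To make the intended shape a bit more concrete, here is a rough sketch of the two ops and the tensor subclass; all signatures are placeholder assumptions to be settled in the PRs (group/block sizes, packing, and index dtype are left out, and the real class would subclass torch.Tensor the way AffineQuantizedTensor does):

```python
import torch

def quantize_codebook(tensor: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # map each element to the index of its nearest codebook entry
    distances = (tensor.reshape(-1, 1) - codebook.reshape(1, -1)).abs()
    return distances.argmin(dim=1).reshape(tensor.shape)

def dequantize_codebook(indices: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # dequantization is a table lookup
    return codebook[indices]

class CodebookQuantizedTensor:
    def __init__(self, indices: torch.Tensor, codebook: torch.Tensor):
        self.indices = indices
        self.codebook = codebook

    @classmethod
    def from_float(cls, weight: torch.Tensor, num_centroids: int = 16):
        # the clustering algorithm lives here; a quantile-based codebook is a crude
        # stand-in for the k-means refinement sketched in the issue description
        codebook = torch.quantile(
            weight.reshape(-1).float(),
            torch.linspace(0, 1, num_centroids, device=weight.device),
        )
        return cls(quantize_codebook(weight, codebook), codebook)

    def dequantize(self) -> torch.Tensor:
        return dequantize_codebook(self.indices, self.codebook)
```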

Code Location: add a codebook folder under https://github.com/pytorch/ao/tree/main/torchao/prototype/quantization

@malinjawi

Hey @jerryzh168 I am new to torchao but this sounds like an issue I would want to investigate with my partner @Harthi7. We will take a look and let you know how it goes. Cheers!

@DerekLiu35
Contributor

Hi, I am also new to torchao and would like to work on this issue.

yanbing-j pushed a commit to yanbing-j/ao that referenced this issue Dec 9, 2024

@pawarmanasi07

pawarmanasi07 commented Dec 30, 2024

Hi! I'm interested in contributing to the implementation of the codebook quantization. Would it be helpful if I worked on something like adding test cases? Happy to coordinate with @DerekLiu35 to avoid duplicating effort.

@DerekLiu35
Contributor

DerekLiu35 commented Dec 30, 2024

I'd also be happy to coordinate.
I think the main things to do are adding AQLM support (the tuning part, though I'm not sure why it would be beneficial to have the tuning in torchao rather than just using the AQLM repo and then converting to the torchao representation) and making token generation faster (probably by porting dequantization kernels from AQLM).

@jerryzh168
Contributor Author

I think the two immediate things are adding AQLM support and speedup. Adding AQLM in torchao will be a bit more convenient for users compared to using the AQLM repo and then converting, I think.

@pawarmanasi07

Great! Let me know what I can start with.

@DerekLiu35
Contributor

I'll focus on speeding up token generation; we can coordinate more if @pawarmanasi07 also wants to work on that.

@pawarmanasi07

I can help with that!

@pawarmanasi07

@DerekLiu35 Could you share your thoughts on which aspects of the dequantization kernels from AQLM we should focus on first? We could divide up different parts of the optimization work between us.

@DerekLiu35
Contributor

I think we can focus on the 1x16 group size CUDA kernels and Triton (as a fallback). We could divide the optimization work by one of us focusing on forward pass kernels and the other on backward pass kernels, though I'm not sure why you would need backward pass kernels. We could also split by kernel type, like 1x16 group size and 1x1 group size (there are no reference CUDA kernels in AQLM for the latter). I'm not sure what the best way to divide the work between us is, though. I'll probably start with the 1x16 forward pass kernel.

@pawarmanasi07

Sounds good! I think focusing on the 1x16 group size kernels makes sense as a starting point. I can work on the 1x1 group size kernels while you tackle the 1x16 forward pass implementation.

For the backward pass kernels - you raise a good point about whether they're necessary. Since this is post-training quantization, we likely don't need backward pass optimization unless we're planning to support fine-tuning scenarios?

@pawarmanasi07

pawarmanasi07 commented Jan 1, 2025

Hi @DerekLiu35 and @jerryzh168, to confirm my tasks: I'll be focusing on optimizing dequantization for the 1x1 group size.
This involves:

  • Creating an optimized kernel for 1x1 dequantization

  • Adding benchmarks and tests

  • Integrating with the existing codebase

Meanwhile, Derek will focus on the 1x16 forward pass kernel implementation.
Since there are no reference CUDA kernels in AQLM for 1x1, should I:

  • implement new CUDA kernels for 1x1,
  • use Triton for 1x1, or
  • implement both approaches?

Is this the correct understanding of the work division? I just want to ensure I'm heading in the right direction before starting.

@pawarmanasi07

However, would it make more sense to start with a Triton implementation for 1x1 first (since we need it as a fallback anyway) and then evaluate whether we need a CUDA implementation based on performance?

@DerekLiu35
Contributor

DerekLiu35 commented Jan 1, 2025

Yeah, I think it would make sense to start with the Triton fallback first.
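
For reference, here is a minimal sketch of what such a Triton fallback could look like for a scalar (1x1-style) codebook lookup; the function names and block size are assumptions, and AQLM's 1x16 scheme would gather a 16-element vector per code instead of a single scalar:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def codebook_dequant_kernel(codes_ptr, codebook_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    # load integer codes and gather the matching codebook entries
    codes = tl.load(codes_ptr + offsets, mask=mask, other=0)
    values = tl.load(codebook_ptr + codes, mask=mask)
    tl.store(out_ptr + offsets, values, mask=mask)

def codebook_dequant(codes: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # codes: integer tensor of codebook indices; codebook: 1D tensor of centroid values
    out = torch.empty(codes.shape, dtype=codebook.dtype, device=codes.device)
    n = codes.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    codebook_dequant_kernel[grid](codes, codebook, out, n, BLOCK=1024)
    return out
```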
