4-bit Integer quantisation #27
Conversation
How to run with GPT-J-6B model?
Steps to reproduce:
Due to the quantization changes, I had to transpose a few of the tensors in the model. In order to make it work, you have to convert the original H5 data using the convert-h5-to-ggml.py from this branch.
After you convert the Python model to a ggml model, you can then quantise it and run it. The process is a bit tedious now, but when the implementation is ready, I will upload the quantized models to Hugging Face and it will be easier.
Great. Thank you very much for the explanation. I will do this.
It worked and it's impressive.
@ocordeiro or anyone else, can you upload the ggml weights to HF, bittorrent, etc.?
@tmzt it's here until @ggerganov launches the official version:
I'm not sure I fully understood your spec, but here's an AVX2 decompressor for these blocks:
@Const-me
For the first one, I have this version, but I don't know if it is optimal yet: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L2038-L2113
For the second one, I have a version here: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L1816-L1870
Any advice on the implementation and making it more efficient will be appreciated!
@ggerganov Here's the code. It is tested very little, so there could be bugs, and I have not measured performance.

A couple of general notes.

About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So the Q4_0 block would take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.

Another thing: I don't understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to stay uncompressed FP32 (or at least FP16; upcasting/downcasting vectors is one fast instruction).

Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor. Take a look at how I did that for the hybrid model of Whisper (currently disabled with a macro, but should work). Also, see this answer on Stack Overflow: https://stackoverflow.com/a/75567894/126995 I wrote that answer for a matrix*vector product, but it is possible to use a similar memory layout for matrix*matrix as well.
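To make the interleaved layout concrete, here is a minimal C sketch of such a block, assuming `QK = 32` and a bias-8 encoding of the 4-bit values; the names `block_q4_0` and `q4_0_get` are illustrative, not the actual ggml definitions.

```c
#include <stdint.h>

#define QK 32  // number of weights per block (assumed)

// One interleaved Q4_0 block: a 4-byte f32 scale followed by QK/2 bytes
// of packed 4-bit values - 20 bytes total for QK = 32.
typedef struct {
    float   d;           // scaling factor
    uint8_t qs[QK / 2];  // two 4-bit values per byte
} block_q4_0;

// Reconstruct the i-th weight of a block (scalar reference, not SIMD).
static inline float q4_0_get(const block_q4_0 * b, int i) {
    const uint8_t byte = b->qs[i / 2];
    const int     q    = (i % 2 == 0 ? (byte & 0x0F) : (byte >> 4)) - 8;  // assumed bias-8 encoding
    return b->d * (float)q;
}
```

With this layout, the scale and the values of a block sit next to each other in memory, so a dot-product kernel touches one contiguous 20-byte chunk per block.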
Thank you so much - you are the best! I just added AVX2 support to
Already did that today in the
The idea is to reduce memory bandwidth. I think the computation becomes memory-bound when running on many cores, so it is more important to reduce the data size than to optimise the calculations. I could be wrong ..
I know! I started doing this with very little knowledge about GEMM and I am sure there is a lot of room for improvement. Edit: fixed wrong quotes
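For a rough sense of the bandwidth argument, here is a back-of-the-envelope comparison of row sizes, assuming `QK = 32`, the 20-byte block layout discussed above, and an illustrative row length of 4096 elements.

```c
#include <stdio.h>

int main(void) {
    const int n  = 4096;  // elements per row (illustrative figure)
    const int qk = 32;    // block size (assumed)

    const int bytes_f32  = n * 4;          // 16384 bytes
    const int bytes_f16  = n * 2;          //  8192 bytes
    const int bytes_q4_0 = (n / qk) * 20;  //  2560 bytes (f32 scale + 16 packed bytes per block)

    printf("f32 : %d bytes per row\n", bytes_f32);
    printf("f16 : %d bytes per row\n", bytes_f16);
    printf("q4_0: %d bytes per row\n", bytes_q4_0);
    return 0;
}
```

By this count a quantised row moves roughly 6x less data than f32 and about 3x less than f16.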
@ggerganov About the compression for intermediate tensors, I've made another function if you want to try: dotProduct_q40_f16. I'm not sure what you'll find, but it's possible FP16 intermediates might be slightly faster than Q4 compressed. That block compression is slower than downcasting floats to FP16, and processors often have many megabytes of L3 cache - for example, my processor has 16 MB. The intermediate tensors which were just computed from something else might still be in that cache.
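As a point of reference for what such a mixed kernel computes, here is a plain scalar sketch of a dot product between one Q4_0-compressed row and an uncompressed row (using F32 instead of F16 to keep it self-contained). It reuses the hypothetical `block_q4_0` layout sketched earlier and is not the AVX2 routine posted above.

```c
#include <stdint.h>

// Dot product of one Q4_0-compressed row with an uncompressed f32 row.
// n must be a multiple of QK; scalar reference only, no SIMD.
static float dot_q4_0_f32(const block_q4_0 * x, const float * y, int n) {
    float sum = 0.0f;
    for (int ib = 0; ib < n / QK; ++ib) {
        const block_q4_0 * b = &x[ib];
        float block_sum = 0.0f;
        for (int i = 0; i < QK / 2; ++i) {
            const uint8_t byte = b->qs[i];
            const int q0 = (byte & 0x0F) - 8;  // assumed bias-8 encoding
            const int q1 = (byte >>   4) - 8;
            block_sum += q0 * y[ib*QK + 2*i + 0] + q1 * y[ib*QK + 2*i + 1];
        }
        sum += b->d * block_sum;  // apply the per-block scale once
    }
    return sum;
}
```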
Just to cross-reference: 4-bit quantization does not give the expected performance improvement on non-Apple ARM processors. In fact, there is a drastic reduction in performance: ggerganov/whisper.cpp#540 (comment)
Is there a reason why llama.cpp supports 4-bit quantization on x86 processors, but GPT-J does not work with 4-bit on x86? Edit:
Dolly (which is GPT-J-like) quantised successfully, but loading fails: `gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]`
I made a note elsewhere, but I'm finding q4_1 to be worse than q4_0 in at least one instance. |
@ahoho |
Closes #5 #6 #24
We introduce efficient SIMD 4-bit integer quantisation running on the CPU.

First, some initial results on M1 Pro:
Language Models:
Here is a short sample run of `GPT-J` inference of 100 tokens:
Whisper:
Here is a short `Whisper Medium` run:
Details
Integer quantisation is a technique used to reduce the model size at the price of some accuracy. Instead of using floating point numbers to represent the weights of the model, one can use integers + scaling/offset factors to compress them.
There are different ways to perform the quantisation. In this PR, I investigated the following approaches:
Q4_0
A block of
QK
floating point numbersx_i
is represented by 1 scaling factor (f32) +QK/2
bytes. Each byte stores 2 4-bit integer scaling factors in the range[-7, 7]
. The f32 scaling factor is determined asabs(max(x_i))/7
. The compression ratio achieved with this approach compared to simplef16
storage is:ggml/src/ggml.c
Lines 411 to 439 in c686d70
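As a rough illustration of the scheme (not the ggml routine referenced above), here is a scalar sketch of quantising one block, reusing the hypothetical `block_q4_0` layout from the discussion; with `QK = 32` each block takes 4 + 16 = 20 bytes, versus 64 bytes for `f16` storage.

```c
#include <math.h>
#include <stdint.h>

// Quantise QK consecutive floats into one Q4_0 block (scalar reference).
static void quantize_block_q4_0(const float * x, block_q4_0 * out) {
    // find the largest magnitude in the block
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    const float d  = amax / 7.0f;                 // scaling factor: abs(max(x_i))/7
    const float id = d != 0.0f ? 1.0f / d : 0.0f; // avoid division by zero

    out->d = d;
    for (int i = 0; i < QK / 2; ++i) {
        // quantised values land in [-7, 7]; stored biased by 8 (assumption)
        const int q0 = (int)roundf(x[2*i + 0] * id);
        const int q1 = (int)roundf(x[2*i + 1] * id);
        out->qs[i] = (uint8_t)(q0 + 8) | (uint8_t)((q1 + 8) << 4);
    }
}
```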
Q4_1
Here we use 1 scaling factor (f32) together with 1 offset factor (f32). The f32 offset factor is determined as `min(x_i)`, while the f32 scaling factor is now `(max(x_i) - min(x_i))/15`. The integer values are again packed into `QK/2` bytes, but this time their range is `[0, 15]`. The compression ratio compared to simple `f16` storage is:

ggml/src/ggml.c Lines 443 to 488 in c686d70
This approach should be more accurate compared to `Q4_0`, but it comes at the cost of some extra computation due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.
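Along the same lines, here is a scalar sketch of Q4_1 block quantisation - again only an illustration under the same assumptions, with a hypothetical `block_q4_1` struct holding the two f32 factors plus `QK/2` packed bytes (a value decodes as `d*q + m`).

```c
#include <math.h>
#include <stdint.h>

// Hypothetical Q4_1 block: scale + offset + QK/2 packed bytes.
typedef struct {
    float   d;           // scaling factor: (max(x_i) - min(x_i))/15
    float   m;           // offset factor: min(x_i)
    uint8_t qs[QK / 2];  // two 4-bit values in [0, 15] per byte
} block_q4_1;

// Quantise QK consecutive floats into one Q4_1 block (scalar reference).
static void quantize_block_q4_1(const float * x, block_q4_1 * out) {
    float min = x[0];
    float max = x[0];
    for (int i = 1; i < QK; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    const float d  = (max - min) / 15.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    out->d = d;
    out->m = min;
    for (int i = 0; i < QK / 2; ++i) {
        const uint8_t q0 = (uint8_t)roundf((x[2*i + 0] - min) * id);
        const uint8_t q1 = (uint8_t)roundf((x[2*i + 1] - min) * id);
        out->qs[i] = q0 | (q1 << 4);  // both values already in [0, 15]
    }
}
```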
GQ

I also did a few experiments with general n-bit quantisation. However, I did not arrive at a technique that could be vectorised efficiently with SIMD, so I decided it is not worth it in the end. Most of the attempts can be found in: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c
Choosing QK
The tradeoff when selecting the optimal value for `QK` is that if you choose it too high, the compression ratio is better, but you lose accuracy. Additionally, not all `QK` values can be implemented efficiently - it depends on the available CPU instruction set.

So far, I have decided to use `QK = 32` for 128-bit `ARM_NEON` - it seems this size is more compatible with the available SIMD intrinsics/registers. For `AVX2` support, I think `QK = 64` might turn out to be a better fit for the 256-bit registers. However, if the performance difference between `QK = 32` and `QK = 64` is not very large, I might end up using `QK = 32` for all architectures - it will make the code significantly simpler.
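To make the tradeoff concrete, here is a small illustrative calculation of the Q4_0 block size relative to `f16` for a few block sizes: the packed values of one block take `QK/2` bytes, so 16 bytes at `QK = 32` fill one 128-bit NEON register and 32 bytes at `QK = 64` fill one 256-bit AVX2 register.

```c
#include <stdio.h>

// bytes per Q4_0 block: 4 (f32 scale) + QK/2 packed values; f16 baseline: 2*QK bytes
static double q4_0_vs_f16(int qk) {
    return (4.0 + qk / 2.0) / (2.0 * qk);
}

int main(void) {
    for (int qk = 16; qk <= 128; qk *= 2) {
        printf("QK = %3d: packed values = %3d bytes, size vs f16 = %.3f\n",
               qk, qk / 2, q4_0_vs_f16(qk));
    }
    return 0;
}
```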
Running

First, convert an existing F16 or F32 `ggml` model to a 4-bit quantised one:

Note: The format of the GPT-2 and GPT-J ggml model files has been changed in this PR, so you cannot directly use an existing model file. You will have to create a new one, using the updated Python scripts in this branch.
The Whisper models, on the other hand, are still compatible, so you can quantise them directly.
You can now simply use the generated quantised model files instead of the regular models as usual.
Implementation progress
- Q4_0
- Q4_1