4-bit Integer quantisation #27

Merged: ggerganov merged 38 commits into master from gq on Mar 29, 2023

Conversation

@ggerganov (Owner) commented Feb 26, 2023

close #5 #6 #24

We introduce efficient SIMD 4-bit integer quantisation running on the CPU.

First, some initial results on M1 Pro:

Language Models:

| Model | Params | Size (old) | Time / Token (old) | Size (new) | Time / Token (new) |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | 1558 M | 2976 MB | 42 ms | 937 MB | 17 ms |
| GPT-J | 6 B | 11543 MB | 125 ms | 3610 MB | 46 ms |

Here is a short sample run of `GPT-J` inference (100 tokens):
$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model-q4_0.bin -p "This pull request imlpements integer quantization." -t 8 -n 100

main: seed = 1677426680
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: memory_size =  1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
main: number of tokens in prompt = 15

This pull request imlpements integer quantization. We can see that in a lot of cases, we can get at least a one line of code reduction without changing semantics in any way.

To be more explicit about the trade-offs in our analysis. We can see that it is possible to get about a 70% reduction in execution time, and a 25% reduction in memory usage, while adding only about a 1.5% reduction in code size, and only incresing the number of branches.

This is a trade

main: mem per token = 16041732 bytes
main:     load time =  1187.43 ms
main:   sample time =    14.53 ms
main:  predict time =  5199.36 ms / 45.61 ms per token
main:    total time =  6581.01 ms

Whisper:

| Model | Params | Size (old) | Mem (old) | Size (new) | Mem (new) |
| --- | --- | --- | --- | --- | --- |
| Whisper Tiny | 39 M | 74 MB | 127 MB | 26 MB | 79 MB |
| Whisper Base | 74 M | 141 MB | 215 MB | 48 MB | 123 MB |
| Whisper Small | 244 M | 465 MB | 603 MB | 153 MB | 291 MB |
| Whisper Medium | 769 M | 1462 MB | 1720 MB | 469 MB | 726 MB |
| Whisper Large | 1550 M | 2951 MB | 3336 MB | 939 MB | 1324 MB |

Here is a short `Whisper Medium` run:
$ ./bin/whisper -m models/whisper-medium/ggml-model-q4_0.bin -f ../../whisper.cpp/samples/jfk.wav -t 8

whisper_init_from_file: loading model from 'models/whisper-medium/ggml-model-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: processing '../../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.040]   And so my fellow Americans, ask not what your country can do for you,
[00:00:08.040 --> 00:00:10.900]   ask what you can do for your country.


whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   221.70 ms
whisper_print_timings:      mel time =     8.65 ms
whisper_print_timings:   sample time =    13.65 ms /    29 runs (    0.47 ms per run)
whisper_print_timings:   encode time =  1994.48 ms /     1 runs ( 1994.48 ms per run)
whisper_print_timings:   decode time =   305.18 ms /    29 runs (   10.52 ms per run)
whisper_print_timings:    total time =  2560.79 ms

Details

Integer quantisation is a technique used to reduce the model size at the price of some accuracy. Instead of using floating point numbers to represent the weights of the model, one can use integers plus scaling/offset factors to compress them.

There are different ways to perform the quantisation. In this PR, I investigated the following approaches:

Q4_0

A block of QK floating point numbers x_i is represented by 1 scaling factor (f32) + QK/2 bytes. Each byte stores two 4-bit integer values in the range [-7, 7]. The f32 scaling factor is determined as max(abs(x_i))/7. The compression ratio achieved with this approach compared to simple f16 storage is:

C = (4 + QK/2)/(2*QK)
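For example, with QK = 32 a block of 32 f16 values (64 bytes) is reduced to 4 + 16 = 20 bytes, giving C = 20/64 ≈ 0.31.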

ggml/src/ggml.c (lines 411 to 439 in c686d70):

// scalar
// (excerpt from quantize_row_q4_0: pd points to the per-block f32 scales, pb to the packed
//  4-bit values, pp is a temporary QK/2-byte buffer - see quantize_row_q4_1 below for the full layout)
for (int i = 0; i < nb; i++) {
    float amax = 0.0f; // absolute max

    for (int l = 0; l < QK; l++) {
        const float v = x[i*QK + l];
        amax = MAX(amax, fabsf(v));
    }

    const float d  = amax / ((1 << 3) - 1);
    const float id = d ? 1.0f/d : 0.0f;

    pd[i] = d;

    for (int l = 0; l < QK; l += 2) {
        const float v0 = x[i*QK + l + 0]*id;
        const float v1 = x[i*QK + l + 1]*id;

        const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
        const uint8_t vi1 = ((int8_t) (round(v1))) + 8;

        assert(vi0 >= 0 && vi0 < 16);
        assert(vi1 >= 0 && vi1 < 16);

        pp[l/2] = vi0 | (vi1 << 4);
    }

    memcpy(pb + i*QK/2, pp, sizeof(pp));
}
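For reference, here is a minimal scalar sketch of the corresponding dequantization (illustration only - the function name and pointer layout mirror the quantizer above and are not the actual ggml routine):

void dequantize_row_q4_0_ref(const float * pd, const uint8_t * pb, float * y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        const float d = pd[i]; // per-block f32 scale

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pb[i*QK/2 + l/2]; // two 4-bit values per byte

            // undo the +8 offset applied during quantization and rescale
            y[i*QK + l + 0] = ((int8_t) (vi & 0x0F) - 8)*d;
            y[i*QK + l + 1] = ((int8_t) (vi >> 4)   - 8)*d;
        }
    }
}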

Q4_1

Here we use 1 scaling factor (f32) together with 1 offset factor (f32). The f32 offset factor is determined as min(x_i), while the f32 scaling factor is now (max(x_i) - min(x_i))/15. The integer values are again packed into QK/2 bytes, but this time their range is [0, 15]. The compression ratio compared to simple f16 storage is:

C = (8 + QK/2)/(2*QK)
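For example, with QK = 32 this gives C = (8 + 16)/64 = 0.375 - slightly worse than Q4_0 because of the extra f32 offset per block.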

ggml/src/ggml.c (lines 443 to 488 in c686d70):


// method 4
// blocks of QK elements
// represented with 2 floats (min + delta) and QK/2 8-bit ints (i.e QK 4-bit unsigned integer factors)
void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;

    float   * restrict pm = (float *)   (y);
    float   * restrict pd = (float *)   (pm + nb);
    uint8_t * restrict pb = (uint8_t *) (pd + nb);

    uint8_t pp[QK/2];

    for (int i = 0; i < nb; i++) {
        float min =  FLT_MAX;
        float max = -FLT_MAX;

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            if (v < min) min = v;
            if (v > max) max = v;
        }

        const float d  = (max - min) / ((1 << 4) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        pm[i] = min;
        pd[i] = d;

        for (int l = 0; l < QK; l += 2) {
            const float v0 = (x[i*QK + l + 0] - min)*id;
            const float v1 = (x[i*QK + l + 1] - min)*id;

            const uint8_t vi0 = round(v0);
            const uint8_t vi1 = round(v1);

            assert(vi0 >= 0 && vi0 < 16);
            assert(vi1 >= 0 && vi1 < 16);

            pp[l/2] = vi0 | (vi1 << 4);
        }

        memcpy(pb + i*QK/2, pp, sizeof(pp));
    }
}
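During dequantization, each value is reconstructed as x ≈ min + d*v, with v in [0, 15] - the inverse of the packing above.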

This approach should be more accurate than Q4_0, but it comes at the cost of some extra computation due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.

GQ

I also did a few experiments with general n-bit quantisation. However, I did not arrive at a technique that could be vectorised efficiently with SIMD, so I decided it is not worth it in the end. Most of the attempts can be found in: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c

Choosing QK

The trade-off when selecting QK is that a larger value gives a better compression ratio but lower accuracy. Additionally, not all QK values can be implemented efficiently - it depends on the available CPU instruction set.

So far, I have decided to use QK = 32 for 128-bit ARM_NEON - this size seems to fit well with the available SIMD intrinsics/registers. For AVX2 support, I think QK = 64 might turn out to be a better fit for the 256-bit registers. However, if the performance difference between QK = 32 and QK = 64 is not very large, I might end up using QK = 32 for all architectures - it will make the code significantly simpler.
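For intuition, with QK = 32 the packed values of one block occupy exactly 16 bytes, i.e. one 128-bit NEON register. A hypothetical interleaved block layout (as discussed further down in this thread; the code in this PR stores all scales first, followed by all packed values) could look like this - field names are assumptions, not the actual ggml definitions:

#include <stdint.h>

#define QK 32

// Hypothetical interleaved Q4_0 block for QK = 32 (illustration only)
typedef struct {
    float   d;          //  4 bytes: f32 scaling factor
    uint8_t qs[QK/2];   // 16 bytes: 32 x 4-bit values = exactly one 128-bit register
} block_q4_0;           // 20 bytes per 32 weights, vs 64 bytes for f16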

Running

First, convert an existing F16 or F32 ggml model to 4-bit quantised one:

# quantize GPT-2 model using Q4_0
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize GPT-2 model using Q4_1
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

# quantize GPT-J model using Q4_0
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize GPT-J model using Q4_1
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

# quantize Whisper model using Q4_0
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize Whisper model using Q4_1
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
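In each of the commands above, the trailing argument selects the quantisation type: 2 for Q4_0 and 3 for Q4_1.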

Note: The format of the GPT-2 and GPT-J ggml model files has been changed in this PR, so you cannot directly use an existing model file. You will have to create a new one, using the updated python scripts in this branch.
The Whisper models on the other hand are still compatible, so you can quantise them directly.

You can now simply use the generated quantised model files instead of the regular models as usual.

Implementation progress

Q4_0

  • Scalar
  • ARM_NEON
  • AVX2
  • WASM SIMD

Q4_1

  • Scalar
  • ARM_NEON
  • AVX2
  • WASM SIMD

@ocordeiro (Contributor)

How do I run this with the GPT-J-6B model?
I'm getting the following error:

gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]

Steps to reproduce:

#  Get this branch
git checkout gq && git pull

# Build GPT-J and GPT-J-quantize
make gpt-j && make gpt-j-quantize

# Download GPT-J-6B model
./examples/gpt-j/download-ggml-model.sh 6B

# Quantize GPT-J-6B model 
./bin/gpt-j-quantize ../models/gpt-j-6B/ggml-model.bin ../gpt-j-ggml-model-q4_0.bin 2

#  Run GPT-J-6B model
./build/bin/gpt-j -m ./gpt-j-ggml-model-q4_0.bin -p "This is an example"

  • Environment: M1 Air - macOS 13.2

@ggerganov (Owner, Author)

@ocordeiro

Due to the quantization changes, I had to transpose a few of the tensors in the model.
So this makes the old ggml files incompatible with the quantization branch.

In order to make it work, you have to convert the original H5 data using the convert-h5-to-ggml.py from this branch.
To do that, you need to download the full GPT-J model from here: https://huggingface.co/EleutherAI/gpt-j-6B
And run the command:

python3 examples/gpt-j/convert-h5-to-ggml.py ./models/gpt-j-6B 0

After you convert the Python model to a ggml model, you can then use the gpt-j-quantize command to quantize it.

The process is a bit tedious now, but when the implementation is ready, I will upload the quantized models to Hugging Face and it will be easier.

@ocordeiro (Contributor)

Great. Thank you very much for the explanation. I will do this

@ocordeiro (Contributor)

It worked and it's impressive.
Here are the results on my M1 Air 8GB:

main: mem per token = 15976132 bytes
main:     load time =  2016.22 ms
main:   sample time =    32.71 ms
main:  predict time = 18798.93 ms / 92.61 ms per token
main:    total time = 21609.82 ms

@tmzt commented Mar 5, 2023

@ocordeiro or anyone else,

can you upload the ggml weights to HF, bittorrent, etc.?

@ocordeiro (Contributor)

@tmzt it's here until @ggerganov launches the official version:
https://huggingface.co/ocordeiro/ggml-gpt-j-6b-q4_0

@Const-me

I'm not sure I fully understood your spec, but here's an AVX2 decompressor for these blocks:
https://gist.github.com/Const-me/a0529a8c9885d371138a1c50e0622040
I've tested it very little and haven't tested performance at all, but it seems to work for the one test I have implemented.
Feel free to copy-paste.

@ggerganov (Owner, Author)

@Const-me
Awesome! Thank you for this.
During inference, the most crucial parts that have to run fast are:

- quantize_row_q4_0
- ggml_vec_dot_q4_0

For the first one, I have this version, but I don't know if it is optimal yet:

https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L2038-L2113

For the second one, I have a version for QK == 64, but I need one for QK == 32:

https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L1816-L1870

Any advice on the implementation and making it more efficient will be appreciated!
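For reference, a minimal scalar sketch of what ggml_vec_dot_q4_0 computes, assuming the scale/packed-value layout of the quantizers above (illustration only - not the SIMD implementation being discussed):

float vec_dot_q4_0_ref(int n, const float * pd0, const uint8_t * pb0,
                              const float * pd1, const uint8_t * pb1) {
    assert(n % QK == 0);
    const int nb = n / QK;

    float sum = 0.0f;

    for (int i = 0; i < nb; i++) {
        const float d0 = pd0[i]; // scale of block i in row 0
        const float d1 = pd1[i]; // scale of block i in row 1

        for (int l = 0; l < QK; l += 2) {
            const uint8_t b0 = pb0[i*QK/2 + l/2];
            const uint8_t b1 = pb1[i*QK/2 + l/2];

            // unpack the two 4-bit values of each byte and undo the +8 offset
            const int x0 = (b0 & 0x0F) - 8, x1 = (b0 >> 4) - 8;
            const int y0 = (b1 & 0x0F) - 8, y1 = (b1 >> 4) - 8;

            sum += d0*d1*(x0*y0 + x1*y1);
        }
    }

    return sum;
}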

@Const-me

@ggerganov Here's the code.
https://gist.github.com/Const-me/65ff46c31553493d13fcd6646e162494

The implementation of quantize_row_q4_0 is in compressRow40 function in that source file.

The implementation of ggml_vec_dot_q4_0 is in the dotProductCompressed40 function in that source file.

Again, it has been tested very little so there could be bugs, and I have not measured performance.

A couple of general notes.

About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So the Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.

Another thing: I don't understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16 - upcasting/downcasting vectors is one fast instruction).

Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor. Take a look at how I did that for the hybrid model of Whisper (currently disabled with a macro, but it should work):
https://github.com/Const-me/Whisper/blob/master/Whisper/CPU/mulMatImpl.h
And the rest of the mulMat*.* files in that folder.
That implementation is very specialized - it only supports FP32*FP16, and I only tested it for the decode step of the algorithm. But still, it's substantially faster than what's in GGML.

Also, see this answer on Stack Overflow: https://stackoverflow.com/a/75567894/126995 I wrote that answer for matrix*vector products, but it is possible to use a similar memory layout for matrix*matrix as well.

@ggerganov (Owner, Author) commented Mar 11, 2023

@Const-me

Thank you so much - you are the best!

I just added AVX2 support to llama.cpp thanks to your code snippets: ggerganov/llama.cpp@f1eaff4

> About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So the Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.

Already did that today in the llama.cpp repo - it was necessary for consolidating the larger LLaMA models anyway.
Will need to migrate the changes here at some point.

> Another thing: I don't understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16 - upcasting/downcasting vectors is one fast instruction).

The idea is to reduce memory bandwidth. I think the computation becomes memory-bound when running on many cores, so it is more important to reduce the data size than to optimize the calculations. I could be wrong...

> Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor.

I know! I started doing this with very little knowledge about GEMM and I am sure there is a lot of room for improvement.
Thank you again for all your help.

Edit: fixed wrong quotes

@Const-me

@ggerganov About the compression of intermediate tensors, I've made another function you could try, dotProduct_q40_f16. I'm not sure what you'll find, but it's possible FP16 intermediates might be slightly faster than Q4-compressed ones.

That block compression is slower than downcasting floats to FP16. And processors often have many megabytes of L3 cache - for example, my processor has 16 MB. The intermediate tensors which were just computed from something else might still be in that cache.

Narsil added a commit to huggingface/safetensors that referenced this pull request Mar 17, 2023
@meakbiyik commented Mar 19, 2023

Just to cross-reference: 4-bit quantization does not give the expected performance improvement on non-Apple ARM processors. In fact, there is a drastic reduction in performance: ggerganov/whisper.cpp#540 (comment)

@mallorbc commented Mar 26, 2023

Is there a reason why llama.cpp supports 4-bit quantization on x86 processors, but GPT-J does not work with 4-bit on x86?

Edit:
Looking at some of the commits and the edit history of the main comment, it seems that x86 may be supported now and the comment just doesn't reflect that. I see commits relating to x86 from 3 weeks ago, while the main comment was last updated a month ago. I will try to see if I can get 4-bit working on x86.

@iamfaith

A Dolly-like GPT-J model quantizes successfully but fails to load:

gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]

@ggerganov ggerganov merged commit acd4aee into master Mar 29, 2023
@ggerganov ggerganov deleted the gq branch March 29, 2023 19:21
@ahoho commented Apr 2, 2023

I made a note elsewhere, but I'm finding q4_1 to be worse than q4_0 in at least one instance.

@ggerganov (Owner, Author)

@ahoho
There might be a bug in the ARM_NEON Q4_1 implementation - I've received additional reports indicating that. I still haven't had time to look into it.
