
gguf-py : Numpy dequantization for most types #8939

Merged
merged 2 commits into master on Aug 11, 2024

Conversation

compilade (Collaborator) commented on Aug 9, 2024

This implements dequantization in Python (using Numpy) for Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ1_S, IQ1_M, IQ4_NL, and IQ4_XS, resulting in the same float32 values as the reference C implementations.

This should be useful for #8831

The only types for which dequantization is not implemented are the grouped Q4_0 and Q8_0 variants added in #5780 (because I did not find their reference dequantization functions).

This also adds quantization for Q4_0, Q4_1, Q5_0, and Q5_1. By doing this I've noticed that Q4_0 and Q5_0 (but not the others) have platform-dependent rounding in the reference C version, which depends on whether ggml was compiled with fused multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.

I've verified that all added quantization and dequantization functions result in the same bits as the reference C implementations, by using gguf-py/tests/test_quants.py which I've added for this purpose. It requires building ggml with cmake and BUILD_SHARED_LIBS.
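For illustration, here is a minimal Numpy sketch of Q4_0 dequantization in the spirit of this PR (the function name, the `raw` argument, and the array handling are hypothetical, not the actual gguf-py code): each Q4_0 block stores a float16 scale d followed by 32 packed 4-bit quants, and each value dequantizes to d * (q - 8).

```python
import numpy as np

QK4_0 = 32  # elements per Q4_0 block

def dequantize_q4_0(raw: np.ndarray) -> np.ndarray:
    # Hypothetical sketch: `raw` is a 1-D uint8 array holding whole Q4_0 blocks,
    # each 2 bytes of float16 scale followed by 16 bytes of packed 4-bit quants.
    blocks = raw.reshape(-1, 2 + QK4_0 // 2)
    d = np.ascontiguousarray(blocks[:, :2]).view(np.float16).astype(np.float32)
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).astype(np.int8) - 8   # low nibbles: first 16 values of each block
    hi = (qs >> 4).astype(np.int8) - 8     # high nibbles: last 16 values of each block
    q = np.concatenate([lo, hi], axis=1).astype(np.float32)
    return (d * q).reshape(-1)             # broadcast the per-block scale
```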


@github-actions bot added the python (python script changes) label on Aug 9, 2024
@compilade added the enhancement, Review Complexity : Medium, and Tensor Encoding Scheme labels on Aug 9, 2024
ggerganov (Owner) left a comment

> This also adds quantization for Q4_0, Q4_1, Q5_0, and Q5_1. By doing this I've noticed that Q4_0 and Q5_0 (but not the others) have platform-dependent rounding in the reference C version, which depends on whether ggml was compiled with fused multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.

Would something like this work?

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index d5b91c2d..4c0dd3c8 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -683,11 +683,11 @@ void quantize_row_q4_0_ref(const float * restrict x, block_q4_0 * restrict y, in
         y[i].d = GGML_FP32_TO_FP16(d);
 
         for (int j = 0; j < qk/2; ++j) {
-            const float x0 = x[i*qk + 0    + j]*id;
-            const float x1 = x[i*qk + qk/2 + j]*id;
+            const float x0 = x[i*qk + 0    + j];
+            const float x1 = x[i*qk + qk/2 + j];
 
-            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
-            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
+            const uint8_t xi0 = MIN(15, (int8_t)(fmaf(x0, id, 8.5f)));
+            const uint8_t xi1 = MIN(15, (int8_t)(fmaf(x1, id, 8.5f)));
 
             y[i].qs[j]  = xi0;
             y[i].qs[j] |= xi1 << 4;

compilade (Collaborator, Author) commented

> Would something like this work?

Yes, using fmaf should work. But I would prefer to avoid using FMA in the reference implementation, because in Numpy this has to be done with intermediate float64 values.

Maybe something like MIN(15, ((int8_t) (x * id + 0.5f)) + 8) could work to avoid FMA rounding differences, though, because this problem is not apparent with Q4_1 and Q5_1, which use + 0.5f.

But this makes clang-tidy complain with bugprone-incorrect-roundings (also for the current Q4_1 and Q5_1), which means roundf or lroundf should probably be used instead (otherwise 0.49999997f rounds to 1.0f and -1.4f rounds to 0.0f, according to the examples there).

@ggerganov Since this problem affects rounding in the reference quantization for Q4_0, Q4_1, Q5_0, and Q5_1, should this be fixed in a separate PR, or should I change them to use roundf here?
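To make the clang-tidy concern concrete, here is a small Python check (a sketch using the example values from the bugprone-incorrect-roundings documentation) showing that adding 0.5f and then truncating is not a reliable round-to-nearest:

```python
import numpy as np

# Casting after adding 0.5f truncates toward zero, so it is not round-to-nearest:
print(int(np.float32(0.49999997) + np.float32(0.5)))  # 1, but round(0.49999997) == 0
print(int(np.float32(-1.4) + np.float32(0.5)))        # 0, but round(-1.4) == -1
print(int(np.round(np.float32(-1.4))))                # -1, as roundf would give
```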

ggerganov (Owner) commented

Let's fix it in a separate PR

@compilade added the merge ready label on Aug 11, 2024
@compilade merged commit 4134999 into master on Aug 11, 2024
10 checks passed
compilade (Collaborator, Author) commented on Aug 14, 2024

Related to the FMA rounding of Q4_0 and Q5_0, it seems that all k-quants (and I guess also i-quants) have platform-dependent rounding because nearest_int is marked inline, and so it uses FMA when rounding the weights multiplied by the inverted scale (when compiled with FMA support).

And that's not all: the scale selection logic in make_qx_quant also differs when FMA is enabled, because sums like sumlx and suml2 (partially?) use FMA when they can. So fixing only the rounding would not solve the problem for k-quants (and i-quants).

I was working on quantizing Q6_K with Numpy, and I bumped into this when comparing output with the reference implementation.

Now I'm wondering whether quantization should really be the same on all platforms, since FMA does help reduce some rounding errors (although not much), and it's usually also good for performance when the CPU supports it. Explicitly using FMA everywhere might also work, although a cumulative FMA sum seems very hard to do efficiently in Numpy.

And I'm not sure how to disable FMA only for the reference quantization functions. Maybe by putting them in their own file and using #pragma GCC optimize("no-unsafe-math-optimizations") at the top, or by setting file-specific options in the build scripts if that doesn't work.

Is platform-independent reproducible quantization worth it? I don't know. It's more complicated than I thought.
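For context, a hedged sketch of how a single elementwise fmaf can be approximated with float64 intermediates in Numpy, as mentioned above (the helper name is hypothetical, not the actual gguf-py code): the product of two float32 values is exact in float64, so the only extra rounding versus a true fmaf is the final cast back to float32, which matches in all but rare double-rounding cases. A cumulative FMA sum, as used in make_qx_quant, does not vectorize this easily, which is the difficulty mentioned above.

```python
import numpy as np

def fma_f32(a, b, c):
    # Hypothetical helper: a float32*float32 product is exact in float64
    # (48 significand bits fit in 53), so only the final float32 cast rounds.
    a64 = np.asarray(a, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    c64 = np.asarray(c, dtype=np.float64)
    return (a64 * b64 + c64).astype(np.float32)

# e.g. mirroring the proposed Q4_0 rounding MIN(15, (int8_t)fmaf(x0, id, 8.5f)):
# np.minimum(15, fma_f32(x, id, 8.5).astype(np.int8))
```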

ggerganov (Owner) commented

Maybe we should disable the auto-FMA contractions altogether in the CPU code (-ffp-contract=off) when doing reference quantizations. Reading some docs, it seems that GCC already disables those by default when the standard is explicitly set to -std=c11:

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract
