
gguf-py : Numpy dequantization for most types #8939

Merged
merged 2 commits into master on Aug 11, 2024

Conversation

compilade (Collaborator) commented on Aug 9, 2024

This implements dequantization in Python (using Numpy) for Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ1_S, IQ1_M, IQ4_NL, and IQ4_XS, resulting in the same float32 values as the reference C implementations.

This should be useful for #8831

The only types for which dequantization is not implemented are the grouped Q4_0 and Q8_0 variants added in #5780 (because I did not find their reference dequantization functions).

This also adds quantization for Q4_0, Q4_1, Q5_0, and Q5_1. By doing this I've noticed that Q4_0 and Q5_0 (but not the others) have platform-dependent rounding in the reference C version, which depends on whether ggml was compiled with fused multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.

I've verified that all added quantization and dequantization functions result in the same bits as the reference C implementations, by using gguf-py/tests/test_quants.py which I've added for this purpose. It requires building ggml with cmake and BUILD_SHARED_LIBS.
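For illustration, here is a minimal Numpy sketch of Q4_0 dequantization in the spirit of this PR (the function name, the `raw` argument, and the array handling are hypothetical, not the actual gguf-py code): each Q4_0 block stores a float16 scale d followed by 32 packed 4-bit quants, and each value dequantizes to d * (q - 8).

```python
import numpy as np

QK4_0 = 32  # elements per Q4_0 block

def dequantize_q4_0(raw: np.ndarray) -> np.ndarray:
    # Hypothetical sketch: `raw` is a 1-D uint8 array holding whole Q4_0 blocks,
    # each 2 bytes of float16 scale followed by 16 bytes of packed 4-bit quants.
    blocks = raw.reshape(-1, 2 + QK4_0 // 2)
    d = np.ascontiguousarray(blocks[:, :2]).view(np.float16).astype(np.float32)
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).astype(np.int8) - 8   # low nibbles: first 16 values of each block
    hi = (qs >> 4).astype(np.int8) - 8     # high nibbles: last 16 values of each block
    q = np.concatenate([lo, hi], axis=1).astype(np.float32)
    return (d * q).reshape(-1)             # broadcast the per-block scale
```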


@github-actions bot added the python (python script changes) label on Aug 9, 2024
@compilade added the enhancement, Review Complexity : Medium, and Tensor Encoding Scheme labels on Aug 9, 2024
ggerganov (Owner) left a comment

> This also adds quantization for Q4_0, Q4_1, Q5_0, and Q5_1. By doing this I've noticed that Q4_0 and Q5_0 (but not the others) have platform-dependent rounding in the reference C version, which depends on whether ggml was compiled with fused multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.

Would something like this work?

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index d5b91c2d..4c0dd3c8 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -683,11 +683,11 @@ void quantize_row_q4_0_ref(const float * restrict x, block_q4_0 * restrict y, in
         y[i].d = GGML_FP32_TO_FP16(d);
 
         for (int j = 0; j < qk/2; ++j) {
-            const float x0 = x[i*qk + 0    + j]*id;
-            const float x1 = x[i*qk + qk/2 + j]*id;
+            const float x0 = x[i*qk + 0    + j];
+            const float x1 = x[i*qk + qk/2 + j];
 
-            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
-            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
+            const uint8_t xi0 = MIN(15, (int8_t)(fmaf(x0, id, 8.5f)));
+            const uint8_t xi1 = MIN(15, (int8_t)(fmaf(x1, id, 8.5f)));
 
             y[i].qs[j]  = xi0;
             y[i].qs[j] |= xi1 << 4;

compilade (Collaborator, Author) commented

> Would something like this work?

Yes, using fmaf should work. But I would prefer to avoid using FMA in the reference implementation, because in Numpy this has to be done with intermediate float64 values.

Maybe something like MIN(15, ((int8_t) (x * id + 0.5f)) + 8) could work to avoid FMA rounding differences, though, because this problem is not apparent with Q4_1 and Q5_1, which use + 0.5f.

But this makes clang-tidy complain with bugprone-incorrect-roundings (also for the current Q4_1 and Q5_1), which means roundf or lroundf should probably be used instead (otherwise 0.49999997f rounds to 1.0f and -1.4f rounds to 0.0f, according to the examples there).

@ggerganov Since this problem affects rounding in the reference quantization for Q4_0, Q4_1, Q5_0, and Q5_1, should this be fixed in a separate PR, or should I change them to use roundf here?
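To make the clang-tidy concern concrete, here is a small Python check (a sketch using the example values from the bugprone-incorrect-roundings documentation) showing that adding 0.5f and then truncating is not a reliable round-to-nearest:

```python
import numpy as np

# Casting after adding 0.5f truncates toward zero, so it is not round-to-nearest:
print(int(np.float32(0.49999997) + np.float32(0.5)))  # 1, but round(0.49999997) == 0
print(int(np.float32(-1.4) + np.float32(0.5)))        # 0, but round(-1.4) == -1
print(int(np.round(np.float32(-1.4))))                # -1, as roundf would give
```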

ggerganov (Owner) commented

Let's fix it in a separate PR

@compilade added the merge ready label on Aug 11, 2024
@compilade merged commit 4134999 into master on Aug 11, 2024
10 checks passed
compilade (Collaborator, Author) commented on Aug 14, 2024

Related to the FMA rounding of Q4_0 and Q5_0, it seems that all k-quants (and I guess also i-quants) have platform-dependent rounding because nearest_int is marked inline, and so it uses FMA when rounding the weights multiplied by the inverted scale (when compiled with FMA support).

And that's not all: the scale selection logic in make_qx_quant also differs when FMA is enabled, because sums like sumlx and suml2 (partially?) use FMA when they can. So fixing only the rounding would not solve the problem for k-quants (and i-quants).

I was working on quantizing Q6_K with Numpy, and I bumped into this when comparing output with the reference implementation.

Now I'm wondering whether quantization should really be the same on all platforms, since FMA does help reduce some rounding errors (although not much), and it's usually also good for performance when the CPU supports it. Explicitly using FMA everywhere might also work, although a cumulative FMA sum seems very hard to do efficiently in Numpy.

And I'm not sure how to disable FMA only for the reference quantization functions. Maybe by putting them in their own file and using #pragma GCC optimize("no-unsafe-math-optimizations") at the top, or by setting file-specific options in the build scripts if that doesn't work.

Is platform-independent reproducible quantization worth it? I don't know. It's more complicated than I thought.
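For context, a hedged sketch of how a single elementwise fmaf can be approximated with float64 intermediates in Numpy, as mentioned above (the helper name is hypothetical, not the actual gguf-py code): the product of two float32 values is exact in float64, so the only extra rounding versus a true fmaf is the final cast back to float32, which matches in all but rare double-rounding cases. A cumulative FMA sum, as used in make_qx_quant, does not vectorize this easily, which is the difficulty mentioned above.

```python
import numpy as np

def fma_f32(a, b, c):
    # Hypothetical helper: a float32*float32 product is exact in float64
    # (48 significand bits fit in 53), so only the final float32 cast rounds.
    a64 = np.asarray(a, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    c64 = np.asarray(c, dtype=np.float64)
    return (a64 * b64 + c64).astype(np.float32)

# e.g. mirroring the proposed Q4_0 rounding MIN(15, (int8_t)fmaf(x0, id, 8.5f)):
# np.minimum(15, fma_f32(x, id, 8.5).astype(np.int8))
```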

ggerganov (Owner) commented

Maybe we should disable the auto-FMA contractions altogether in the CPU code (-ffp-contract=off) when doing reference quantizations. Reading some docs, it seems that GCC already disables those by default when the standard is explicitly set to -std=c11:

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract
