gguf-py : Numpy dequantization for most types #8939
Conversation
This implements dequantization in Python (using Numpy) for `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q2_K`, `Q3_K`, `Q4_K`, `Q5_K`, `Q6_K`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`, `IQ1_S`, `IQ1_M`, `IQ4_NL`, and `IQ4_XS`, resulting in the same `float32` values as the reference C implementations. This should be useful for #8831.

The only types for which dequantization is not implemented are the grouped `Q4_0` and `Q8_0` variants added in #5780 (because I did not find their reference dequantization functions).

This also adds quantization for `Q4_0`, `Q4_1`, `Q5_0`, and `Q5_1`. By doing this I've noticed that `Q4_0` and `Q5_0` (but not the others) have platform-dependent rounding in the reference C version, which depends on whether `ggml` was compiled with fused multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.

I've verified that all added quantization and dequantization functions result in the same bits as the reference C implementations, using `gguf-py/tests/test_quants.py`, which I've added for this purpose. It requires building `ggml` with `cmake` and `BUILD_SHARED_LIBS`.
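To make the block layout and the rounding point above concrete, here is a minimal Numpy sketch of `Q4_0` quantization and dequantization mirroring `quantize_row_q4_0_ref`. This is an illustration, not the code in this PR: the function names are mine, and the float64 intermediate is one way to reproduce the single rounding of `fmaf` in pure Numpy.

```python
import numpy as np

QK4_0 = 32  # Q4_0 block size in ggml

def quantize_q4_0(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Quantize float32 values (length a multiple of 32) into Q4_0 blocks.
    blocks = x.reshape(-1, QK4_0).astype(np.float32)
    # The scale is the signed element with the largest magnitude, divided by -8.
    imax = np.abs(blocks).argmax(axis=-1, keepdims=True)
    max_ = np.take_along_axis(blocks, imax, axis=-1)
    d = (max_ / -8).astype(np.float32)
    with np.errstate(divide="ignore"):
        inv_d = np.where(d == 0, np.float32(0), np.float32(1) / d)
    # Going through float64 rounds x*inv_d + 8.5 only once, like fmaf() in C
    # (and unlike a separate float32 multiply and add on non-FMA builds).
    q = np.trunc(blocks.astype(np.float64) * inv_d.astype(np.float64) + 8.5)
    q = np.minimum(q, 15).astype(np.uint8)
    # Pack two 4-bit values per byte: the first half of each block goes
    # into the low nibbles, the second half into the high nibbles.
    qs = q[:, :QK4_0 // 2] | (q[:, QK4_0 // 2:] << 4)
    return d.astype(np.float16), qs

def dequantize_q4_0(d: np.ndarray, qs: np.ndarray) -> np.ndarray:
    # Unpack the nibbles, recenter around zero, and scale back to float32.
    q = np.concatenate([qs & 0x0F, qs >> 4], axis=-1).astype(np.int8) - 8
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)
```

The actual, fully vectorized implementations added by this PR live in `gguf-py/gguf/quants.py`.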
Would something like this work?
```diff
diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index d5b91c2d..4c0dd3c8 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -683,11 +683,11 @@ void quantize_row_q4_0_ref(const float * restrict x, block_q4_0 * restrict y, int64_t k) {
         y[i].d = GGML_FP32_TO_FP16(d);
 
         for (int j = 0; j < qk/2; ++j) {
-            const float x0 = x[i*qk + 0    + j]*id;
-            const float x1 = x[i*qk + qk/2 + j]*id;
+            const float x0 = x[i*qk + 0    + j];
+            const float x1 = x[i*qk + qk/2 + j];
 
-            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
-            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
+            const uint8_t xi0 = MIN(15, (int8_t)(fmaf(x0, id, 8.5f)));
+            const uint8_t xi1 = MIN(15, (int8_t)(fmaf(x1, id, 8.5f)));
 
             y[i].qs[j]  = xi0;
             y[i].qs[j] |= xi1 << 4;
```
Yes, using `fmaf` explicitly would work. Maybe something like […], but this makes […].

@ggerganov Since this problem affects rounding in the reference quantization for `Q4_0` and `Q5_0`, should it be fixed in this PR or in a separate one?
Let's fix it in a separate PR
Related to the FMA rounding of `Q4_0` and `Q5_0` […]. And that's not all: the scale selection logic in […]. I was working on quantizing […].

Now I'm wondering if quantization should really be the same on all platforms or not, since FMA does help with reducing some rounding errors (although not much), and it's usually also good for performance when the CPU supports it. Explicitly using FMA everywhere might also work, although a cumulative FMA sum seems very hard to do efficiently in Numpy. And I'm not sure how to disable FMA only for the reference quantization functions. Maybe by putting them in their own file and using `-ffp-contract=off`?

Is platform-independent reproducible quantization worth it? I don't know. It's more complicated than I thought.
Maybe we should disable the auto-FMA contractions altogether in the CPU code (see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract).
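To see the contraction effect from the Python side, here is a small illustrative script (not from the PR) that counts inputs where the two roundings disagree; the inverse scale is an arbitrary value chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 20, dtype=np.float32)
inv_d = np.float32(0.753)  # an arbitrary Q4_0-style inverse scale

# Non-FMA behaviour: the float32 multiply and the float32 add each round.
two_roundings = np.trunc(x * inv_d + np.float32(8.5))
# FMA-like behaviour: the float64 intermediate keeps the product exact,
# so the sum is rounded only once before truncation.
one_rounding = np.trunc(x.astype(np.float64) * np.float64(inv_d) + 8.5)

print("elements that round differently:", int((two_roundings != one_rounding).sum()))
```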
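For reference, a sketch of how a test in the spirit of `gguf-py/tests/test_quants.py` can call the reference C implementation through `ctypes`. The shared-library path is an assumption and depends on how `ggml` was built; the function name and signature are those of `quantize_row_q4_0_ref` from `ggml-quants.c`.

```python
import ctypes
import numpy as np

# Assumed path to a shared libggml built with -DBUILD_SHARED_LIBS=ON.
libggml = ctypes.CDLL("build/ggml/src/libggml.so")
libggml.quantize_row_q4_0_ref.restype = None
libggml.quantize_row_q4_0_ref.argtypes = (
    ctypes.c_void_p,  # const float * x
    ctypes.c_void_p,  # block_q4_0 * y
    ctypes.c_int64,   # int64_t k
)

x = np.random.random(256).astype(np.float32)
# Q4_0 uses 18 bytes per block of 32 values (2-byte fp16 scale + 16 packed bytes).
out = np.zeros(256 // 32 * 18, dtype=np.uint8)
libggml.quantize_row_q4_0_ref(
    x.ctypes.data_as(ctypes.c_void_p),
    out.ctypes.data_as(ctypes.c_void_p),
    ctypes.c_int64(x.size),
)
# `out` can then be compared bit-for-bit with the Numpy implementation's output.
```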