
Conversation

@ikawrakow
Owner

This PR adds Q8_K_R8, an 8-row interleaved version of Q8_K. With that, we break the world record in prompt processing speed. Here is what we get for PP-512 with LLaMA-3.1-8B on Zen4 (Ryzen-7950X), AVX2 (Ryzen-5975WX) and ARM_NEON (M2-Max):

| Platform | PP-512 t/s (Q8_0_R4) | PP-512 t/s (Q8_K_R8) | Speedup |
|----------|----------------------|----------------------|---------|
| ARM_NEON | 128.29 ± 1.50 | 172.52 ± 4.17 | 1.345 |
| Zen4 | 268.98 ± 0.31 | 368.85 ± 0.73 | 1.371 |
| AVX2 | 234.40 ± 0.60 | 293.72 ± 0.34 | 1.253 |

On the Ryzen-7950X, which provides native bf16 support, this is nearly 60% faster than bf16. On the M2-Max, which has native fp16 support, Q8_K_R8 is 87% faster than fp16!

Note on AVX2: In the AVX2 implementation one needs to use the _mm256_maddubs_epi16(x, y) instruction, where x holds unsigned 8-bit integers and y holds signed 8-bit integers. In the initial implementation I forgot for the 177'th time that the unsigned integers still need to be within 0...127, else adding up two adjacent products (as the instruction does) may overflow the int16_t range (and the result gets silently saturated if it does), so I was making the Q8_K_R8 quants unsigned (simply xor with 0x80). That implementation resulted in 354 t/s on the Ryzen-5975WX. Sadly, to stay safe one needs to "unsign" the Q8_K_R8 quants with _mm256_sign_epi8(x, x), and then apply their signs to the activation quants before taking the dot product. This is quite costly, and AVX2 performance drops to 293 t/s. Being curious about the effect the int16_t overflow might have, I computed LLaMA-3.1-8B-Instruct perplexity (context of 512 tokens) with the original and with the correct implementation: I get PPL = 7.3725 with the overflowing variant and PPL = 7.3443 with the correct implementation, i.e., the effect is small but noticeable.
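
For illustration, here is a minimal sketch (not the actual PR code; the function name is made up) of the overflow-safe AVX2 pattern described above: take the absolute value of the Q8_K_R8 quants with _mm256_sign_epi8 and transfer their signs onto the activation quants, so the unsigned operand of _mm256_maddubs_epi16 stays within 0...127.

```c
#include <immintrin.h>

// Sketch of the overflow-safe AVX2 dot-product step (illustrative only).
// q: 32 signed Q8_K_R8 quants, assumed within -127...127
// y: 32 signed activation quants
// Returns 8 int32 lanes with pairwise-accumulated products q*y.
static inline __m256i dot_q8_pairs(__m256i q, __m256i y) {
    // |q| <= 127, so two adjacent int16 products cannot overflow.
    const __m256i q_abs    = _mm256_sign_epi8(q, q);
    // Move the sign of q onto y so that q_abs * y_signed == q * y.
    const __m256i y_signed = _mm256_sign_epi8(y, q);
    // Unsigned (0...127) times signed (-128...127), adjacent pairs summed into int16.
    const __m256i prod16   = _mm256_maddubs_epi16(q_abs, y_signed);
    // Widen the int16 pair sums to int32.
    return _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
}
```

The two extra _mm256_sign_epi8 calls in the inner loop are what account for the drop from 354 t/s to 293 t/s mentioned above.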

Iwan Kawrakow added 5 commits December 13, 2024 18:21
We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!
I was worried that we don't have enough vector registers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.
We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have for Q8_0_R4.
Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
  stay within the signed int range and use _mm256_sign_epi8.
  Not yet tested on the AVX2 computer, but I expect a major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorq_u8()
  needed to convert from unsigned to signed seems to be extremely
  slow on the M2-Max (the xor trick is sketched below).
* We only lose ~0.5% in performance on Zen4 (there the exclusive
  or that we now use to convert from signed to unsigned seems to be
  much faster than on M2-Max).
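
For reference, a tiny sketch (a hypothetical helper, not the PR code) of the xor-with-0x80 re-biasing the commit notes refer to: flipping the sign bit maps a signed int8 value s to the unsigned value s + 128, and the same xor maps it back.

```c
#include <arm_neon.h>

// Illustrative only: re-bias 16 unsigned bytes (0...255) to signed bytes
// (-128...127) by flipping the sign bit; u ^ 0x80 == (int8_t)(u - 128).
static inline int8x16_t u8_to_s8(uint8x16_t u) {
    return vreinterpretq_s8_u8(veorq_u8(u, vdupq_n_u8(0x80)));
}
```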