
Conversation

@ikawrakow
Owner

This PR adds Q8_K_R8, an 8-row interleaved version of Q8_K. With that, we break the world record in prompt processing speed. Here is what we get for PP-512 with LLaMA-3.1-8B on Zen4 (Ryzen-7950X), AVX2 (Ryzen-5975WX) and ARM_NEON (M2-Max):

| Platform | PP-512 t/s (Q8_0_R4) | PP-512 t/s (Q8_K_R8) | Speedup |
|----------|----------------------|----------------------|---------|
| ARM_NEON | 128.29 ± 1.50 | 172.52 ± 4.17 | 1.345 |
| Zen4 | 268.98 ± 0.31 | 368.85 ± 0.73 | 1.371 |
| AVX2 | 234.40 ± 0.60 | 293.72 ± 0.34 | 1.253 |

On the Ryzen-7950X, which provides native bf16 support, this is nearly 60% faster than bf16. On the M2-Max, which has native fp16 support, Q8_K_R8 is 87% faster than fp16!

Note on AVX2: In the AVX2 implementation one needs to use the _mm256_maddubs_epi16(x, y) instruction, where x holds unsigned 8-bit integers and y holds signed 8-bit integers. In the initial implementation I forgot for the 177'th time that the unsigned integers still need to be within 0...127, else adding up two adjacent products (as the instruction does) may overflow the int16_t range (and the result gets silently saturated if it does), so I was making the Q8_K_R8 quants unsigned (simply xor with 0x80). That implementation resulted in 354 t/s on the Ryzen-5975WX. Sadly, to stay safe one needs to "unsign" the Q8_K_R8 quants with _mm256_sign_epi8(x, x), and then apply their signs to the activation quants before taking the dot product. This is quite costly, and AVX2 performance drops to 293 t/s. Being curious about the effect the int16_t overflow might have, I computed LLaMA-3.1-8B-Instruct perplexity (context of 512 tokens) with the original and with the correct implementation: I get PPL = 7.3725 with the overflowing variant and PPL = 7.3443 with the correct implementation, i.e., the effect is small but noticeable.
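
For illustration, here is a minimal sketch (not the actual PR code; the function name is made up) of the overflow-safe AVX2 pattern described above: take the absolute value of the Q8_K_R8 quants with _mm256_sign_epi8 and transfer their signs onto the activation quants, so the unsigned operand of _mm256_maddubs_epi16 stays within 0...127.

```c
#include <immintrin.h>

// Sketch of the overflow-safe AVX2 dot-product step (illustrative only).
// q: 32 signed Q8_K_R8 quants, assumed within -127...127
// y: 32 signed activation quants
// Returns 8 int32 lanes with pairwise-accumulated products q*y.
static inline __m256i dot_q8_pairs(__m256i q, __m256i y) {
    // |q| <= 127, so two adjacent int16 products cannot overflow.
    const __m256i q_abs    = _mm256_sign_epi8(q, q);
    // Move the sign of q onto y so that q_abs * y_signed == q * y.
    const __m256i y_signed = _mm256_sign_epi8(y, q);
    // Unsigned (0...127) times signed (-128...127), adjacent pairs summed into int16.
    const __m256i prod16   = _mm256_maddubs_epi16(q_abs, y_signed);
    // Widen the int16 pair sums to int32.
    return _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
}
```

The two extra _mm256_sign_epi8 calls in the inner loop are what account for the drop from 354 t/s to 293 t/s mentioned above.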

Iwan Kawrakow added 5 commits December 13, 2024 18:21
We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!
I was worried that we don't have enough vector registers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.
We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have for Q8_0_R4.
Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
  stay within the signed int range and use _mm256_sign_epi8.
  Not yet tested on the AVX2 computer, but I expect a major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorq_u8()
  needed to convert from unsigned to signed seems to be extremely
  slow on the M2-Max (the xor trick is sketched below).
* We only lose ~0.5% in performance on Zen4 (there the exclusive
  or that we now use to convert from signed to unsigned seems to be
  much faster than on M2-Max).
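
For reference, a tiny sketch (a hypothetical helper, not the PR code) of the xor-with-0x80 re-biasing the commit notes refer to: flipping the sign bit maps a signed int8 value s to the unsigned value s + 128, and the same xor maps it back.

```c
#include <arm_neon.h>

// Illustrative only: re-bias 16 unsigned bytes (0...255) to signed bytes
// (-128...127) by flipping the sign bit; u ^ 0x80 == (int8_t)(u - 128).
static inline int8x16_t u8_to_s8(uint8x16_t u) {
    return vreinterpretq_s8_u8(veorq_u8(u, vdupq_n_u8(0x80)));
}
```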