Q8_K_R8: Fastest quantized matrix multiplications #141
This PR adds `Q8_K_R8`, an 8-row interleaved version of `Q8_K`. With that, we break the world record in prompt processing speed. Here is what we get for PP-512 with LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `AVX2` (Ryzen-5975WX) and `ARM_NEON` (M2-Max):

On the Ryzen-7950X, which provides native `bf16` support, this is nearly 60% faster than `bf16`. On the M2-Max, which has native `fp16` support, `Q8_K_R8` is 87% faster than `fp16`!
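For illustration, here is a rough sketch of what an 8-row interleaved `Q8_K`-style block could look like. The struct name, the scale type, and the exact interleaving granularity are assumptions made for clarity; the actual definition in the repository may differ.

```c
#include <stdint.h>

#define QK_K 256  // weights per super-block, as in the existing Q8_K type

// Hypothetical 8-row interleaved block: one block packs the quants of 8
// consecutive rows for the same 256-column slice, so a single linear scan
// of memory feeds the SIMD kernel with data from all 8 rows at once.
typedef struct {
    float  d[8];           // per-row scale (assumed fp32 here)
    int8_t qs[8 * QK_K];   // 8 x 256 signed 8-bit quants, interleaved in
                           // small chunks (e.g. a few bytes per row at a
                           // time) so the bytes consumed together by the
                           // multiply-add are adjacent in memory
} block_q8_k_r8_sketch;
```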
Note on AVX2: In the `AVX2` implementation one needs to use the `_mm256_maddubs_epi16(x, y)` instruction, where `x` holds unsigned 8-bit integers and `y` holds signed 8-bit integers. In the initial implementation I forgot for the 177'th time that the unsigned integers still need to be within `0...127`, else adding up two adjacent products (as the instruction does) may overflow the `int16_t` range (and gets silently saturated if it does), so I was making the `Q8_K_R8` quants unsigned (simply `xor 0x80`). This implementation resulted in 354 t/s on the Ryzen-5975WX. Sadly, one needs to "unsign" the `Q8_K_R8` quants with `_mm256_sign_epi8(x, x)`, and then apply the sign to the activation quants before taking the dot product. This is quite costly, and `AVX2` performance drops to 293 t/s. Being curious about the effect the `int16_t` overflow might have, I computed LLaMA-3.1-8B-Instruct perplexity (context of 512 tokens) with the original and with the correct implementation. I get `PPL = 7.3725` with the overflowing variant and `PPL = 7.3443` with the correct implementation, i.e., the effect is small but noticeable.
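To make the fix concrete, here is a minimal sketch of the sign trick described above; the function and variable names are illustrative, not the PR's actual code. `x` holds 32 weight quants and `y` holds 32 activation quants, both signed 8-bit.

```c
#include <immintrin.h>

// Sketch: signed x signed 8-bit dot product built on _mm256_maddubs_epi16,
// which requires its first operand to be unsigned.
static inline __m256i dot_q8_32(__m256i x, __m256i y) {
    __m256i ax = _mm256_sign_epi8(x, x);   // |x| as unsigned bytes
    __m256i sy = _mm256_sign_epi8(y, x);   // move the sign of x onto y
    // u8 * s8 products, adjacent pairs summed into int16 lanes.
    // With quants in the usual -127..127 range (|x| <= 128 after abs),
    // 2 * 128 * 127 = 32512 stays within int16_t, so no saturation.
    __m256i p16 = _mm256_maddubs_epi16(ax, sy);
    // Widen pairs of int16 to int32 partial sums.
    return _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
}
```

The two extra `_mm256_sign_epi8` calls per 32 quants are the cost referred to above that brings the Ryzen-5975WX result down from 354 t/s to 293 t/s.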