@ikawrakow

On to R4 implementation of the new iqk quants.

First IQ4_K

We get very significant performance gains on ARM_NEON and more modest gains on AVX2/Zen4. I suspect my AVX2/Zen4 implementation is not optimal, but I did not see a better way for now.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
| --- | --- | --- | --- | --- |
| ARM_NEON | 8 | 58.20 ± 1.03 | 108.02 ± 1.10 | 1.856 |
| Zen4 | 16 | 182.20 ± 0.38 | 232.63 ± 0.39 | 1.277 |
| AVX2 | 32 | 206.43 ± 0.49 | 227.60 ± 0.46 | 1.103 |
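
For reference, the Speedup column is just the ratio of the two throughput figures (R4 t/s divided by baseline t/s). A minimal sketch that recomputes it from the PP-512 numbers above; the names `pp512` and `speedup` are made up for illustration and are not part of the repository:

```python
# PP-512 throughput in tokens/s, copied from the table above: (IQ4_K, IQ4_K_R4).
# The dict and function names are illustrative only.
pp512 = {
    "ARM_NEON": (58.20, 108.02),
    "Zen4": (182.20, 232.63),
    "AVX2": (206.43, 227.60),
}


def speedup(baseline: float, interleaved: float) -> float:
    """Ratio of the R4 throughput to the baseline throughput."""
    return interleaved / baseline


# Round to three decimals to match the precision used in the table.
speedups = {platform: round(speedup(*tps), 3) for platform, tps in pp512.items()}

for platform, s in speedups.items():
    print(f"{platform:9s} {s:.3f}")
```

The error bars are ignored here; the ratio uses the mean values only.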

We get decent performance gains for TG as well.
Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| --- | --- | --- | --- | --- |
| ARM_NEON | 2 | 8.44 ± 0.02 | 10.56 ± 0.01 | 1.251 |
| | 4 | 15.90 ± 0.05 | 19.32 ± 0.14 | 1.215 |
| | 8 | 24.54 ± 0.15 | 25.16 ± 0.03 | 1.025 |
| Zen4 | 1 | 5.26 ± 0.00 | 6.73 ± 0.00 | 1.279 |
| | 2 | 9.71 ± 0.01 | 12.43 ± 0.00 | 1.269 |
| | 4 | 13.48 ± 0.06 | 14.00 ± 0.03 | 1.039 |
| AVX2 | 2 | 4.02 ± 0.00 | 6.91 ± 0.00 | 1.719 |
| | 4 | 8.03 ± 0.00 | 11.13 ± 0.00 | 1.386 |
| | 8 | 11.81 ± 0.00 | 12.75 ± 0.00 | 1.079 |

Iwan Kawrakow added 4 commits December 12, 2024 11:00
On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s
for iq4_k. Applying the extra shift costs a ~6% performance penalty.
PP-512 = 227.60 t/s. The shifts are really costly.
We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.
@ikawrakow ikawrakow merged commit 2700d3a into main Dec 12, 2024