Skip to content

Conversation

@ikawrakow
Copy link
Owner

Adding IQ5_K with 4 interleaved rows.

We get very signifiant performance gains on ARM_NEON and more modest gains on AVX2/Zen4.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads IQ5_K IQ5_K_R4 Speedup
ARM_NEON 8 53.80 ± 1.08 93.33 ± 2.02 1.735
Zen4 16 168.09 ± 0.58 230.23 ± 0.23 1.370
AVX2 32 177.16 ± 0.31 231.50 ± 0.43 1.307

TG does not look good on AVX2/Zen4. On ARM_NEON we get a decent performance gain.
Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads IQ5_K IQ5_K_R4 Speedup
ARM_NEON 2 5.92 ± 0.07 6.98 ± 0.00 1.179
4 11.53 ± 0.01 13.35 ± 0.01 1.158
8 20.29 ± 0.46 21.17 ± 0.18 1.043

@ikawrakow ikawrakow merged commit 59d742b into main Dec 18, 2024
@saood06
Copy link
Collaborator

saood06 commented Mar 27, 2025

TG does not look good on AVX2/Zen4

Does this mean regression compared to non-interleaved or just no benefit?

@ikawrakow
Copy link
Owner Author

I don't remember. But the "Better Zen4" commit in the PR says "But TG is still slower than iq5_k".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants