Refactor iqk_mul_mat.cpp #435

ikawrakow · 2025-05-20T07:00:15Z

I have been putting all matrix multiplication (GEMM) and flash attention (FA) kernels into iqk_mul_mat.cpp. Over time it became a giant source file (~18 kLOC) containing heavily templated C++ code. The result: extremely long compilation times (over 2 minutes on a high-end CPU, with some users reporting 30 minutes on an Android phone).

This PR splits iqk_mul_mat.cpp into multiple files:

iqk/iqk_gemm_floats.cpp - contains GEMM kernels operating on float tensors
iqk/iqk_gemm_1bit.cpp - contains GEMM kernels for BitNet and IQ1_S, IQ1_M (along with repacked variants)
iqk/iqk_gemm_kquants.cpp - contains GEMM kernels for k-quants and repacked k-quants
iqk/iqk_gemm_iquants.cpp - contains GEMM kernels for i-quants and repacked i-quants
iqk/iqk_gemm_iqk_quants.cpp - GEMM kernels for IQX_K and repacked
iqk/iqk_gemm_legacy_quants.cpp - GEMM kernels for legacy quants (Q4_0, etc.) and repacked
iqk/iqk_mul_mat.cpp now contains just the GEMM business logic and compiles very fast
iqk/fa/iqk_fa_templates.h - FA templates that get included in the FA *.cpp files
iqk/fa/iqk_fa_*_*.cpp - FA template instantiations for specific combinations of K and V attention head sizes
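
The split into a shared template header plus one instantiation file per head-size combination can be pictured with the minimal sketch below. It is not the actual code from this PR: `fa_scratch_floats` and `scratch_for_head_sizes` are hypothetical stand-ins, and everything is shown in one file, with comments marking where each part would live in a split build. The idea is that `extern template` declarations in the header stop every includer from instantiating the kernels implicitly, while each small .cpp file provides exactly one explicit instantiation, so the files can compile in parallel.

```cpp
// ---- would live in a shared header (compare iqk/fa/iqk_fa_templates.h) ----
// Hypothetical stand-in for a heavily templated FA kernel, parameterised on
// the K and V attention head sizes.
template <int Dk, int Dv>
int fa_scratch_floats(int n_queries) {
    return n_queries * (Dk + Dv);
}

// Explicit instantiation declarations: files that include the header will not
// instantiate these combinations implicitly.
extern template int fa_scratch_floats<128, 128>(int);
extern template int fa_scratch_floats<576, 512>(int);

// ---- would live in its own .cpp (compare iqk/fa/iqk_fa_128_128.cpp) ----
template int fa_scratch_floats<128, 128>(int);

// ---- would live in another .cpp (compare iqk/fa/iqk_fa_576_512.cpp) ----
template int fa_scratch_floats<576, 512>(int);

// ---- would live in the dispatch file, which stays cheap to compile ----
// Maps runtime head sizes to a pre-instantiated kernel; -1 means unsupported.
int scratch_for_head_sizes(int dk, int dv, int n_queries) {
    if (dk == 128 && dv == 128) return fa_scratch_floats<128, 128>(n_queries);
    if (dk == 576 && dv == 512) return fa_scratch_floats<576, 512>(n_queries);
    return -1;
}
```

In this sketch, calling `scratch_for_head_sizes(128, 128, 4)` selects the 128/128 instantiation, and an unlisted combination such as 64/64 falls through to the unsupported path.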

With this, a fresh build of the iqk folder (with files compiled in parallel) takes

~17 seconds on a Ryzen-7950X (Zen4)
~15 seconds on a Ryzen-5975WX (AVX2)
~13 seconds on a M2-Max (ARM_NEON)

The Zen4 build takes longer because it includes additional bf16 kernels, which the other two platforms do not support natively.
The GEMM files compile in 5-6 seconds each, so the FA instantiations dominate the build time. One could split them further, but for now I can live with compile times in the range of 15 seconds.

It is a massive change. Testing of all types (50+ when row-interleaved quants are included) on AVX2, Zen4 and ARM_NEON took quite some time. I hope to have covered all possible combinations, but still would appreciate additional testing from people using ik_llama.cpp for CPU-only inference.

Closes #183


saood06 · 2025-05-20T07:20:58Z

> I hope to have covered all possible combinations, but still would appreciate additional testing from people using ik_llama.cpp for CPU-only inference.

Testing the build time: ~7 minutes compared to ~18 minutes before on my dual-socket Xeon E5-2690 v3. It used more threads but still nowhere near saturating my available ones for a large amount of the time. It may have a lower peak memory footprint, but I will have to measure that better to tell.

Tested with my standard cmake .. -DGGML_RPC=ON -DGGML_IQK_FA_ALL_QUANTS=1; cmake --build . --config Release -j 48

ikawrakow · 2025-05-20T07:30:51Z

> It used more threads but still nowhere near saturating my available ones for a large amount of the time.

It cannot saturate your 48 cores. It needs to build libggml.so first, and this is what it takes to do that:

[  2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[  3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[  4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[  4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
[  7%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
[  8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
[  8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
[  8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
[ 13%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o

With all quants enabled for FA the above takes 36 seconds on my AVX2 box.
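
Why enabling all quants for FA costs build time can be illustrated with a small sketch. The macro name `IQK_FA_ALL_QUANTS` comes from the commit messages in this PR; everything else here (the `KCacheType` enum, `fa_kernel_sketch`, `run_fa_sketch`, and the choice of which types are always compiled) is a hypothetical illustration, and how the `GGML_IQK_FA_ALL_QUANTS` CMake option maps onto the macro is assumed rather than shown in this thread. The point is only that a template instantiation hidden behind `#ifdef` is never generated unless the macro is defined.

```cpp
// Hypothetical set of K-cache types; the real list is larger.
enum class KCacheType { F16, Q8_0, Q4_0, Q8_KV };

// Stand-in for an expensive templated FA kernel.
template <KCacheType T>
int fa_kernel_sketch(int n) {
    return n + static_cast<int>(T);
}

// Dispatch: each case that survives preprocessing forces one instantiation.
// Returns -1 when no kernel for the requested type was compiled in.
int run_fa_sketch(KCacheType t, int n) {
    switch (t) {
        case KCacheType::F16:   return fa_kernel_sketch<KCacheType::F16>(n);
        case KCacheType::Q8_0:  return fa_kernel_sketch<KCacheType::Q8_0>(n);
#ifdef IQK_FA_ALL_QUANTS
        // Only compiled (and only instantiated) when the macro is defined.
        case KCacheType::Q4_0:  return fa_kernel_sketch<KCacheType::Q4_0>(n);
        case KCacheType::Q8_KV: return fa_kernel_sketch<KCacheType::Q8_KV>(n);
#endif
        default:                return -1;
    }
}
```

Without the macro, the sketch compiles two kernel instantiations instead of four, and a request for `Q4_0` reports the type as unsupported rather than failing to link.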

Compiling llama.cpp is another piece that takes quite some time, so it should get refactored as well.

saood06 · 2025-05-20T07:51:53Z

> It cannot saturate your 48 cores. It needs to build libggml.so first, and this is what it takes to do that:

I know and I'm not expecting it to, but it still did have a much higher usage overall. (I use this machine to do a lot of cross-compiling and builds of other software so I understand what the output of cmake means and I was monitoring it alongside btop).

> Compiling llama.cpp is another piece that takes quite some time, so it should get refactored as well.

That piece is fast enough on my machine; iqk_mul_mat.cpp accounted for the majority of the time spent before.

Thank you for this, it is a very welcome speed improvement.

cmoncure · 2025-05-22T18:23:28Z

This commit results in a significant performance regression for me, established by git bisect.

My TG drops by about 30% on DeepSeek.

b94cd3b is the first bad commit
commit b94cd3b (HEAD)
Author: Kawrakow [email protected]
Date: Thu May 22 10:05:51 2025 +0300

Refactor iqk_mul_mat.cpp (#435)

ikawrakow · 2025-05-23T05:09:34Z

> This commit results in a significant performance regression for me, established by git bisect.

Please file an issue with all the relevant details.

* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles
  If it works is a different story. Current compile time: 107.3 seconds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)
  Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates
  Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 seconds on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (ikawrakow#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON
  It was broken before the refactoring (the shifts were not correctly applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Nexes the Elder <[email protected]>

Iwan Kawrakow added 30 commits May 17, 2025 12:31

Refactor iqk: WIP

68b782e

Refactor iqk: Factor out float GEMM (AVX2/AVX512)

51a87cf

Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)

f83e64d

Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)

4ef94c2

Refactor iqk: fix AVX2

d355ff9

Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)

2cbbc55

Refactor iqk: fix AVX2

8dae13c

Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)

de5660c

Refactor iqk: fix AVX2

082a9bd

Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)

9b6e75c

Refactor iqk: fix AVX2

d66ec60

Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4

7868545

Refactor iqk: Factor out GEMM for repacked legacy quants

6cd3609

Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV

f501200

Refactor iqk: Factor out GEMM for repacked i-quants

0d96f3b

Refactor iqk: GEMM kernels are refactored on AVX2/AVX512

c63a0af

Refactor iqk: factor out 1-bit quants (NEON)

28b9480

Refactor iqk: factor out k-quants (NEON)

c805a19

Refactor iqk: factor out floats (NEON)

f4ab917

Also iq4_xs belongs to k-quants

3124136

Refactor iqk: factor out iqk quants (NEON)

465d717

Refactor iqk: factor out legacy quants (NEON)

bd1e4d4

Refactor iqk: factor out repacked legacy quants (NEON)

2b8a231

Refactor iqk: factor out repacked k-quants (NEON)

7e59d2b

Refactor iqk: factor out repacked iqk quants (NEON)

7aa2de6

Refactor iqk: GEMM kernels are refactored on NEON

4b4b4fd

Refactor iqk: FA compiles

131e5ac

If it works is a different story. Current compile time: 107.3 seconds on the Ryzen-7950X

Refactor iqk: FA refactored (Zen4)

630279c

Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before.

Adding forgotten file

fbfe79e

Most helpers don't need to be templates

9541631

Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 seconds on the Ryzen-5975WX
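
The general technique behind "Most helpers don't need to be templates" can be sketched as follows. This is an illustration of the idea only, not code from the commit: `row_max` and `scaled_max_sketch` are hypothetical names. A helper that does not depend on any template parameter is written as a plain function, so the compiler processes it once instead of once per kernel instantiation, and only the part that really uses the head size stays templated.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Plain helper: uses no template parameter, so it is compiled once no matter
// how many head-size combinations are instantiated.
inline float row_max(const float* x, std::size_t n) {
    float m = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] > m) m = x[i];
    }
    return m;
}

// Only the code that depends on the head size Dk remains a template.
// Returns the largest score scaled by 1/sqrt(Dk).
template <int Dk>
float scaled_max_sketch(const std::vector<float>& scores) {
    return row_max(scores.data(), scores.size()) / std::sqrt(static_cast<float>(Dk));
}
```

Had `row_max` been a member of a class template parameterised on the head sizes, each instantiation would have carried its own copy of it, which is the kind of redundant work this change removes.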

Iwan Kawrakow and others added 7 commits May 19, 2025 15:30

Fix bf16

9ae8f75

Refactor iqk: FA refactored (NEON)

65c8e86

Forgotten MMQ ref and typo (#431)

380ab3f

Adding forgotten iq5_k_r4

06efa17

Fix iq4_k_r4 on NEON

7090f17

Fix iq4_ks on NEON

4fdb50b

It was broken before the refactoring (the shifts were not correctly applied).

Fix q8_0 on NEON

5351ec0

Fix q6_0 K cache

0943331

ikawrakow mentioned this pull request May 21, 2025

Trellis quants with CPU inference #441

Merged

4 tasks

ikawrakow merged commit b94cd3b into main May 22, 2025

cmoncure mentioned this pull request May 23, 2025

Bug: Performance regression #450

Closed

ikawrakow mentioned this pull request May 31, 2025

Bug: Don't build ggml-aarch64 regardless of CPU arch type #472

Closed
