@ikawrakow
Owner

I have been putting all matrix multiplication (GEMM) and flash attention (FA) kernels into iqk_mul_mat.cpp. Over time it became a giant source file (~18 kLOC) containing heavily templated C++ code. The result: extremely long compilation times (over 2 minutes on a high-end CPU, with some users reporting 30 minutes on an Android phone).

This PR splits iqk_mul_mat.cpp into multiple files:

  • iqk/iqk_gemm_floats.cpp - contains GEMM kernels operating on float tensors
  • iqk/iqk_gemm_1bit.cpp - contains GEMM kernels for BitNet and IQ1_S, IQ1_M (along with repacked variants)
  • iqk/iqk_gemm_kquants.cpp - contains GEMM kernels for k-quants and repacked k-quants
  • iqk/iqk_gemm_iquants.cpp - contains GEMM kernels for i-quants and repacked i-quants
  • iqk/iqk_gemm_iqk_quants.cpp - GEMM kernels for IQX_K and repacked
  • iqk/iqk_gemm_legacy_quants.cpp - GEMM kernels for legacy quants (Q4_0, etc.) and repacked
  • iqk/iqk_mul_mat.cpp now contains just the GEMM business logic and compiles very fast
  • iqk/fa/iqk_fa_templates.h - FA templates that get included in the FA *.cpp files
  • iqk/fa/iqk_fa_*_*.cpp - FA template instantiations for specific combinations of K and V attention head sizes

With this, a fresh build of the iqk folder (with files compiled in parallel) takes

  • ~17 seconds on a Ryzen-7950X (Zen4)
  • ~15 seconds on a Ryzen-5975WX (AVX2)
  • ~13 seconds on a M2-Max (ARM_NEON)

The Zen4 build takes longer because it includes additional bf16 kernels, which the other two platforms do not support natively.
The GEMM files compile in 5-6 seconds each, so the FA instantiations dominate the build time. One could split them further, but for now I can live with compile times in the range of 15 seconds.
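
The split between a shared template header and per-head-size instantiation files can be illustrated with explicit template instantiation. The following is only a minimal sketch under stated assumptions, not the code in this PR: `fa_sketch` and `fa_dispatch` are hypothetical names, the kernel body is a trivial stand-in rather than real flash attention, and the three "files" are shown as sections of one translation unit so the example stays self-contained.

```cpp
#include <cassert>

// ---- Section 1: plays the role of a shared header such as iqk/fa/iqk_fa_templates.h ----
// Hypothetical stand-in kernel: a dot product over Dk elements, then a scaled copy of
// Dv elements. The real FA kernels are far heavier; this only gives the template a body.
template <int Dk, int Dv>
float fa_sketch(const float* q, const float* k, const float* v, float* out) {
    float s = 0.0f;
    for (int i = 0; i < Dk; ++i) s += q[i] * k[i];
    for (int j = 0; j < Dv; ++j) out[j] = s * v[j];
    return s;
}

// ---- Section 2: plays the role of one .cpp per head-size pair (e.g. iqk_fa_64_64.cpp) ----
// An explicit instantiation definition makes exactly this file pay the compile cost
// for the given (Dk, Dv) combination.
template float fa_sketch<64, 64>(const float*, const float*, const float*, float*);
template float fa_sketch<128, 128>(const float*, const float*, const float*, float*);

// ---- Section 3: plays the role of the cheap dispatcher translation unit ----
// In a real multi-file layout this file would see only declarations, so it compiles
// quickly and links against the instantiations from Section 2.
bool fa_dispatch(int dk, int dv, const float* q, const float* k, const float* v, float* out) {
    assert(q && k && v && out);
    if (dk == 64 && dv == 64)   { fa_sketch<64, 64>(q, k, v, out);   return true; }
    if (dk == 128 && dv == 128) { fa_sketch<128, 128>(q, k, v, out); return true; }
    return false;  // head-size combination was not instantiated
}
```

Because each instantiation file is an independent translation unit, the build system can compile them in parallel, which is consistent with the FA files dominating the wall-clock time once the GEMM files are down to a few seconds each.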

It is a massive change. Testing of all types (50+ when row-interleaved quants are included) on AVX2, Zen4 and ARM_NEON took quite some time. I hope to have covered all possible combinations, but still would appreciate additional testing from people using ik_llama.cpp for CPU-only inference.

Closes #183

Iwan Kawrakow added 30 commits May 17, 2025 12:31
If it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.

Compilation time drops to 14 seconds on the Ryzen-5975WX
@saood06
Collaborator

saood06 commented May 20, 2025

I hope to have covered all possible combinations, but still would appreciate additional testing from people using ik_llama.cpp for CPU-only inference.

Testing the build time: ~7 minutes compared to ~18 minutes before on my dual-socket Xeon E5-2690 v3. It used more threads, but for much of the build it was still nowhere near saturating the ones I have available. It may have a lower peak memory footprint, but I will have to measure that more carefully to tell.

Tested with my standard cmake .. -DGGML_RPC=ON -DGGML_IQK_FA_ALL_QUANTS=1; cmake --build . --config Release -j 48

@ikawrakow
Owner Author

It used more threads but still nowhere near saturating my available ones for a large amount of the time.

It cannot saturate your 48 cores. It needs to build libggml.so first, and this is what it takes to do that:

[  2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[  3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[  4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[  4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
[  7%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
[  8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
[  8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
[  8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
[ 13%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o

With all quants enabled for FA the above takes 36 seconds on my AVX2 box.
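
The difference between the default build and the all-quants build comes from compiling extra FA kernels only when a macro is set. Below is a minimal sketch of that kind of compile-time gating. `IQK_FA_ALL_QUANTS` is the macro named in the commit notes, and it is presumably what the `-DGGML_IQK_FA_ALL_QUANTS=1` CMake option ends up defining. Everything else is hypothetical: the enum, the function name, and the choice of F16 and Q8_0 as always-enabled types are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical list of KV-cache types; the real project has many more.
enum class KvType { F16, Q8_0, Q4_0, Q8_KV };

// Records at compile time whether the optional kernels were requested.
#ifdef IQK_FA_ALL_QUANTS
constexpr bool kAllQuants = true;
#else
constexpr bool kAllQuants = false;
#endif

// Returns true if an FA kernel for this cache type was compiled in.
// Types behind the macro cost nothing at compile time unless it is defined.
inline bool fa_kernel_available(KvType t) {
    switch (t) {
        case KvType::F16:
        case KvType::Q8_0:
            return true;   // assumed to be always built
#ifdef IQK_FA_ALL_QUANTS
        case KvType::Q4_0:
        case KvType::Q8_KV:
            return true;   // only built when all quants are requested
#endif
        default:
            return false;  // kernel not compiled into this build
    }
}

// Small self-check that the gating matches the build configuration.
inline void check_gating() {
    assert(fa_kernel_available(KvType::F16));
    assert(fa_kernel_available(KvType::Q4_0) == kAllQuants);
}
```

Compiling this with `-DIQK_FA_ALL_QUANTS` flips the gated cases on; without it, callers would need to fall back to another path for those cache types.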

Compiling llama.cpp is another piece that takes quite some time, so it should get refactored as well.

@saood06
Collaborator

saood06 commented May 20, 2025

It cannot saturate your 48 cores. It needs to build libggml.so first, and this is what it takes to do that:

I know, and I'm not expecting it to, but overall usage was still much higher. (I use this machine for a lot of cross-compiling and builds of other software, so I understand what the cmake output means, and I was monitoring it alongside btop.)

Compiling llama.cpp is another piece that takes quite some time, so it should get refactored as well.

That piece is fast enough on my machine; iqk_mul_mat.cpp accounted for the majority of the time spent before.

Thank you for this, it is a very welcome speed improvement.

@ikawrakow ikawrakow mentioned this pull request May 21, 2025
4 tasks
@ikawrakow ikawrakow merged commit b94cd3b into main May 22, 2025
@cmoncure

This commit results in a significant performance regression for me, established by git bisect.

My TG drops by about 30% on DeepSeek.

b94cd3b is the first bad commit
commit b94cd3b (HEAD)
Author: Kawrakow [email protected]
Date: Thu May 22 10:05:51 2025 +0300

Refactor iqk_mul_mat.cpp (#435)

@ikawrakow
Owner Author

This commit results in a significant performance regression for me, established by git bisect.

Please file an issue with all the relevant details.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request May 24, 2025
* Refactor iqk: WIP

* Refactor iqk: Factor out float GEMM (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4

* Refactor iqk: Factor out GEMM for repacked legacy quants

* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV

* Refactor iqk: Factor out GEMM for repacked i-quants

* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512

* Refactor iqk: factor out 1-bit quants (NEON)

* Refactor iqk: factor out k-quants (NEON)

* Refactor iqk: factor out floats (NEON)

* Also iq4_xs belongs to k-quants

* Refactor iqk: factor out iqk quants (NEON)

* Refactor iqk: factor out legacy quants (NEON)

* Refactor iqk: factor out repacked legacy quants (NEON)

* Refactor iqk: factor out repacked k-quants (NEON)

* Refactor iqk: factor out repacked iqk quants (NEON)

* Refactor iqk: GEMM kernels are refactored on NEON

* Refactor iqk: FA compiles

If it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X

* Refactor iqk: FA refactored (Zen4)

Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.

* Adding forgotten file

* Most helpers don't need to be templates

Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.

Compilation time drops to 14 seconds on the Ryzen-5975WX

* Fix bf16

* Refactor iqk: FA refactored (NEON)

* Forgotten MMQ ref and typo (ikawrakow#431)

* Adding forgotten iq5_k_r4

* Fix iq4_k_r4 on NEON

* Fix iq4_ks on NEON

It was broken before the refactoring (the shifts were not correctly
applied).

* Fix q8_0 on NEON

* Fix q6_0 K cache

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Nexes the Elder <[email protected]>

Development

Successfully merging this pull request may close these issues.

Refactor: iqk_mul_mat

5 participants