Refactor iqk_mul_mat.cpp #435
If it works is a different story. Current compile time: 107.3 seconds on the Ryzen-7950X
Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before.
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 seconds on the Ryzen-5975WX
It was broken before the refactoring (the shifts were not correctly applied).
Testing the build time: ~7 min compared to ~18 minutes before on my dual socket Xeon E5-2690 v3. It used more threads but still nowhere near saturating my available ones for a large amount of the time. It may have a lower peak memory footprint but I will have to measure that better to tell. Tested with my standard
It cannot saturate your 48 cores. It needs to build With all quants enabled for FA the above takes 36 seconds on my Compiling
I know and I'm not expecting it to, but it still did have a much higher usage overall. (I use this machine to do a lot of cross-compiling and builds of other software so I understand what the output of cmake means and I was monitoring it alongside btop).
That piece is fast enough on my machine; iqk_mul_mat.cpp was the majority of the time spent before. Thank you for this, it is a very welcome speed improvement.
This commit results in a significant performance regression for me, established by git bisect. My TG drops by about 30% on DeepSeek. b94cd3b is the first bad commit
Please file an issue with all the relevant details.
* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles

  If it works is a different story. Current compile time: 107.3 seconds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)

  Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates

  Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 seconds on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (ikawrakow#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON

  It was broken before the refactoring (the shifts were not correctly applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Nexes the Elder <[email protected]>
I have been putting all matrix multiplication (GEMM) and flash attention (FA) kernels into `iqk_mul_mat.cpp`. With time it became a giant source file (~18 kLOC) containing heavily templated C++ code. The result: extremely long compilation times (over 2 minutes on a high-end CPU, with some users reporting 30 minutes on an Android phone).

This PR splits `iqk_mul_mat.cpp` into multiple files:

- `iqk/iqk_gemm_floats.cpp` - contains GEMM kernels operating on float tensors
- `iqk/iqk_gemm_1bit.cpp` - contains GEMM kernels for BitNet and `IQ1_S, IQ1_M` (along with repacked variants)
- `iqk/iqk_gemm_kquants.cpp` - contains GEMM kernels for k-quants and repacked k-quants
- `iqk/iqk_gemm_iquants.cpp` - contains GEMM kernels for i-quants and repacked i-quants
- `iqk/iqk_gemm_iqk_quants.cpp` - GEMM kernels for `IQX_K` and repacked
- `iqk/iqk_gemm_legacy_quants.cpp` - GEMM kernels for legacy quants (`Q4_0`, etc.) and repacked
- `iqk/iqk_mul_mat.cpp` now contains just the GEMM business logic and compiles very fast
- `iqk/fa/iqk_fa_templates.h` - FA templates that get included in the FA `*.cpp` files
- `iqk/fa/iqk_fa_*_*.cpp` - FA template instantiations for specific combinations of K and V attention head sizes

With this, a fresh build of the `iqk` folder (with files compiled in parallel) takes

The Zen4 build is longer because we have additional kernels for `bf16` not supported natively by the other two platforms.

The GEMM files compile in 5-6 seconds each, so the FA instantiations dominate the build time. One could split them further, but for now I can live with compile times in the range of 15 seconds.
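
To illustrate the general idea of keeping templates in a shared header while instantiating each K/V head-size combination in its own translation unit, here is a minimal single-file sketch. It is not code from this PR: `toy_fa_kernel` and `toy_fa_dispatch` are hypothetical names, the head sizes are chosen arbitrarily, and the "kernel" just sums its inputs. The comments mark which parts would go into a header and which into separate `.cpp` files, on the assumption that explicit template instantiation is used.

```cpp
// Sketch only: all names here are hypothetical and not taken from ik_llama.cpp.
#include <cassert>

// --- would live in a shared header (compare iqk/fa/iqk_fa_templates.h) ---
// Toy stand-in for an FA kernel, templated on the K and V head sizes.
template <int Dk, int Dv>
float toy_fa_kernel(const float* k, const float* v) {
    assert(k != nullptr && v != nullptr);
    float sum = 0.0f;
    for (int i = 0; i < Dk; ++i) sum += k[i];
    for (int i = 0; i < Dv; ++i) sum += v[i];
    return sum;
}

// Explicit instantiation declarations: a file that only sees these does not
// instantiate the template itself, so it stays cheap to compile.
extern template float toy_fa_kernel<64, 64>(const float*, const float*);
extern template float toy_fa_kernel<128, 128>(const float*, const float*);

// --- would live in its own .cpp file, one per head-size combination ---
template float toy_fa_kernel<64, 64>(const float*, const float*);

// --- would live in another .cpp file, compiled in parallel with the first ---
template float toy_fa_kernel<128, 128>(const float*, const float*);

// --- would live in the file holding the dispatch logic ---
// Maps runtime head sizes to a pre-instantiated kernel; returns false when
// no kernel exists for the requested combination.
bool toy_fa_dispatch(int dk, int dv, const float* k, const float* v, float* out) {
    if (dk == 64 && dv == 64) {
        *out = toy_fa_kernel<64, 64>(k, v);
        return true;
    }
    if (dk == 128 && dv == 128) {
        *out = toy_fa_kernel<128, 128>(k, v);
        return true;
    }
    return false;
}
```

Because each instantiation sits in its own file, the build system can compile them concurrently, and the total wall-clock time approaches that of the slowest single file rather than the sum of all of them.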
It is a massive change. Testing of all types (50+ when row-interleaved quants are included) on `AVX2, Zen4` and `ARM_NEON` took quite some time. I hope to have covered all possible combinations, but still would appreciate additional testing from people using `ik_llama.cpp` for CPU-only inference.

Closes #183