CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K #417
Merged

This PR follows in the footsteps of #374 and is the next step towards a complete implementation of quantized matrix multiplications (a.k.a. MMQ) for the `IQX_K` quants. We get a performance improvement in the range of 15% compared to the existing implementation, which dequantizes to `fp16` and then uses cuBLAS to perform the matrix multiplications. Another benefit is avoiding the numerical issues observed for DeepSeek models when using `fp16` arithmetic (see #261). It also potentially reduces the CUDA compute buffer size because the intermediate buffer for the dequantized tensor is not required.

I have reused the existing matrix multiplication kernels, providing only the unpacking of the quantized data into the tiles used in the kernels (a minimal sketch of this step is given at the end of this description). As such, performance is largely determined by the kernel (blocks of 16 or blocks of 32) and by the unpacking cost (converting the packed data into `int8_t` values ready for matrix multiplications). This is best illustrated with the following graph. The model is LLaMA-3.1-8B, the GPU is an RTX-4080, and all quantizations are done with `--output-tensor-type q6_K --pure`.

`Q4_0` is the fastest (black circles); it uses a "type-0" kernel for a block size of 32. Next is `IQ4_KS` (red circles), which uses the same kernel as `Q4_0`; the ~10% lower performance is due to its higher unpacking cost. Next is `Q3_K` (green circles), which has a low unpacking cost (at least compared to the `IQX_K` quants) but uses the kernel for a block size of 16; we see a ~30% drop in performance compared to `Q4_0` because of that. Then come `IQ4_K` (blue circles), `IQ5_K` (magenta circles) and `IQ6_K` (cyan circles) in this PR. They all use the kernel for block size 16, but are ~7-9% slower than `Q3_K` due to their higher unpacking cost. `IQ4_K`, `IQ5_K` and `IQ6_K` on the main branch are shown with squares in the corresponding colors to illustrate the performance gain in this PR.

The matrix multiplication kernels are inherited from mainline llama.cpp. Based on the graph, it would make sense to try to optimize two aspects of these kernels:
* `Q4_0` receives a huge amount of attention in llama.cpp, so the block size 32 kernel was most likely optimized for it.
* `Q4_0` is a very simple quant, so its unpacking cost is (nearly) negligible. When the unpacking cost is high, it makes sense to reuse a tile more times to amortize that cost (see the second sketch below). This is what I have done in the CPU implementation, where most quantization types are on par with `Q4_0` (or even outperform it).

Such efforts are left for a future PR.
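For readers not familiar with the MMQ code path, here is a minimal, hypothetical CUDA sketch of what "unpacking into `int8_t` tile values" means for a 4-bit non-linear quant. The function name, the value grid and the data layout are illustrative assumptions only; the actual `IQ4_K`/`IQ5_K`/`IQ6_K` tile loaders additionally handle block and sub-block scales, the real value grids, and the shared-memory tile layout of the MMQ kernels.

```cuda
// Illustrative only: unpack 32 packed 4-bit indices (16 bytes) into 32 int8_t
// values via a non-linear lookup table, the way an MMQ tile loader prepares data
// before the shared matrix-multiplication kernel consumes the tile. The names,
// the value grid and the layout are hypothetical, not the actual IQ4_K code.
#include <cstdint>

__constant__ int8_t kvalues_sketch[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

__device__ void unpack_nibbles_to_int8(const uint8_t * packed, int8_t * dst) {
    #pragma unroll
    for (int i = 0; i < 16; ++i) {
        const uint8_t b = packed[i];
        dst[i +  0] = kvalues_sketch[b & 0x0F]; // low nibble  -> first 16 values
        dst[i + 16] = kvalues_sketch[b >>   4]; // high nibble -> last 16 values
    }
}
```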
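And a second sketch, for the tile-reuse point above: the cost of unpacking a tile of the quantized operand is paid once and then amortized over every output column the thread block computes. This is a toy row-major GEMM with a hypothetical 4-bit layout (two values per byte, one scale per row), not the kernels in this PR or the CPU implementation; the real MMQ kernels also reuse tiles of both operands and accumulate with int8 dot products rather than float FMAs.

```cuda
// Toy illustration of amortizing the unpacking cost (not the kernels in this PR):
// each block unpacks one K-tile of the quantized A operand into shared memory once,
// then reuses it for all TILE_N output columns it computes, so the unpacking cost
// per tile is divided by TILE_N.
// Assumes K is a multiple of TILE_K; launch with grid(ceil(N/TILE_N), M), block(TILE_N).
#include <cstdint>

#define TILE_K 32  // A values unpacked per iteration
#define TILE_N 64  // output columns per block (= threads per block)

__global__ void gemm_q4_sketch(const uint8_t * __restrict__ A4,   // M x K/2 packed nibbles
                               const float   * __restrict__ Ascl, // M per-row scales
                               const float   * __restrict__ B,    // K x N, row-major
                               float         * __restrict__ C,    // M x N, row-major
                               int N, int K) {
    __shared__ int8_t a_tile[TILE_K];                  // unpacked once, reused TILE_N times

    const int row = blockIdx.y;                        // one A/C row per block row
    const int col = blockIdx.x * TILE_N + threadIdx.x; // one C column per thread

    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // Unpack: done once per K-tile by the first TILE_K/2 threads.
        if (threadIdx.x < TILE_K/2) {
            const uint8_t b = A4[row*(K/2) + k0/2 + threadIdx.x];
            a_tile[2*threadIdx.x + 0] = (int8_t)(b & 0x0F) - 8; // low nibble
            a_tile[2*threadIdx.x + 1] = (int8_t)(b >>   4) - 8; // high nibble
        }
        __syncthreads();

        // Reuse: every thread (i.e. every output column) consumes the same tile.
        if (col < N) {
            for (int k = 0; k < TILE_K; ++k) {
                acc += (float)a_tile[k] * B[(k0 + k)*N + col];
            }
        }
        __syncthreads();
    }

    if (col < N) {
        C[row*N + col] = Ascl[row] * acc;
    }
}
```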