`IQX_K` quants offer better quantization quality than k- and i-quants for the same number of bits spent. But on CUDA they are slower for prompt processing (PP) because matrix multiplications are done via dequantize->cuBLAS, so I thought it is time to fix this.

This PR adds quantized matrix multiplications, also known as MMQ, for `IQ4_KS`.
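For readers not familiar with MMQ, here is a rough sketch of the difference between the two paths. This is *not* the actual `IQ4_KS` layout and not the CUDA kernels added by this PR; it is a simplified CPU-only illustration with a made-up 4-bit lookup table and a hypothetical `Block4` struct, just to show why doing the dot products directly on quantized data (with int8-quantized activations) avoids the dequantize->cuBLAS detour.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical, simplified 4-bit block: 32 weights, one fp32 scale, values
// taken from a small non-linear lookup table. The IQ4 family works roughly
// this way, but the real IQ4_KS layout is different.
struct Block4 {
    float   scale;
    uint8_t qs[16];   // 32 x 4-bit indices, two per byte
};

// Made-up lookup table, for illustration only.
static const int8_t kLut[16] = {-63, -40, -23, -10, 1, 13, 28, 47,
                                -56, -33, -18,  -5, 6, 18, 34, 56};

// Path 1 (main branch): dequantize the block to float, then multiply.
// On CUDA this corresponds to dequantize -> cuBLAS GEMM.
static float dot_dequant(const Block4 &b, const float *x) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        int q = (i & 1) ? (b.qs[i >> 1] >> 4) : (b.qs[i >> 1] & 0xF);
        acc += b.scale * kLut[q] * x[i];
    }
    return acc;
}

// Path 2 (MMQ idea): quantize the activations to int8 once, do the bulk of
// the work as integer multiply-adds (dp4a on CUDA), apply scales at the end.
static float dot_mmq(const Block4 &b, const int8_t *xq, float x_scale) {
    int32_t isum = 0;
    for (int i = 0; i < 32; ++i) {
        int q = (i & 1) ? (b.qs[i >> 1] >> 4) : (b.qs[i >> 1] & 0xF);
        isum += int32_t(kLut[q]) * xq[i];
    }
    return b.scale * x_scale * float(isum);
}

int main() {
    Block4 b{0.01f, {}};
    for (int i = 0; i < 16; ++i) b.qs[i] = uint8_t((i * 37) & 0xFF);

    std::vector<float>  x(32);
    std::vector<int8_t> xq(32);
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) { x[i] = 0.05f * (i - 16); amax = std::max(amax, std::fabs(x[i])); }
    float x_scale = amax / 127.0f;
    for (int i = 0; i < 32; ++i) xq[i] = (int8_t)std::lround(x[i] / x_scale);

    std::printf("dequantize path: %g\nMMQ path       : %g\n",
                dot_dequant(b, x.data()), dot_mmq(b, xq.data(), x_scale));
    return 0;
}
```

The two paths give (nearly) the same result; the point of MMQ is that the inner loop runs on the quantized data directly, which saves the extra dequantization pass over the weights for every matrix multiplication.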
The following graph shows PP performance as a function of the number of tokens in the KV cache `N_KV` for the main branch (black) and the PR (red). Model is LLaMA-3.1-8B-Instruct, GPU is RTX-4080. We see a very nice performance improvement in the range of 25%.

*(Graph: Main branch vs. PR)*
Are you wondering why PP performance for `N_KV = 0` is significantly lower? I did as well, so I checked `llama-sweep-bench`, the tool with which the data for this graph was generated. Warm-up is done via a single TG run. I checked that if I add another warm-up run with `n_ubatch` tokens, performance for `N_KV = 0` becomes higher than for `N_KV = 512`, as expected (see the sketch below). I guess I will submit a separate PR for that.

TG performance is not affected at all by this PR, so no graph for that.
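For illustration, here is a minimal sketch of the kind of extra warm-up run described above. The `warmup` function name is made up, and the exact API signatures are only indicative (for example, `llama_batch_get_one` lost its position/sequence arguments in newer llama.cpp versions); the actual change will go into a separate PR.

```cpp
#include <vector>
#include "llama.h"

// Hypothetical warm-up helper for llama-sweep-bench.
static void warmup(llama_context * ctx, const llama_model * model, int n_ubatch) {
    std::vector<llama_token> tokens(n_ubatch, llama_token_bos(model));

    // Current warm-up: a single-token, TG-style decode.
    llama_decode(ctx, llama_batch_get_one(tokens.data(), 1, 0, 0));

    // Proposed addition: one PP-style decode with n_ubatch tokens, so the
    // first measured chunk at N_KV = 0 no longer pays one-time setup costs.
    llama_decode(ctx, llama_batch_get_one(tokens.data(), n_ubatch, 0, 0));

    // Clear the cache so measurements still start from N_KV = 0.
    llama_kv_cache_clear(ctx);
}
```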