Faster CPU prompt processing for Q4_K and Q5_K #525
Merged
These two quantization types are quite popular, so I thought it makes sense to improve their performance. The repacked variants `Q4_K_R4` and `Q5_K_R4` do not have a CUDA implementation, so repacking is not useful in a hybrid CPU/GPU setup, where it may be better to offload tensors stored in RAM to the GPU when processing large batches.

The PR uses the same trick as #515, #516, #517, #518: when processing batches of >= 32 tokens, `Q4_K` or `Q5_K` quantized tensors are repacked on-the-fly to `Q8_1_R8`.

Here are some sweep-bench results for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU:
* `Q4_K`, main branch
* `Q4_K_R4`
* `Q4_K`, PR
* `Q5_K`, main branch
* `Q5_K_R4`
* `Q5_K`, PR
The performance gains here are not as large as in #514, #515, #516, #518, as k-quants are much faster than sub-4-bpw i-quants. Nevertheless, we see a nearly 50% PP performance improvement compared to the non-interleaved variants, and a 5-10% improvement compared to the `_R4` variants.