Faster CPU prompt processing for Q4_K and Q5_K #525
Merged
These two quantization types are quite popular, so I thought it makes sense to improve their performance. The repacked variants `Q4_K_R4` and `Q5_K_R4` do not have a CUDA implementation, so repacking is not useful in a hybrid CPU/GPU setup, where it may be better to offload tensors stored in RAM to the GPU when processing large batches.

The PR uses the same trick as #515, #516, #517, #518: when processing batches of >= 32 tokens, `Q4_K` or `Q5_K` quantized tensors are repacked on-the-fly to `Q8_1_R8`.

Here are some sweep-bench results for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU:
* `Q4_K`, main branch
* `Q4_K_R4`
* `Q4_K`, PR
* `Q5_K`, main branch
* `Q5_K_R4`
* `Q5_K`, PR
The performance gains here are not as large as in #514, #515, #516, #518, as k-quants are much faster than sub-4-bpw i-quants. Nevertheless, we see a nearly 50% PP performance improvement compared to the non-interleaved variants, and a 5-10% improvement compared to the `_R4` variants.