
@ikawrakow
Owner

IQX_K quants offer better quantization quality than k- and i-quants for the same number of bits spent. But on CUDA they are slower for prompt processing (PP) because matrix multiplications are done via dequantize->cuBLAS, so I thought it was time to fix this.

This PR adds quantized matrix multiplications, also known as MMQ, for IQ4_KS.
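For readers unfamiliar with MMQ, here is a minimal CPU-side sketch of the idea, assuming a generic 4-bit block format with one float scale per 32 values. This is *not* the actual IQ4_KS layout nor the CUDA kernel added by this PR; it only illustrates why integer dot products on quantized data avoid the dequantize->cuBLAS round trip:

```cpp
// Simplified, hypothetical illustration of MMQ: the activations are quantized
// to int8 per block and the dot product is accumulated in integers, with the
// float scales applied once per block. The block layout below (32 values,
// low/high nibbles, offset of 8) is an assumption, not the IQ4_KS format.
#include <cstdint>
#include <vector>

struct BlockQ4 {            // hypothetical 4-bit weight block
    float   scale;          // per-block scale
    uint8_t qs[16];         // 32 x 4-bit values, two per byte
};

struct BlockQ8 {            // hypothetical 8-bit activation block
    float  scale;
    int8_t qs[32];
};

// One weight row times one activation column, both already quantized.
float dot_mmq(const std::vector<BlockQ4> & w, const std::vector<BlockQ8> & x) {
    float sum = 0.f;
    for (size_t ib = 0; ib < w.size(); ++ib) {
        int32_t isum = 0;
        for (int j = 0; j < 16; ++j) {
            // unpack two 4-bit weights (offset by 8 to make them signed)
            int w0 = (w[ib].qs[j] & 0x0f) - 8;
            int w1 = (w[ib].qs[j] >>   4) - 8;
            isum += w0 * x[ib].qs[j] + w1 * x[ib].qs[j + 16];
        }
        sum += w[ib].scale * x[ib].scale * isum;   // scales applied once per block
    }
    return sum;
}
```

The CUDA kernel presumably does the same kind of integer accumulation with the GPU's int8 dot-product facilities instead of dequantizing whole tensors to floats and calling a general GEMM.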

The following graph shows PP performance as a function of the number of tokens in the KV cache (N_KV) for the main branch (black) and this PR (red). The model is LLaMA-3.1-8B-Instruct, the GPU is an RTX-4080. We see a very nice performance improvement of around 25%.

[Graph: PP speed (t/s) vs N_KV, main branch (black) vs PR (red)]

Main branch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 0.128 | 3994.38 | 0.995 | 128.62 |
| 512 | 128 | 512 | 0.091 | 5635.54 | 1.003 | 127.59 |
| 512 | 128 | 1024 | 0.093 | 5526.71 | 1.016 | 126.03 |
| 512 | 128 | 1536 | 0.095 | 5405.29 | 1.030 | 124.31 |
| 512 | 128 | 2048 | 0.096 | 5308.45 | 1.046 | 122.40 |
| 512 | 128 | 2560 | 0.098 | 5237.80 | 1.061 | 120.63 |
| 512 | 128 | 3072 | 0.101 | 5079.26 | 1.079 | 118.59 |
| 512 | 128 | 3584 | 0.101 | 5052.15 | 1.095 | 116.86 |
| 512 | 128 | 4096 | 0.103 | 4965.28 | 1.113 | 114.97 |
| 512 | 128 | 4608 | 0.105 | 4883.49 | 1.128 | 113.47 |
| 512 | 128 | 5120 | 0.107 | 4783.71 | 1.152 | 111.10 |
| 512 | 128 | 5632 | 0.109 | 4713.94 | 1.158 | 110.56 |
| 512 | 128 | 6144 | 0.110 | 4644.54 | 1.171 | 109.30 |
| 512 | 128 | 6656 | 0.112 | 4573.92 | 1.184 | 108.10 |
| 512 | 128 | 7168 | 0.114 | 4498.61 | 1.198 | 106.88 |
| 512 | 128 | 7680 | 0.116 | 4421.23 | 1.211 | 105.68 |
| 512 | 128 | 8192 | 0.118 | 4345.69 | 1.225 | 104.46 |
| 512 | 128 | 8704 | 0.120 | 4279.68 | 1.239 | 103.34 |
| 512 | 128 | 9216 | 0.121 | 4220.63 | 1.253 | 102.17 |
| 512 | 128 | 9728 | 0.123 | 4151.40 | 1.281 | 99.89 |
| 512 | 128 | 10240 | 0.125 | 4088.80 | 1.293 | 98.99 |
| 512 | 128 | 10752 | 0.127 | 4034.39 | 1.297 | 98.72 |
| 512 | 128 | 11264 | 0.129 | 3963.86 | 1.308 | 97.83 |
| 512 | 128 | 11776 | 0.130 | 3927.22 | 1.321 | 96.90 |
| 512 | 128 | 12288 | 0.132 | 3864.65 | 1.334 | 95.93 |
| 512 | 128 | 12800 | 0.135 | 3803.55 | 1.350 | 94.83 |
| 512 | 128 | 13312 | 0.136 | 3753.64 | 1.363 | 93.89 |
| 512 | 128 | 13824 | 0.138 | 3698.46 | 1.379 | 92.80 |
| 512 | 128 | 14336 | 0.140 | 3649.74 | 1.392 | 91.93 |
| 512 | 128 | 14848 | 0.142 | 3600.23 | 1.418 | 90.24 |
| 512 | 128 | 15360 | 0.145 | 3531.69 | 1.429 | 89.60 |
| 512 | 128 | 15872 | 0.146 | 3496.17 | 1.442 | 88.79 |
PR

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 0.107 | 4778.97 | 0.995 | 128.59 |
| 512 | 128 | 512 | 0.068 | 7487.24 | 1.003 | 127.58 |
| 512 | 128 | 1024 | 0.070 | 7337.56 | 1.015 | 126.16 |
| 512 | 128 | 1536 | 0.072 | 7143.26 | 1.030 | 124.23 |
| 512 | 128 | 2048 | 0.073 | 6976.14 | 1.046 | 122.32 |
| 512 | 128 | 2560 | 0.074 | 6896.64 | 1.064 | 120.30 |
| 512 | 128 | 3072 | 0.077 | 6618.49 | 1.079 | 118.68 |
| 512 | 128 | 3584 | 0.079 | 6496.14 | 1.093 | 117.06 |
| 512 | 128 | 4096 | 0.080 | 6367.76 | 1.112 | 115.14 |
| 512 | 128 | 4608 | 0.082 | 6212.61 | 1.127 | 113.61 |
| 512 | 128 | 5120 | 0.083 | 6179.25 | 1.151 | 111.17 |
| 512 | 128 | 5632 | 0.085 | 6045.51 | 1.158 | 110.55 |
| 512 | 128 | 6144 | 0.087 | 5889.32 | 1.170 | 109.43 |
| 512 | 128 | 6656 | 0.088 | 5815.14 | 1.183 | 108.18 |
| 512 | 128 | 7168 | 0.092 | 5592.88 | 1.196 | 106.99 |
| 512 | 128 | 7680 | 0.094 | 5473.71 | 1.210 | 105.76 |
| 512 | 128 | 8192 | 0.095 | 5367.61 | 1.225 | 104.51 |
| 512 | 128 | 8704 | 0.097 | 5286.96 | 1.237 | 103.50 |
| 512 | 128 | 9216 | 0.099 | 5192.65 | 1.251 | 102.35 |
| 512 | 128 | 9728 | 0.101 | 5050.26 | 1.279 | 100.07 |
| 512 | 128 | 10240 | 0.102 | 4997.66 | 1.290 | 99.19 |
| 512 | 128 | 10752 | 0.104 | 4906.99 | 1.294 | 98.90 |
| 512 | 128 | 11264 | 0.106 | 4850.78 | 1.306 | 97.98 |
| 512 | 128 | 11776 | 0.108 | 4745.57 | 1.320 | 96.97 |
| 512 | 128 | 12288 | 0.110 | 4664.34 | 1.332 | 96.09 |
| 512 | 128 | 12800 | 0.112 | 4582.72 | 1.347 | 95.00 |
| 512 | 128 | 13312 | 0.113 | 4522.89 | 1.360 | 94.09 |
| 512 | 128 | 13824 | 0.114 | 4485.80 | 1.376 | 93.02 |
| 512 | 128 | 14336 | 0.117 | 4386.19 | 1.389 | 92.13 |
| 512 | 128 | 14848 | 0.119 | 4311.14 | 1.417 | 90.32 |
| 512 | 128 | 15360 | 0.120 | 4249.60 | 1.426 | 89.74 |
| 512 | 128 | 15872 | 0.124 | 4143.10 | 1.439 | 88.94 |

Are you wondering why PP performance for N_KV = 0 is significantly lower? I did as well, so I checked llama-sweep-bench, the tool that generates the data for this graph. Warm-up is done via a single TG run. I checked that if I add another warm-up run with n_ubatch tokens, performance for N_KV = 0 becomes higher than for N_KV = 512, as expected. I guess I will submit a separate PR for that.
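A minimal sketch of what such an extra PP warm-up pass could look like, assuming the llama.cpp-style C API (llama_batch_get_one, llama_decode, llama_kv_cache_clear, llama_token_bos); this is not the actual change from the follow-up PR, and the exact signatures may differ between versions:

```cpp
// Hypothetical extra warm-up pass for llama-sweep-bench: process one batch of
// n_ubatch tokens so the CUDA kernels are found/loaded before the first
// measured PP run (the single TG warm-up does not trigger them).
#include <vector>
#include "llama.h"

static void warmup_pp(llama_context * ctx, const llama_model * model, int n_ubatch) {
    std::vector<llama_token> tokens(n_ubatch, llama_token_bos(model));
    // older llama.cpp APIs take extra position/sequence arguments here
    llama_decode(ctx, llama_batch_get_one(tokens.data(), (int32_t) tokens.size()));
    // drop the warm-up tokens so the benchmark starts from an empty KV cache
    llama_kv_cache_clear(ctx);
}
```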

TG performance is not affected at all by the PR, so no graph for that.

Iwan Kawrakow added 3 commits May 4, 2025 09:19
~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.
@saood06
Collaborator

saood06 commented May 4, 2025

> I checked that if I add another warm-up run with n_ubatch tokens, performance for N_KV = 0 becomes higher than for N_KV = 512, as expected. I guess I will submit a separate PR for that.

Interesting. I've always dealt with it by either comparing the second row (as it is generally more stable between runs anyway) or just running a very low-context sweep-bench as a warm-up.

@ikawrakow
Owner Author

> Interesting. I've always dealt with it by either comparing the second row (as it is generally more stable between runs anyway) or just running a very low-context sweep-bench as a warm-up.

It does not affect CPU performance. But on CUDA the time it takes to find and load the pre-compiled kernels is not negligible compared to the time needed to compute a batch (well, at least for the 8B model I used here). I had noticed this peculiar behavior, but as I have been testing mostly MoE models lately, I thought it was somehow related to that (we know MoE models do better with larger u-batches).

I'll make the PP warm-up pass optional via a command line argument, as for very large models on the CPU it does take some time to process a batch of 512 tokens.
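A rough sketch of gating the warm-up behind such a flag; the flag name (borrowed from the --warmup-batch PR mentioned further down) and the way sweep-bench wires its parameters are assumptions, not the actual change:

```cpp
// Hypothetical opt-in flag for the extra PP warm-up pass in llama-sweep-bench.
#include <cstring>

struct sweep_params {
    bool warmup_batch = false;   // off by default: large CPU models pay ~1 extra batch of PP
};

static void parse_warmup_flag(int argc, char ** argv, sweep_params & p) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--warmup-batch") == 0) p.warmup_batch = true;
    }
}
// later: if (params.warmup_batch) warmup_pp(ctx, model, n_ubatch);
```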

@saood06
Collaborator

saood06 commented May 4, 2025

> It does not affect CPU performance.

I just looked back at my notes/logs: on the CPU it is the first TG run that varies, and the cause is different, as there is corresponding disk activity that is almost certainly to blame (very little, but still some, and in my experience even a single HDD seek can sometimes be seen in the numbers). I have done GPU speed testing, but I generally don't look at the PP results, especially not at low contexts, so I never reran to see it go away.

> I'll make the PP warm-up pass optional via a command line argument, as for very large models on the CPU it does take some time to process a batch of 512 tokens.

Thanks, I was going to suggest that, as that is very true for some of my testing.

ikawrakow merged commit f7c9a0f into main on May 4, 2025
@ubergarm
Contributor

ubergarm commented May 7, 2025

I'm working on some benchmarks for various Qwen3-30B-A3B quants and ran some llama-sweep-bench runs, and this PR is looking good for your IQ4_KS. I used the --warmup-batch PR as well.

[Benchmark graphs attached: ik_llama.cpp (Qwen3-30B-A3B-ik-ggufs) and mainline (Qwen3-30B-A3B-mainline-gguf-roundup)]
