UPSTREAM PR #20635: [CUDA] Increase number of output elements per-thread block if the K-dimension is small#1275

Open
loci-dev wants to merge 3 commits into `main` from `loci/pr-20635-small_k_optimization`

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20635

The K-dimension (the inner dot-product dimension) of the FFN-down matrices can be quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some threads idle and hurting performance for these matrices.

This change increases the number of output elements per block for such matrices.

This change is also helpful for Tensor parallelism (PR ggml-org/llama.cpp#19378), where FFN-down is split along the K dimension.

**Single-GPU performance on 1x RTX Pro 6000 Blackwell**
| model | n_ubatch | n_prompt | master avg t/s | PR avg t/s | speed-up |
| --- | ---: | ---: | ---: | ---: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 1 | 512 | 231.4418 | 239.4359 | 1.03 |
| qwen3moe 30B.A3B Q4_K - Medium | 2 | 512 | 336.3564 | 353.3403 | 1.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 4 | 512 | 498.7951 | 544.9048 | 1.09 |
| qwen3moe 30B.A3B Q4_K - Medium | 8 | 512 | 579.7136 | 580.2928 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 16 | 512 | 936.1984 | 934.2313 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 32 | 512 | 1456.243 | 1453.281 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 64 | 512 | 2185.851 | 2185.245 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 128 | 512 | 2970.54 | 2969.02 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 256 | 512 | 4774.641 | 4779.619 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 512 | 512 | 6587.268 | 6592.251 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 1 | 512 | 188.6321 | 189.3348 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 2 | 512 | 296.4038 | 304.8155 | 1.03 |
| qwen3moe 30B.A3B Q8_0 | 4 | 512 | 446.3545 | 480.4061 | 1.08 |
| qwen3moe 30B.A3B Q8_0 | 8 | 512 | 513.8571 | 513.5698 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 16 | 512 | 814.9273 | 809.3003 | 0.99 |
| qwen3moe 30B.A3B Q8_0 | 32 | 512 | 1309.532 | 1310.682 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 64 | 512 | 2145.738 | 2147.491 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 128 | 512 | 3039.336 | 3040.037 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 256 | 512 | 4908.882 | 4912.358 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 512 | 512 | 6795.054 | 6800.975 | 1.00 |
| qwen3 4B Q4_K - Medium | 1 | 512 | 270.4391 | 270.4142 | 1.00 |
| qwen3 4B Q4_K - Medium | 2 | 512 | 522.5462 | 523.2189 | 1.00 |
| qwen3 4B Q4_K - Medium | 4 | 512 | 888.7895 | 891.6788 | 1.00 |
| qwen3 4B Q4_K - Medium | 8 | 512 | 1331.554 | 1333.544 | 1.00 |
| qwen3 4B Q4_K - Medium | 16 | 512 | 2609.212 | 2613.457 | 1.00 |
| qwen3 4B Q4_K - Medium | 32 | 512 | 4131.247 | 4153.166 | 1.01 |
| qwen3 4B Q4_K - Medium | 64 | 512 | 6010.69 | 6040.168 | 1.00 |
| qwen3 4B Q4_K - Medium | 128 | 512 | 8336.18 | 8368.532 | 1.00 |
| qwen3 4B Q4_K - Medium | 256 | 512 | 12653.47 | 12680.27 | 1.00 |
| qwen3 4B Q4_K - Medium | 512 | 512 | 16933.91 | 16990.33 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 1 | 512 | 327.1843 | 327.2503 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 2 | 512 | 487.6076 | 487.2249 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 4 | 512 | 722.2551 | 722.1628 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 8 | 512 | 909.277 | 911.6954 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 16 | 512 | 1475.936 | 1474.678 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 32 | 512 | 2448.124 | 2449.26 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 64 | 512 | 4019.604 | 4021.089 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 128 | 512 | 5825.155 | 5820.645 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 256 | 512 | 8901.978 | 8885.761 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | 512 | 11634.01 | 11628.95 | 1.00 |
| llama 8B Q4_K - Medium | 1 | 512 | 218.6202 | 218.5992 | 1.00 |
| llama 8B Q4_K - Medium | 2 | 512 | 426.0842 | 425.9328 | 1.00 |
| llama 8B Q4_K - Medium | 4 | 512 | 753.1047 | 753.2821 | 1.00 |
| llama 8B Q4_K - Medium | 8 | 512 | 1043.164 | 1042.673 | 1.00 |
| llama 8B Q4_K - Medium | 16 | 512 | 2306.093 | 2301.84 | 1.00 |
| llama 8B Q4_K - Medium | 32 | 512 | 3720.924 | 3730.606 | 1.00 |
| llama 8B Q4_K - Medium | 64 | 512 | 5444.508 | 5457.328 | 1.00 |
| llama 8B Q4_K - Medium | 128 | 512 | 7452.762 | 7408.557 | 0.99 |
| llama 8B Q4_K - Medium | 256 | 512 | 10174.56 | 10179.98 | 1.00 |
| llama 8B Q4_K - Medium | 512 | 512 | 12917.97 | 12923.66 | 1.00 |
| llama 8B Q4_0 | 1 | 512 | 232.4301 | 232.74 | 1.00 |
| llama 8B Q4_0 | 2 | 512 | 461.9919 | 461.8752 | 1.00 |
| llama 8B Q4_0 | 4 | 512 | 889.2508 | 889.4003 | 1.00 |
| llama 8B Q4_0 | 8 | 512 | 1377.003 | 1377.244 | 1.00 |
| llama 8B Q4_0 | 16 | 512 | 2338.211 | 2335.362 | 1.00 |
| llama 8B Q4_0 | 32 | 512 | 3822.713 | 3822.771 | 1.00 |
| llama 8B Q4_0 | 64 | 512 | 5891.381 | 5883.02 | 1.00 |
| llama 8B Q4_0 | 128 | 512 | 7699.878 | 7715.334 | 1.00 |
| llama 8B Q4_0 | 256 | 512 | 10874.01 | 10842.19 | 1.00 |
| llama 8B Q4_0 | 512 | 512 | 14034.76 | 14027.07 | 1.00 |
**Tensor-parallelism performance on 2x and 4x RTX Pro 6000 Blackwell with PR ggml-org/llama.cpp#19378**
| GPUs | model | test | ae0334ffa t/s | PR t/s | speed-up |
| --- | --- | --- | ---: | ---: | ---: |
| 2x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | pp512 | 2165.37 | 2167.83 | 1.00 |
| 2x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | tg128 | 71.51 | 75.37 | 1.05 |
| 2x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | pp512 | 8357.29 | 8359.93 | 1.00 |
| 2x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | tg128 | 182.1 | 194.26 | 1.07 |
| 4x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | pp512 | 2367.91 | 2342.61 | 0.99 |
| 4x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | tg128 | 66.05 | 71.5 | 1.08 |
| 4x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | pp512 | 8408.73 | 8415.25 | 1.00 |
| 4x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | tg128 | 155.57 | 162.79 | 1.05 |

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some threads idle and hurting performance for these matrices.

This change increases the number of output elements per block for such cases.
@loci-review

loci-review bot commented Mar 20, 2026

No meaningful performance changes were detected across 120772 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

