UPSTREAM PR #20635: [CUDA] Increase number of output elements per-thread block if the K-dimension is small#1275

Open
loci-dev wants to merge 3 commits into `main` from `loci/pr-20635-small_k_optimization`

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20635

The K-dimension (the inner dot-product dimension) of the FFN-down matrices can be quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some threads idle and hurting performance for these matrices.

This change increases the number of output elements per block for such matrices.

This change is also helpful for Tensor parallelism (PR ggml-org/llama.cpp#19378), where FFN-down is split along the K dimension.

**Single-GPU performance on 1x RTX Pro 6000 Blackwell**
| model | n_ubatch | n_prompt | master avg t/s | PR avg t/s | speed-up |
| --- | ---: | ---: | ---: | ---: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 1 | 512 | 231.4418 | 239.4359 | 1.03 |
| qwen3moe 30B.A3B Q4_K - Medium | 2 | 512 | 336.3564 | 353.3403 | 1.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 4 | 512 | 498.7951 | 544.9048 | 1.09 |
| qwen3moe 30B.A3B Q4_K - Medium | 8 | 512 | 579.7136 | 580.2928 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 16 | 512 | 936.1984 | 934.2313 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 32 | 512 | 1456.243 | 1453.281 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 64 | 512 | 2185.851 | 2185.245 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 128 | 512 | 2970.54 | 2969.02 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 256 | 512 | 4774.641 | 4779.619 | 1.00 |
| qwen3moe 30B.A3B Q4_K - Medium | 512 | 512 | 6587.268 | 6592.251 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 1 | 512 | 188.6321 | 189.3348 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 2 | 512 | 296.4038 | 304.8155 | 1.03 |
| qwen3moe 30B.A3B Q8_0 | 4 | 512 | 446.3545 | 480.4061 | 1.08 |
| qwen3moe 30B.A3B Q8_0 | 8 | 512 | 513.8571 | 513.5698 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 16 | 512 | 814.9273 | 809.3003 | 0.99 |
| qwen3moe 30B.A3B Q8_0 | 32 | 512 | 1309.532 | 1310.682 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 64 | 512 | 2145.738 | 2147.491 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 128 | 512 | 3039.336 | 3040.037 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 256 | 512 | 4908.882 | 4912.358 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | 512 | 512 | 6795.054 | 6800.975 | 1.00 |
| qwen3 4B Q4_K - Medium | 1 | 512 | 270.4391 | 270.4142 | 1.00 |
| qwen3 4B Q4_K - Medium | 2 | 512 | 522.5462 | 523.2189 | 1.00 |
| qwen3 4B Q4_K - Medium | 4 | 512 | 888.7895 | 891.6788 | 1.00 |
| qwen3 4B Q4_K - Medium | 8 | 512 | 1331.554 | 1333.544 | 1.00 |
| qwen3 4B Q4_K - Medium | 16 | 512 | 2609.212 | 2613.457 | 1.00 |
| qwen3 4B Q4_K - Medium | 32 | 512 | 4131.247 | 4153.166 | 1.01 |
| qwen3 4B Q4_K - Medium | 64 | 512 | 6010.69 | 6040.168 | 1.00 |
| qwen3 4B Q4_K - Medium | 128 | 512 | 8336.18 | 8368.532 | 1.00 |
| qwen3 4B Q4_K - Medium | 256 | 512 | 12653.47 | 12680.27 | 1.00 |
| qwen3 4B Q4_K - Medium | 512 | 512 | 16933.91 | 16990.33 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 1 | 512 | 327.1843 | 327.2503 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 2 | 512 | 487.6076 | 487.2249 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 4 | 512 | 722.2551 | 722.1628 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 8 | 512 | 909.277 | 911.6954 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 16 | 512 | 1475.936 | 1474.678 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 32 | 512 | 2448.124 | 2449.26 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 64 | 512 | 4019.604 | 4021.089 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 128 | 512 | 5825.155 | 5820.645 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 256 | 512 | 8901.978 | 8885.761 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | 512 | 11634.01 | 11628.95 | 1.00 |
| llama 8B Q4_K - Medium | 1 | 512 | 218.6202 | 218.5992 | 1.00 |
| llama 8B Q4_K - Medium | 2 | 512 | 426.0842 | 425.9328 | 1.00 |
| llama 8B Q4_K - Medium | 4 | 512 | 753.1047 | 753.2821 | 1.00 |
| llama 8B Q4_K - Medium | 8 | 512 | 1043.164 | 1042.673 | 1.00 |
| llama 8B Q4_K - Medium | 16 | 512 | 2306.093 | 2301.84 | 1.00 |
| llama 8B Q4_K - Medium | 32 | 512 | 3720.924 | 3730.606 | 1.00 |
| llama 8B Q4_K - Medium | 64 | 512 | 5444.508 | 5457.328 | 1.00 |
| llama 8B Q4_K - Medium | 128 | 512 | 7452.762 | 7408.557 | 0.99 |
| llama 8B Q4_K - Medium | 256 | 512 | 10174.56 | 10179.98 | 1.00 |
| llama 8B Q4_K - Medium | 512 | 512 | 12917.97 | 12923.66 | 1.00 |
| llama 8B Q4_0 | 1 | 512 | 232.4301 | 232.74 | 1.00 |
| llama 8B Q4_0 | 2 | 512 | 461.9919 | 461.8752 | 1.00 |
| llama 8B Q4_0 | 4 | 512 | 889.2508 | 889.4003 | 1.00 |
| llama 8B Q4_0 | 8 | 512 | 1377.003 | 1377.244 | 1.00 |
| llama 8B Q4_0 | 16 | 512 | 2338.211 | 2335.362 | 1.00 |
| llama 8B Q4_0 | 32 | 512 | 3822.713 | 3822.771 | 1.00 |
| llama 8B Q4_0 | 64 | 512 | 5891.381 | 5883.02 | 1.00 |
| llama 8B Q4_0 | 128 | 512 | 7699.878 | 7715.334 | 1.00 |
| llama 8B Q4_0 | 256 | 512 | 10874.01 | 10842.19 | 1.00 |
| llama 8B Q4_0 | 512 | 512 | 14034.76 | 14027.07 | 1.00 |
**Tensor-parallelism performance on 2x and 4x RTX Pro 6000 Blackwell with PR ggml-org/llama.cpp#19378**
| GPUs | model | test | ae0334ffa t/s | PR t/s | speed-up |
| --- | --- | --- | ---: | ---: | ---: |
| 2x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | pp512 | 2165.37 | 2167.83 | 1.00 |
| 2x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | tg128 | 71.51 | 75.37 | 1.05 |
| 2x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | pp512 | 8357.29 | 8359.93 | 1.00 |
| 2x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | tg128 | 182.1 | 194.26 | 1.07 |
| 4x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | pp512 | 2367.91 | 2342.61 | 0.99 |
| 4x RTX 6000 Pro BW | Qwen3-235B-A22B-Q4_0 | tg128 | 66.05 | 71.5 | 1.08 |
| 4x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | pp512 | 8408.73 | 8415.25 | 1.00 |
| 4x RTX 6000 Pro BW | Qwen3-30B-A3B-Q4_0 | tg128 | 155.57 | 162.79 | 1.05 |

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of the K-dimension size, leaving some threads idle and hurting performance for these matrices.

This change increases the number of output elements per block for such cases.
@loci-review

loci-review bot commented Mar 20, 2026

No meaningful performance changes were detected across 120772 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

