UPSTREAM PR #19053: CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup by loci-dev · Pull Request #1107 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-31T09:41:41Z

Note

Source pull request: ggml-org/llama.cpp#19053

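Passing the stride_* variables as size_t (i.e., 64-bit) lets the compiler correctly unroll the [two for-loops](https://github.com/ggml-org/llama.cpp/blob/557515be1e93ed8939dd8a7c7d08765fdbe8be31/ggml/src/ggml-cuda/mmq.cuh#L3789-L3816) in mul_mat_q_stream_k_fixup on BW (Blackwell). This gives a small speedup in the prefill (pp) phase on BW, up to about 4% in the numbers below, without affecting other SM architectures.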

Going forward, the general performance guidance for pointer arithmetic inside loops is likely to be: do it in 64-bit unless 32-bit is strictly necessary.
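To make the stride change concrete, here is a minimal host-side C++ sketch of the indexing pattern. It is not the kernel code from mmq.cuh: `add_tile`, `run_fixup_sketch`, and the tile sizes are hypothetical names and values chosen for illustration, and running it on a CPU says nothing about how the CUDA compiler unrolls the real loops on Blackwell. It only shows that the same loop body can take its stride as `int` (offset computed in 32 bits, then widened) or as `std::size_t` (offset computed in 64 bits throughout) with identical results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the accumulation loops in the fixup kernel.
// TILE_ROWS and TILE_COLS are compile-time constants, so both loops have
// fixed trip counts and are candidates for full unrolling.
template <int TILE_ROWS, int TILE_COLS, typename stride_t>
void add_tile(float * dst, const float * sum, const stride_t stride_col) {
    for (int j = 0; j < TILE_ROWS; ++j) {
        for (int i = 0; i < TILE_COLS; ++i) {
            // stride_t = int:         j*stride_col is a 32-bit product that is
            //                         widened only when added to the pointer.
            // stride_t = std::size_t: the whole offset is computed in 64 bits.
            dst[j*stride_col + i] += sum[j*TILE_COLS + i];
        }
    }
}

// Adds a 4x8 tile of partial sums into a destination with 16 columns,
// using either a 64-bit or a 32-bit stride, and returns the destination.
std::vector<float> run_fixup_sketch(const bool use_64bit_stride) {
    constexpr int rows = 4, cols = 8, stride = 16;
    std::vector<float> dst(rows*stride, 1.0f);
    std::vector<float> sum(rows*cols, 2.0f);
    assert(dst.size() >= static_cast<std::size_t>((rows - 1)*stride + cols));

    if (use_64bit_stride) {
        add_tile<rows, cols>(dst.data(), sum.data(), static_cast<std::size_t>(stride));
    } else {
        add_tile<rows, cols>(dst.data(), sum.data(), static_cast<int>(stride));
    }
    return dst;
}
```

Both instantiations write the same values, so switching the stride type is a codegen change rather than a behavioral one, which matches the PR's claim that other SMs are unaffected.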

Perf numbers

GPU	Model	Test	t/s master	t/s osimons/fix_bw_mmq_fixup_kernel	Speedup
NVIDIA RTX 6000 Ada Generation	gpt-oss 20B MXFP4 MoE	pp8096	8404.05	8375.79	1.00
NVIDIA RTX 6000 Ada Generation	gpt-oss 20B MXFP4 MoE	tg128	253.79	253.90	1.00
NVIDIA RTX 6000 Ada Generation	llama 3B Q4_K_M	pp8096	16148.93	16019.60	0.99
NVIDIA RTX 6000 Ada Generation	llama 3B Q4_K_M	tg128	315.50	315.08	1.00
NVIDIA RTX 6000 Ada Generation	llama 8B Q4_0	pp8096	8008.29	7978.80	1.00
NVIDIA RTX 6000 Ada Generation	llama 8B Q4_0	tg128	168.87	168.85	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B BF16	pp8096	4263.16	4248.53	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B BF16	tg128	48.61	48.59	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B Q4_K_M	pp8096	5165.11	5157.43	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B Q4_K_M	tg128	111.54	111.47	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	gpt-oss 20B MXFP4 MoE	pp8096	12582.80	12758.37	1.01
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	gpt-oss 20B MXFP4 MoE	tg128	352.58	353.16	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 3B Q4_K_M	pp8096	16879.10	17619.47	1.04
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 3B Q4_K_M	tg128	426.27	425.65	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 8B Q4_0	pp8096	10649.90	10982.65	1.03
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 8B Q4_0	tg128	260.32	260.25	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B BF16	pp8096	7717.73	7716.22	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B BF16	tg128	83.51	83.51	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B Q4_K_M	pp8096	7301.90	7370.38	1.01
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B Q4_K_M	tg128	172.99	172.78	1.00
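For reading the Speedup column, the values appear to be the branch throughput divided by the master throughput, rounded to two decimals. That is an assumption inferred from the numbers rather than something the PR states, and `speedup` below is a hypothetical helper:

```cpp
#include <cassert>
#include <cmath>

// Assumed definition of the Speedup column: tokens/s on the PR branch
// divided by tokens/s on master, rounded to two decimal places.
double speedup(const double tps_master, const double tps_branch) {
    assert(tps_master > 0.0);
    return std::round(tps_branch / tps_master * 100.0) / 100.0;
}
```

With the Blackwell llama 3B Q4_K_M pp8096 row, `speedup(16879.10, 17619.47)` gives 1.04, which agrees with the table.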

loci-review · 2026-01-31T10:29:16Z

No meaningful performance changes were detected across 112622 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-bench, build.bin.libggml.so, build.bin.libggml-cpu.so, build.bin.libggml-base.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev temporarily deployed to PROD__AL_DEMO with GitHub Actions on January 31, 2026 09:41 (deployment now inactive)

loci-dev force-pushed the main branch from 8c82563 to 2f72634 on January 31, 2026 10:09

loci-dev force-pushed the main branch 26 times, most recently from d613f70 to 6a853c2, on February 1, 2026 13:22

loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6, on February 3, 2026 13:32

loci-dev wants to merge 1 commit into main from loci/pr-19053-osimons-fix_bw_mmq_fixup_kernel
