UPSTREAM PR #19053: CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup by loci-dev · Pull Request #1009 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-23T17:39:11Z

Mirrored from ggml-org/llama.cpp#19053

Passing the stride_* variables as size_t (i.e., 64-bit) lets the compiler correctly unroll the two for-loops on Blackwell (BW). This improves prefill (pp) throughput on BW by up to about 4% in the pp8096 runs below, while not affecting other SMs.

For pointer arithmetic inside loops, the general performance guidance going forward is likely to be: perform it in 64-bit unless a narrower type is strictly necessary.
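To illustrate the kind of change described, here is a minimal host-side C++ sketch, not the actual kernel code from mmq.cuh. The function name `accumulate_tile`, its parameters, and the tile layout are hypothetical and chosen only for illustration. The relevant detail is that the strides are `size_t`, so each offset is computed directly at pointer width instead of being computed in 32 bits and widened afterwards. Whether that is the exact reason the Blackwell compiler unrolls the loops is an assumption; the PR does not spell out the mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical analogue of a fixup-style accumulation loop.
// TILE_ROWS and TILE_COLS are compile-time constants, so both loops have
// known trip counts and are candidates for full unrolling (in CUDA this
// would typically be requested with "#pragma unroll").
template <int TILE_ROWS, int TILE_COLS>
void accumulate_tile(float * dst, const float * partial,
                     const size_t stride_row, const size_t stride_col) {
    assert(dst != nullptr && partial != nullptr);
    for (int j = 0; j < TILE_ROWS; ++j) {
        for (int i = 0; i < TILE_COLS; ++i) {
            // The 64-bit strides make the whole offset a 64-bit expression.
            const size_t offset = static_cast<size_t>(j) * stride_row
                                + static_cast<size_t>(i) * stride_col;
            dst[offset] += partial[j * TILE_COLS + i];
        }
    }
}
```

With a destination row stride of 8 and a column stride of 1, `accumulate_tile<2, 4>(dst, partial, 8, 1)` adds a 2x4 tile of partial sums into the first four elements of two consecutive 8-element rows and leaves the rest untouched.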

Perf numbers

GPU	Model	Test	t/s master	t/s osimons/fix_bw_mmq_fixup_kernel	Speedup
NVIDIA RTX 6000 Ada Generation	gpt-oss 20B MXFP4 MoE	pp8096	8404.05	8375.79	1.00
NVIDIA RTX 6000 Ada Generation	gpt-oss 20B MXFP4 MoE	tg128	253.79	253.90	1.00
NVIDIA RTX 6000 Ada Generation	llama 3B Q4_K_M	pp8096	16148.93	16019.60	0.99
NVIDIA RTX 6000 Ada Generation	llama 3B Q4_K_M	tg128	315.50	315.08	1.00
NVIDIA RTX 6000 Ada Generation	llama 8B Q4_0	pp8096	8008.29	7978.80	1.00
NVIDIA RTX 6000 Ada Generation	llama 8B Q4_0	tg128	168.87	168.85	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B BF16	pp8096	4263.16	4248.53	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B BF16	tg128	48.61	48.59	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B Q4_K_M	pp8096	5165.11	5157.43	1.00
NVIDIA RTX 6000 Ada Generation	nemotron_h 9B Q4_K_M	tg128	111.54	111.47	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	gpt-oss 20B MXFP4 MoE	pp8096	12582.80	12758.37	1.01
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	gpt-oss 20B MXFP4 MoE	tg128	352.58	353.16	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 3B Q4_K_M	pp8096	16879.10	17619.47	1.04
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 3B Q4_K_M	tg128	426.27	425.65	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 8B Q4_0	pp8096	10649.90	10982.65	1.03
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	llama 8B Q4_0	tg128	260.32	260.25	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B BF16	pp8096	7717.73	7716.22	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B BF16	tg128	83.51	83.51	1.00
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B Q4_K_M	pp8096	7301.90	7370.38	1.01
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition	nemotron_h 9B Q4_K_M	tg128	172.99	172.78	1.00
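
The Speedup column appears to be the branch throughput divided by the master throughput, rounded to two decimals. That reading is an assumption, but it matches the rows above; the helper name `speedup` below is hypothetical.

```cpp
#include <cmath>

// Ratio of branch tokens/s to master tokens/s, rounded to two decimals.
double speedup(double tps_master, double tps_branch) {
    return std::round(tps_branch / tps_master * 100.0) / 100.0;
}
```

For the Blackwell llama 3B Q4_K_M pp8096 row, `speedup(16879.10, 17619.47)` gives 1.04, and for the same model on Ada, `speedup(16148.93, 16019.60)` gives 0.99.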

The two for-loops referenced in the description: [ggml/src/ggml-cuda/mmq.cuh#L3789-L3816](https://github.com/ggml-org/llama.cpp/blob/557515be1e93ed8939dd8a7c7d08765fdbe8be31/ggml/src/ggml-cuda/mmq.cuh#L3789-L3816)

loci-review · 2026-01-23T18:41:35Z

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The function_insights_topk tool returned empty results for both response time and throughput time metrics, indicating that the code changes in this version do not introduce measurable performance impacts.

This suggests that the modifications between versions are either:

- Non-performance-affecting changes (documentation, comments, formatting)
- Refactoring that maintains equivalent performance characteristics
- Changes to non-critical code paths with negligible execution time
- Bug fixes or feature additions that don't alter the hot path execution

Without significant performance deltas to analyze, no further investigation into specific functions, power consumption changes, or execution path differences is warranted for this version comparison.



loci-dev wants to merge 1 commit into main from upstream-PR19053-branch_ORippler-osimons/fix_bw_mmq_fixup_kernel
