UPSTREAM PR #19053: CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup #1009

Open
loci-dev wants to merge 1 commit into main from upstream-PR19053-branch_ORippler-osimons/fix_bw_mmq_fixup_kernel

Mirrored from ggml-org/llama.cpp#19053

Passing the stride_* variables as size_t (i.e., 64-bit) lets the compiler correctly unroll the two for-loops in mul_mat_q_stream_k_fixup on Blackwell (BW). This yields a small speedup (up to 1.04x in the numbers below) in the prefill (pp) phase on BW and does not affect other SMs.

For pointer arithmetic inside loops, the likely general performance guidance going forward is to perform it in 64-bit unless 32-bit is strictly necessary.

Perf numbers
| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
|:----|:------|:-----|-----------:|------------------------------------:|--------:|
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | tg128 | 253.79 | 253.90 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | tg128 | 315.50 | 315.08 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | tg128 | 168.87 | 168.85 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | tg128 | 48.61 | 48.59 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | tg128 | 111.54 | 111.47 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | tg128 | 352.58 | 353.16 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | tg128 | 426.27 | 425.65 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | tg128 | 260.32 | 260.25 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | tg128 | 83.51 | 83.51 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | tg128 | 172.99 | 172.78 | 1.00 |

By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
correctly unroll the [two for-loops](https://github.com/ggml-org/llama.cpp/blob/557515be1e93ed8939dd8a7c7d08765fdbe8be31/ggml/src/ggml-cuda/mmq.cuh#L3789-L3816)
on BW. This gives some perf for prefill/pp phase on BW, while not affecting
other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |

loci-review bot commented Jan 23, 2026

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The function_insights_topk tool returned empty results for both response time and throughput time metrics, indicating that the code changes in this version do not introduce measurable performance impacts.

This suggests that the modifications between versions are either:

  • Non-performance-affecting changes (documentation, comments, formatting)
  • Refactoring that maintains equivalent performance characteristics
  • Changes to non-critical code paths with negligible execution time
  • Bug fixes or feature additions that don't alter the hot path execution

Without significant performance deltas to analyze, no further investigation into specific functions, power consumption changes, or execution path differences is warranted for this version comparison.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from a50395f to 8587aee Compare January 27, 2026 19:14
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 5fea2ef to 8a7ef20 Compare January 31, 2026 08:12