UPSTREAM PR #19053: CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup #1107

Open
loci-dev wants to merge 1 commit into main from loci/pr-19053-osimons-fix_bw_mmq_fixup_kernel

@loci-dev
Note

Source pull request: ggml-org/llama.cpp#19053

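Declaring the stride_* variables as size_t (i.e., 64-bit) lets the compiler correctly unroll the two for-loops in mul_mat_q_stream_k_fixup on Blackwell (BW). This gives a small speedup in the prefill (pp) phase on BW, up to about 4% in the measurements below, while not affecting other SM architectures.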

For pointer arithmetic inside loops, the general performance guidance going forward is likely to be: perform it in 64-bit unless 32-bit is strictly necessary.
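
To make the shape of the change concrete, here is a minimal host-side C++ sketch. It is not the code from mmq.cuh: the tile sizes, `add_tile_32`, `add_tile_64`, and `variants_agree` are hypothetical names invented for illustration. The only difference between the two variants is the type of the stride parameter, which decides whether the offset `j * stride_row + i` is computed in 32-bit or 64-bit. The sketch only checks that both forms produce the same result; whether the 64-bit form unrolls better is a property of the CUDA compiler on BW, as reported in this PR, and is not demonstrated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes, fixed at compile time so both loops have
// constant trip counts (the precondition for full unrolling).
constexpr int TILE_ROWS = 4;
constexpr int TILE_COLS = 8;

// 32-bit variant: the offset j * stride_row + i is computed as int.
inline void add_tile_32(float* dst, const float* partial, int stride_row) {
    for (int j = 0; j < TILE_ROWS; ++j) {
        for (int i = 0; i < TILE_COLS; ++i) {
            dst[j * stride_row + i] += partial[j * TILE_COLS + i];
        }
    }
}

// 64-bit variant: stride_row is size_t, so the offset is computed in
// 64-bit and already matches the width of the pointer it is added to.
inline void add_tile_64(float* dst, const float* partial, std::size_t stride_row) {
    for (int j = 0; j < TILE_ROWS; ++j) {
        for (int i = 0; i < TILE_COLS; ++i) {
            dst[j * stride_row + i] += partial[j * TILE_COLS + i];
        }
    }
}

// Runs both variants on identical inputs and reports whether the
// destination buffers end up equal.
inline bool variants_agree(std::size_t stride_row) {
    assert(stride_row >= static_cast<std::size_t>(TILE_COLS));

    std::vector<float> partial(TILE_ROWS * TILE_COLS);
    for (std::size_t k = 0; k < partial.size(); ++k) {
        partial[k] = static_cast<float>(k);
    }

    std::vector<float> a(TILE_ROWS * stride_row, 1.0f);
    std::vector<float> b(a);

    add_tile_32(a.data(), partial.data(), static_cast<int>(stride_row));
    add_tile_64(b.data(), partial.data(), stride_row);
    return a == b;
}
```

Calling `variants_agree(8)` or `variants_agree(100)` exercises both variants with a dense and a padded row stride. In a real kernel the change would be limited to the stride declarations, leaving the loop bodies as they are.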

Perf numbers

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | tg128  |       253.79 |                                253.90 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | tg128  |       315.50 |                                315.08 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | tg128  |       168.87 |                                168.85 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | tg128  |        48.61 |                                 48.59 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | tg128  |       111.54 |                                111.47 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | tg128  |       352.58 |                                353.16 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | tg128  |       426.27 |                                425.65 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | tg128  |       260.32 |                                260.25 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | tg128  |        83.51 |                                 83.51 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | tg128  |       172.99 |                                172.78 |      1.00 |

By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
correctly unroll the [two for-loops](https://github.com/ggml-org/llama.cpp/blob/557515be1e93ed8939dd8a7c7d08765fdbe8be31/ggml/src/ggml-cuda/mmq.cuh#L3789-L3816)
on BW. This yields a small speedup in the prefill (pp) phase on BW, while not affecting
other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
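
The Speedup column is consistent with the branch throughput divided by the master throughput, rounded to two decimals. The sketch below reproduces that arithmetic for a few rows. The helper names `speedup` and `round2` are hypothetical, and the rounding rule is an assumption inferred from the reported values rather than taken from the benchmarking tool.

```cpp
#include <cassert>
#include <cmath>

// Ratio of branch throughput to master throughput (both in tokens/s).
// Values above 1.0 mean the branch is faster.
inline double speedup(double tps_master, double tps_branch) {
    assert(tps_master > 0.0);
    return tps_branch / tps_master;
}

// Rounds to two decimal places, matching the precision used in the table
// (assumed rounding rule).
inline double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, `round2(speedup(16879.10, 17619.47))` gives 1.04 for llama 3B Q4_K_M pp8096 on the Blackwell card, and `round2(speedup(16148.93, 16019.60))` gives 0.99 for the same test on the Ada card.
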
@loci-review
loci-review bot commented Jan 31, 2026

No meaningful performance changes were detected across 112622 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-bench, build.bin.libggml.so, build.bin.libggml-cpu.so, build.bin.libggml-base.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev force-pushed the main branch 26 times, most recently from d613f70 to 6a853c2 (February 1, 2026 13:22)
@loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6 (February 3, 2026 13:32)