Patch perf regression for mmq kernels in ROCm#18442

Closed
jiachengjason wants to merge 2 commits into ggml-org:master from jiachengjason:fix/jiachengjason/rocm7.x_regression

Conversation

@jiachengjason (Contributor) commented Dec 28, 2025

Recovers the performance regression from #17917 on RDNA3 and RDNA4 by choosing more performant configs for the MMQ kernels.

This uses an approach similar to the CDNA config for now to patch the regression; further optimization will follow in upcoming PRs.

Strix Halo performance after the patch:

| model | size | params | backend | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 | 1001.53 ± 32.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 | 51.83 ± 0.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d4096 | 769.89 ± 2.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d4096 | 42.45 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d8192 | 595.15 ± 1.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d8192 | 36.75 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d16384 | 402.48 ± 0.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d16384 | 29.10 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d32768 | 224.79 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d32768 | 19.99 ± 0.02 |

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 28, 2025
@Beinsezii (Contributor) commented Dec 29, 2025

Did a quick compile on gfx1100; it looks like a small 24B model @ Q6_K now matches pre-#17576 performance without forcing cuBLAS.

Qwen 30B-A3B (Unsloth Q5_K_XL) lost 40% of its pp throughput versus MMQ; it might need an n_experts branch like the CDNA path?

@Beinsezii (Contributor) commented Dec 29, 2025

Naively copying the CDNA path as presented in #18202:

diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index 1a297971..0a0d440a 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -333,7 +333,10 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
     }
 
     if (amd_wmma_available(cc)) {
-        if (ne11 <= 128 || type == GGML_TYPE_Q4_0 || type == GGML_TYPE_Q4_1 || type == GGML_TYPE_Q5_0 || type == GGML_TYPE_Q5_1) {
+        if (n_experts > 64 || ne11 <= 128) {
+            return true;
+        }
+        if (type == GGML_TYPE_Q4_0 || type == GGML_TYPE_Q4_1 || type == GGML_TYPE_Q5_0 || type == GGML_TYPE_Q5_1) {
             return true;
         }
         if (ne11 <= 256 && (type == GGML_TYPE_Q4_K || type == GGML_TYPE_Q5_K)) {

makes this positive or equal on all the models I happen to have on disk, but I'm not sure whether RDNA should use a different expert-count threshold.
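As an aside, the RDNA (amd_wmma) branch in the diff above can be modeled as a small pure function. This is only an illustrative sketch: the function name and the type constants are stand-ins, not the actual ggml-cuda API, and any branches outside the visible diff context are assumed to return False here.

```python
# Sketch of the RDNA (amd_wmma) branch of ggml_cuda_should_use_mmq from the
# diff above. Type constants are illustrative stand-ins for the GGML_TYPE_*
# enum; branches beyond the visible diff context are assumed to return False.
Q4_0, Q4_1, Q5_0, Q5_1, Q4_K, Q5_K, Q6_K = range(7)

def should_use_mmq_rdna(qtype: int, ne11: int, n_experts: int) -> bool:
    # MoE models with many experts, or small batch sizes (ne11), favor MMQ.
    if n_experts > 64 or ne11 <= 128:
        return True
    # The legacy quant types always take the MMQ path here.
    if qtype in (Q4_0, Q4_1, Q5_0, Q5_1):
        return True
    # Mid-size batches still favor MMQ for these K-quants.
    if ne11 <= 256 and qtype in (Q4_K, Q5_K):
        return True
    return False
```

This makes clear why the `n_experts > 64` branch helps MoE models like Qwen 30B-A3B: with a large expert count, large-batch K-quant matmuls that would otherwise fall through to dequant+GEMM stay on MMQ.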

Looking through #14949, it seems this block was tuned by pretty much just A/B testing every quant type, which I don't have time for today; this likely needs to be redone for RDNA.

@jiachengjason jiachengjason marked this pull request as ready for review December 29, 2025 17:03
@Beinsezii (Contributor) commented:

For my new year's resolution I've picked this up and created more granular switching in #18537, based on results from semi-automated benchmarks. You might want to check it on Strix Halo, @jiachengjason.

@JohannesGaessler (Contributor) commented:

The testing methodology that I am using for matrix multiplications is something like this:

export mn=llama_3-8b
for q in q4_0 q4_1 q5_0 q5_1 q8_0 q2_k_s q3_k_s q4_k_s q5_k_s q6_k iq1_s iq2_xxs iq2_xs iq2_s iq3_xxs iq3_xs iq3_s iq3_m iq4_nl iq4_xs; do
    echo $q
    ./bench --model models/opt/${mn}-${q}.gguf -r 1 -fa 1 -n 0 -p 2048 -ub "1-2048*2" --progress -o sql | sqlite3 llama-bench.sqlite
    sleep 10
done

After doing the above for 2 commits,

scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite

can be used to create a table that compares the performance. That is the only format in which I will accept evidence for changes in performance. However, similar previous PRs inadvertently resulted in performance regressions, so I will not approve this or the other PR unless I can confirm the numbers myself. I cannot do that until Monday because the Strix Halo system that AMD sent me comes with only a single USB-A port.
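For illustration, the per-test comparison that such a table reports boils down to the relative throughput change between the two runs. The helper below is a hypothetical minimal model of that computation, not the actual compare-llama-bench.py script:

```python
def compare_runs(baseline: dict, patched: dict) -> dict:
    """Percent change in t/s per test between two llama-bench runs.

    baseline/patched: {test_name: tokens_per_second} measured on each commit.
    Tests present in only one run are skipped.
    """
    return {
        test: round((patched[test] / baseline[test] - 1.0) * 100.0, 2)
        for test in baseline
        if test in patched
    }

# Hypothetical numbers: a test going from 100 t/s to 125 t/s is a +25% change.
print(compare_runs({"pp2048": 100.0}, {"pp2048": 125.0}))  # {'pp2048': 25.0}
```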

@elfarolab commented:

@JohannesGaessler

Sorry for the interruption: is the ./bench utility tools/llama-bench, tools/server/bench/bench.py, or something else?

Thank you so much.

@Beinsezii (Contributor) commented:

> The testing methodology that I am using for matrix multiplications is something like this

Probably worth mentioning in CONTRIBUTING.md? I updated my PR #18537.

> Sorry for the interruption: is the ./bench utility tools/llama-bench, tools/server/bench/bench.py, or something else?

The llama-bench make target; it's an ELF binary.

@JohannesGaessler (Contributor) commented:

> Sorry for the interruption: is the ./bench utility tools/llama-bench, tools/server/bench/bench.py, or something else?

Sorry, that is a symlink to build/bin/llama-bench and I forgot to replace it when copy-pasting.

> Probably worth mentioning in CONTRIBUTING.md?

To be clear, when I said "only format" I meant llama-bench output. That is the standard I apply when reviewing PRs, though not necessarily the standard that other maintainers have.

@JohannesGaessler (Contributor) commented:

Superseded by #18537 .
