Patch perf regression for mmq kernels in ROCm#18442

Closed
jiachengjason wants to merge 2 commits into ggml-org:master from jiachengjason:fix/jiachengjason/rocm7.x_regression

Conversation

@jiachengjason (Contributor) commented Dec 28, 2025

Recovers the performance regression from #17917 on RDNA3 and RDNA4 by choosing more performant configs for the MMQ kernels.

This uses an approach similar to the CDNA config for now to patch the regression; further optimization will follow in upcoming PRs.

Strix Halo performance after the patch:

| model | size | params | backend | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 | 1001.53 ± 32.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 | 51.83 ± 0.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d4096 | 769.89 ± 2.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d4096 | 42.45 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d8192 | 595.15 ± 1.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d8192 | 36.75 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d16384 | 402.48 ± 0.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d16384 | 29.10 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | pp2048 @ d32768 | 224.79 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 2048 | 1 | tg32 @ d32768 | 19.99 ± 0.02 |

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 28, 2025
@Beinsezii (Contributor) commented Dec 29, 2025

Did a quick compile on gfx1100; it looks like a small 24B model @ Q6_K now matches pre-#17576 performance without forcing cuBLAS.

Qwen 30B-A3B (Unsloth Q5_K_XL) lost 40% of its pp throughput versus MMQ; it might need an n_experts branch like the CDNA path?

@Beinsezii (Contributor) commented Dec 29, 2025

Naively copying the CDNA path as presented in #18202:

diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index 1a297971..0a0d440a 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -333,7 +333,10 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
     }
 
     if (amd_wmma_available(cc)) {
-        if (ne11 <= 128 || type == GGML_TYPE_Q4_0 || type == GGML_TYPE_Q4_1 || type == GGML_TYPE_Q5_0 || type == GGML_TYPE_Q5_1) {
+        if (n_experts > 64 || ne11 <= 128) {
+            return true;
+        }
+        if (type == GGML_TYPE_Q4_0 || type == GGML_TYPE_Q4_1 || type == GGML_TYPE_Q5_0 || type == GGML_TYPE_Q5_1) {
             return true;
         }
         if (ne11 <= 256 && (type == GGML_TYPE_Q4_K || type == GGML_TYPE_Q5_K)) {

makes this positive or equal on all the models I happen to have on disk, but I'm not sure whether RDNA should use a different expert-count threshold.
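As an aside, the RDNA (amd_wmma) branch in the diff above can be modeled as a small pure function. This is only an illustrative sketch: the function name and the type constants are stand-ins, not the actual ggml-cuda API, and any branches outside the visible diff context are assumed to return False here.

```python
# Sketch of the RDNA (amd_wmma) branch of ggml_cuda_should_use_mmq from the
# diff above. Type constants are illustrative stand-ins for the GGML_TYPE_*
# enum; branches beyond the visible diff context are assumed to return False.
Q4_0, Q4_1, Q5_0, Q5_1, Q4_K, Q5_K, Q6_K = range(7)

def should_use_mmq_rdna(qtype: int, ne11: int, n_experts: int) -> bool:
    # MoE models with many experts, or small batch sizes (ne11), favor MMQ.
    if n_experts > 64 or ne11 <= 128:
        return True
    # The legacy quant types always take the MMQ path here.
    if qtype in (Q4_0, Q4_1, Q5_0, Q5_1):
        return True
    # Mid-size batches still favor MMQ for these K-quants.
    if ne11 <= 256 and qtype in (Q4_K, Q5_K):
        return True
    return False
```

This makes clear why the `n_experts > 64` branch helps MoE models like Qwen 30B-A3B: with a large expert count, large-batch K-quant matmuls that would otherwise fall through to dequant+GEMM stay on MMQ.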

Looking through #14949, it seems this block was tuned by pretty much just A/B testing every quant type, which I don't have time for today; this likely needs to be redone for RDNA.

@jiachengjason jiachengjason marked this pull request as ready for review December 29, 2025 17:03
@Beinsezii (Contributor) commented:

For my new year's resolution I've picked this up and created more granular switching in #18537, based on results from semi-automated benchmarks. You might want to check it on Strix Halo, @jiachengjason.

@JohannesGaessler (Contributor) commented:

The testing methodology that I am using for matrix multiplications is something like this:

export mn=llama_3-8b
for q in q4_0 q4_1 q5_0 q5_1 q8_0 q2_k_s q3_k_s q4_k_s q5_k_s q6_k iq1_s iq2_xxs iq2_xs iq2_s iq3_xxs iq3_xs iq3_s iq3_m iq4_nl iq4_xs; do
    echo $q
    ./bench --model models/opt/${mn}-${q}.gguf -r 1 -fa 1 -n 0 -p 2048 -ub "1-2048*2" --progress -o sql | sqlite3 llama-bench.sqlite
    sleep 10
done

After doing the above for 2 commits,

scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite

can be used to create a table that compares the performance. That is the only format in which I will accept evidence for changes in performance. However, similar previous PRs inadvertently resulted in performance regressions, so I will not approve this or the other PR unless I can confirm the numbers myself. I cannot do that until Monday because the Strix Halo system that AMD sent me comes with only a single USB-A port.
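For illustration, the per-test comparison that such a table reports boils down to the relative throughput change between the two runs. The helper below is a hypothetical minimal model of that computation, not the actual compare-llama-bench.py script:

```python
def compare_runs(baseline: dict, patched: dict) -> dict:
    """Percent change in t/s per test between two llama-bench runs.

    baseline/patched: {test_name: tokens_per_second} measured on each commit.
    Tests present in only one run are skipped.
    """
    return {
        test: round((patched[test] / baseline[test] - 1.0) * 100.0, 2)
        for test in baseline
        if test in patched
    }

# Hypothetical numbers: a test going from 100 t/s to 125 t/s is a +25% change.
print(compare_runs({"pp2048": 100.0}, {"pp2048": 125.0}))  # {'pp2048': 25.0}
```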

@elfarolab commented:

@JohannesGaessler

Sorry for the interruption: is the ./bench utility tools/llama-bench, tools/server/bench/bench.py, or something else?

Thank you so much.

@Beinsezii (Contributor) commented:

> The testing methodology that I am using for matrix multiplications is something like this

Probably worth mentioning in CONTRIBUTING.md? I updated my PR #18537.

> Sorry for the interruption: is the ./bench utility tools/llama-bench, tools/server/bench/bench.py, or something else?

The llama-bench make target; it's an ELF binary.

@JohannesGaessler (Contributor) commented:

> Sorry for the interruption: is the ./bench utility tools/llama-bench, tools/server/bench/bench.py, or something else?

Sorry, that is a symlink to build/bin/llama-bench and I forgot to replace it when copy-pasting.

> Probably worth mentioning in CONTRIBUTING.md?

To be clear, when I said "only format" I meant llama-bench output. That is the standard I apply when reviewing PRs, though not necessarily the standard that other maintainers have.

@JohannesGaessler (Contributor) commented:

Superseded by #18537 .
