
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated #18202

Merged
JohannesGaessler merged 1 commit into ggml-org:master from IMbackK:mmidopt on Dec 28, 2025
Conversation

@IMbackK (Collaborator) commented Dec 19, 2025

On MFMA hardware, MMQ performs better for medium-sized problems, while dequant + rocBLAS performs better for large problem sizes.

Currently, ggml_cuda_should_use_mmq chooses based on batch size and data type. This is suboptimal for MUL_MAT_ID: even if the involved tensors are large, we end up calling rocBLAS for a large number of small tensors when the number of experts is high, causing poor performance.
This PR addresses this by choosing MMQ when the number of experts is high (see the sketch below).
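To illustrate the idea, here is a minimal standalone sketch of that kind of heuristic. It is not the actual patch: the function name, parameters, and the expert-count threshold are illustrative assumptions, not the real ggml_cuda_should_use_mmq signature.

```cpp
// Hypothetical sketch of the expert-count heuristic described above; names and
// thresholds are illustrative, not the actual llama.cpp/ggml implementation.
#include <cstdint>
#include <cstdio>

// Assumed inputs: whether the op is MUL_MAT_ID, the number of experts the IDs
// select from, the (micro)batch size, and the batch-size threshold the regular
// MMQ-vs-BLAS decision would use.
static bool should_use_mmq_sketch(bool is_mul_mat_id,
                                  int64_t n_experts,
                                  int64_t batch_size,
                                  int64_t mmq_batch_threshold) {
    // Regular heuristic: MMQ wins for small/medium batches,
    // dequant + rocBLAS for large ones.
    bool use_mmq = batch_size < mmq_batch_threshold;

    // MUL_MAT_ID splits the batch across experts, so each per-expert GEMM sees
    // roughly batch_size / n_experts rows. With many experts every split is
    // small even if the overall tensors are large, so prefer MMQ over
    // launching many tiny BLAS GEMMs.
    if (is_mul_mat_id && n_experts >= 64) {  // threshold is illustrative
        use_mmq = true;
    }
    return use_mmq;
}

int main() {
    // Example: 128 experts, microbatch 512 -> each split averages only 4 rows.
    printf("use_mmq = %d\n", should_use_mmq_sketch(true, 128, 512, 256));
    return 0;
}
```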

Benchmarks on an MI100 @ 160 W power limit.

| Model | Microbatch size | Test | t/s master | t/s mmidopt | Speedup |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 32 | pp1024 | 737.25 | 745.02 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 64 | pp1024 | 962.68 | 974.75 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 128 | pp1024 | 955.28 | 967.76 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 256 | pp1024 | 1720.56 | 1725.10 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp1024 | 2277.16 | 2291.13 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp1024 | 2665.15 | 2685.24 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 32 | pp1024 | 436.42 | 434.94 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 64 | pp1024 | 562.45 | 563.55 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 128 | pp1024 | 716.47 | 721.23 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 256 | pp1024 | 1032.03 | 1124.19 | 1.09 |
| qwen3moe 30B.A3B Q4_K_M | 512 | pp1024 | 782.11 | 1497.25 | 1.91 |
| qwen3moe 30B.A3B Q4_K_M | 1024 | pp1024 | 1058.36 | 1738.98 | 1.64 |

Future note: it may be better to select based on the size of the resulting splits instead.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 19, 2025
@JohannesGaessler JohannesGaessler merged commit 4ffc47c into ggml-org:master Dec 28, 2025
70 of 71 checks passed
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026