UPSTREAM PR #18202: HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated by loci-dev · Pull Request #623 · auroralabs-loci/llama.cpp

loci-dev · 2025-12-19T13:42:18Z

On MFMA hardware, MMQ performs better for medium-sized problems, while dequant+rocblas performs better for large problem sizes.

Currently, ggml_cuda_should_use_mmq chooses based on batch size and data type. This is suboptimal for MUL_MAT_ID: even if the involved tensors are large, a high number of experts means we end up calling rocblas for a large number of small tensors, causing poor performance.
This PR addresses this by choosing MMQ when the number of experts is high.
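
The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `prefer_mmq_sketch` and both thresholds are hypothetical, and the data-type part of the real check is left out.

```cpp
#include <cstdint>

// Hypothetical thresholds, chosen only for illustration; they are NOT the values used in ggml.
constexpr int64_t kLargeBatchThreshold  = 128; // assumed batch size from which dequant+rocblas wins
constexpr int64_t kManyExpertsThreshold = 64;  // assumed expert count from which MUL_MAT_ID splits get too small

// Returns true if MMQ should be preferred over dequant+rocblas.
// n_experts == 0 stands for a plain MUL_MAT without expert routing.
constexpr bool prefer_mmq_sketch(int64_t batch_size, int64_t n_experts) {
    // Small and medium batches: MMQ is preferred regardless of expert count.
    // Large batches: prefer MMQ only when many experts would split the work
    // into a large number of small matrix multiplications.
    return batch_size < kLargeBatchThreshold || n_experts >= kManyExpertsThreshold;
}
```

With these assumed thresholds, a large batch on a dense model still goes to dequant+rocblas, while the same batch on a model with many experts stays on MMQ.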

Benchmarks on an MI100 @ 160W power limit.

Model	Microbatch size	Test	t/s master	t/s mmidopt	Speedup
gpt-oss 20B MXFP4 MoE	32	pp1024	737.25	745.02	1.01
gpt-oss 20B MXFP4 MoE	64	pp1024	962.68	974.75	1.01
gpt-oss 20B MXFP4 MoE	128	pp1024	955.28	967.76	1.01
gpt-oss 20B MXFP4 MoE	256	pp1024	1720.56	1725.10	1.00
gpt-oss 20B MXFP4 MoE	512	pp1024	2277.16	2291.13	1.01
gpt-oss 20B MXFP4 MoE	1024	pp1024	2665.15	2685.24	1.01
qwen3moe 30B.A3B Q4_K_M	32	pp1024	436.42	434.94	1.00
qwen3moe 30B.A3B Q4_K_M	64	pp1024	562.45	563.55	1.00
qwen3moe 30B.A3B Q4_K_M	128	pp1024	716.47	721.23	1.01
qwen3moe 30B.A3B Q4_K_M	256	pp1024	1032.03	1124.19	1.09
qwen3moe 30B.A3B Q4_K_M	512	pp1024	782.11	1497.25	1.91
qwen3moe 30B.A3B Q4_K_M	1024	pp1024	1058.36	1738.98	1.64

Future note: it might be better to select based on the size of the resulting splits.
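
One way such a split-size criterion might look is sketched below. It is an assumption-laden illustration, not part of this PR: it assumes tokens are routed uniformly across experts, and the names `avg_rows_per_split`, `prefer_mmq_by_split_size` and the cut-off `kMinRowsPerSplit` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical cut-off: splits with fewer rows than this are assumed to be too small for rocblas.
constexpr int64_t kMinRowsPerSplit = 32;

// Estimated rows per expert split, assuming uniform routing:
// each token is sent to n_experts_used of the n_experts experts.
constexpr int64_t avg_rows_per_split(int64_t n_tokens, int64_t n_experts_used, int64_t n_experts) {
    return n_experts > 0 ? (n_tokens * n_experts_used) / n_experts : n_tokens;
}

// Prefer MMQ when the estimated split is smaller than the assumed cut-off.
constexpr bool prefer_mmq_by_split_size(int64_t n_tokens, int64_t n_experts_used, int64_t n_experts) {
    return avg_rows_per_split(n_tokens, n_experts_used, n_experts) < kMinRowsPerSplit;
}
```

Unlike a fixed expert-count threshold, this estimate also depends on the batch size and on how many experts are active per token, so a very large batch could still be sent to rocblas even when the expert count is high.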

HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated

7b05bd2

loci-dev had a problem deploying to PROD__AL_DEMO December 19, 2025 13:42 — with GitHub Actions Error

loci-dev force-pushed the main branch 19 times, most recently from 26a6f0f to cf53bc9 December 22, 2025 14:09

DajanaV closed this Dec 22, 2025

DajanaV deleted the upstream-PR18202-branch_IMbackK-mmidopt branch December 22, 2025 14:28

loci-dev wants to merge 1 commit into main from upstream-PR18202-branch_IMbackK-mmidopt
