
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated #18202

Merged
JohannesGaessler merged 1 commit into ggml-org:master from IMbackK:mmidopt on Dec 28, 2025
Conversation

@IMbackK (Collaborator) commented Dec 19, 2025

On MFMA hardware, MMQ performs better for medium-sized problems, while dequant + rocBLAS performs better for large problem sizes.

Currently, ggml_cuda_should_use_mmq chooses based on batch size and data type. This is suboptimal for MUL_MAT_ID: even if the involved tensors are large, we end up calling rocBLAS for a large number of small tensors when the number of experts is high, causing poor performance.
This PR addresses this by choosing MMQ when the number of experts is high (see the sketch below).
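To illustrate the idea, here is a minimal standalone sketch of that kind of heuristic. It is not the actual patch: the function name, parameters, and the expert-count threshold are illustrative assumptions, not the real ggml_cuda_should_use_mmq signature.

```cpp
// Hypothetical sketch of the expert-count heuristic described above; names and
// thresholds are illustrative, not the actual llama.cpp/ggml implementation.
#include <cstdint>
#include <cstdio>

// Assumed inputs: whether the op is MUL_MAT_ID, the number of experts the IDs
// select from, the (micro)batch size, and the batch-size threshold the regular
// MMQ-vs-BLAS decision would use.
static bool should_use_mmq_sketch(bool is_mul_mat_id,
                                  int64_t n_experts,
                                  int64_t batch_size,
                                  int64_t mmq_batch_threshold) {
    // Regular heuristic: MMQ wins for small/medium batches,
    // dequant + rocBLAS for large ones.
    bool use_mmq = batch_size < mmq_batch_threshold;

    // MUL_MAT_ID splits the batch across experts, so each per-expert GEMM sees
    // roughly batch_size / n_experts rows. With many experts every split is
    // small even if the overall tensors are large, so prefer MMQ over
    // launching many tiny BLAS GEMMs.
    if (is_mul_mat_id && n_experts >= 64) {  // threshold is illustrative
        use_mmq = true;
    }
    return use_mmq;
}

int main() {
    // Example: 128 experts, microbatch 512 -> each split averages only 4 rows.
    printf("use_mmq = %d\n", should_use_mmq_sketch(true, 128, 512, 256));
    return 0;
}
```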

Benchmarks on an MI100 @ 160 W power limit.

| Model | Microbatch size | Test | t/s master | t/s mmidopt | Speedup |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 32 | pp1024 | 737.25 | 745.02 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 64 | pp1024 | 962.68 | 974.75 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 128 | pp1024 | 955.28 | 967.76 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 256 | pp1024 | 1720.56 | 1725.10 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp1024 | 2277.16 | 2291.13 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp1024 | 2665.15 | 2685.24 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 32 | pp1024 | 436.42 | 434.94 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 64 | pp1024 | 562.45 | 563.55 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 128 | pp1024 | 716.47 | 721.23 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 256 | pp1024 | 1032.03 | 1124.19 | 1.09 |
| qwen3moe 30B.A3B Q4_K_M | 512 | pp1024 | 782.11 | 1497.25 | 1.91 |
| qwen3moe 30B.A3B Q4_K_M | 1024 | pp1024 | 1058.36 | 1738.98 | 1.64 |

Future note: it may be better to select based on the size of the resulting splits instead.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 19, 2025
@JohannesGaessler JohannesGaessler merged commit 4ffc47c into ggml-org:master Dec 28, 2025
70 of 71 checks passed
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026