
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes #14949

Merged
IMbackK merged 2 commits into ggml-org:master from IMbackK:mi100mfma on Jul 30, 2025

Conversation

IMbackK (Collaborator) commented on Jul 29, 2025

This PR enables the MFMA path graciously provided by @deepsek on CDNA1 (gfx908) and CDNA2 (gfx90a) devices, where it is more performant than the current BLAS path.

This PR is fairly conservative and only enables the path for the datatypes and batch sizes where it was tested to be more performant. It is likely that other datatypes would also benefit, and I will follow up with more at a later time.
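
For illustration, here is a minimal, self-contained sketch of the kind of gating described above. The names (`use_mfma_mmq`, `device_info`, `quant_type`) and the thresholds are hypothetical, loosely read off the measurement table below, and are not the actual llama.cpp code:

```cpp
// Minimal sketch of the gating idea: pick the MFMA MMQ kernel only on
// CDNA1/CDNA2 and only for the type/batch-size combinations that measured
// faster than the existing BLAS path. All names and thresholds here are
// illustrative, not the actual llama.cpp implementation.
#include <cstdint>

enum class quant_type { Q4_0, Q5_0, Q6_K, Q8_0, Q4_K, Q5_K, OTHER };

struct device_info {
    bool is_cdna1; // gfx908
    bool is_cdna2; // gfx90a
};

// hypothetical helper: should this mat-mul take the MFMA MMQ path?
static bool use_mfma_mmq(const device_info & dev, quant_type type, int64_t n_ubatch) {
    if (!(dev.is_cdna1 || dev.is_cdna2)) {
        return false; // other architectures keep their current dispatch
    }
    switch (type) {
        case quant_type::Q4_0:
        case quant_type::Q5_0:
            return true;              // faster at every micro-batch size measured
        case quant_type::Q8_0:
        case quant_type::Q4_K:
        case quant_type::Q5_K:
            return n_ubatch < 512;    // roughly break-even at the largest sizes
        case quant_type::Q6_K:
            return n_ubatch < 256;    // small-batch win only
        default:
            return false;             // untested types stay on the BLAS path for now
    }
}
```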

Measurements

All measurements on gfx908, pp1024.

Performance change:

| Model            | Microbatch size | Test   | t/s master | t/s mi100mfma | Speedup |
| ---------------- | --------------: | ------ | ---------: | ------------: | ------: |
| llama 13B Q5_K_M |              32 | pp1024 |     141.71 |        385.04 |    2.72 |
| llama 13B Q5_K_M |              64 | pp1024 |     214.31 |        575.49 |    2.69 |
| llama 13B Q5_K_M |             128 | pp1024 |     388.50 |        707.50 |    1.82 |
| llama 13B Q5_K_M |             256 | pp1024 |     653.29 |        777.11 |    1.19 |
| llama 13B Q5_K_M |             512 | pp1024 |     850.12 |        847.13 |    1.00 |
| llama 13B Q5_K_M |            1024 | pp1024 |    1137.55 |       1136.94 |    1.00 |
| llama 13B Q8_0   |              32 | pp1024 |     159.84 |        441.99 |    2.77 |
| llama 13B Q8_0   |              64 | pp1024 |     263.17 |        627.86 |    2.39 |
| llama 13B Q8_0   |             128 | pp1024 |     485.83 |        637.20 |    1.31 |
| llama 13B Q8_0   |             256 | pp1024 |     749.96 |        763.62 |    1.02 |
| llama 13B Q8_0   |             512 | pp1024 |     915.36 |        934.04 |    1.02 |
| llama 13B Q8_0   |            1024 | pp1024 |    1232.74 |       1216.02 |    0.99 |
| llama 7B Q4_0    |              32 | pp1024 |     465.36 |       1227.27 |    2.64 |
| llama 7B Q4_0    |              64 | pp1024 |     728.69 |       1615.49 |    2.22 |
| llama 7B Q4_0    |             128 | pp1024 |    1105.51 |       1850.48 |    1.67 |
| llama 7B Q4_0    |             256 | pp1024 |    1553.94 |       2158.73 |    1.39 |
| llama 7B Q4_0    |             512 | pp1024 |    1890.85 |       2220.39 |    1.17 |
| llama 7B Q4_0    |            1024 | pp1024 |    2115.34 |       2291.17 |    1.08 |
| llama 7B Q5_0    |              32 | pp1024 |     416.92 |       1016.73 |    2.44 |
| llama 7B Q5_0    |              64 | pp1024 |     757.01 |       1449.28 |    1.91 |
| llama 7B Q5_0    |             128 | pp1024 |    1125.59 |       1738.29 |    1.54 |
| llama 7B Q5_0    |             256 | pp1024 |    1567.06 |       2025.97 |    1.29 |
| llama 7B Q5_0    |             512 | pp1024 |    1875.81 |       2094.18 |    1.12 |
| llama 7B Q5_0    |            1024 | pp1024 |    1978.71 |       2153.81 |    1.09 |
| llama 7B Q6_K    |              32 | pp1024 |     312.87 |        775.24 |    2.48 |
| llama 7B Q6_K    |              64 | pp1024 |     688.96 |       1026.81 |    1.49 |
| llama 7B Q6_K    |             128 | pp1024 |    1064.16 |       1248.84 |    1.17 |
| llama 7B Q6_K    |             256 | pp1024 |    1484.92 |       1470.56 |    0.99 |
| llama 7B Q6_K    |             512 | pp1024 |    1731.99 |       1719.56 |    0.99 |
| llama 7B Q6_K    |            1024 | pp1024 |    1836.89 |       1835.06 |    1.00 |
| llama 8B Q4_K_M  |              32 | pp1024 |     301.40 |       1023.17 |    3.39 |
| llama 8B Q4_K_M  |              64 | pp1024 |     628.02 |       1426.31 |    2.27 |
| llama 8B Q4_K_M  |             128 | pp1024 |    1197.32 |       1814.15 |    1.52 |
| llama 8B Q4_K_M  |             256 | pp1024 |    1986.34 |       2217.39 |    1.12 |
| llama 8B Q4_K_M  |             512 | pp1024 |    2498.98 |       2490.22 |    1.00 |
| llama 8B Q4_K_M  |            1024 | pp1024 |    2726.83 |       2718.23 |    1.00 |

MFMA forced for all datatypes and batch sizes:

| model                          |     params | n_ubatch |                  t/s |
| ------------------------------ | ---------: | -------: | -------------------: |
| llama 7B Q4_0                  |     6.74 B |       32 |       1278.43 ± 3.66 |
| llama 7B Q4_0                  |     6.74 B |       64 |       1722.63 ± 4.00 |
| llama 7B Q4_0                  |     6.74 B |      128 |       2009.38 ± 5.06 |
| llama 7B Q4_0                  |     6.74 B |      256 |       2401.78 ± 2.37 |
| llama 7B Q4_0                  |     6.74 B |      512 |       2470.53 ± 7.83 |
| llama 7B Q4_0                  |     6.74 B |     1024 |       2571.22 ± 2.30 |
| llama 8B Q4_K - Medium         |     8.03 B |       32 |       1080.02 ± 2.33 |
| llama 8B Q4_K - Medium         |     8.03 B |       64 |       1553.58 ± 1.27 |
| llama 8B Q4_K - Medium         |     8.03 B |      128 |       2012.27 ± 0.61 |
| llama 8B Q4_K - Medium         |     8.03 B |      256 |       2286.34 ± 0.87 |
| llama 8B Q4_K - Medium         |     8.03 B |      512 |       2358.34 ± 1.46 |
| llama 8B Q4_K - Medium         |     8.03 B |     1024 |       2376.71 ± 0.72 |
| llama 7B Q5_0                  |     6.74 B |       32 |       1080.44 ± 1.48 |
| llama 7B Q5_0                  |     6.74 B |       64 |       1589.45 ± 0.79 |
| llama 7B Q5_0                  |     6.74 B |      128 |       1897.23 ± 4.16 |
| llama 7B Q5_0                  |     6.74 B |      256 |       2250.23 ± 1.01 |
| llama 7B Q5_0                  |     6.74 B |      512 |       2320.69 ± 1.50 |
| llama 7B Q5_0                  |     6.74 B |     1024 |       2342.06 ± 6.30 |
| llama 7B Q6_K                  |     6.74 B |       32 |        843.44 ± 2.82 |
| llama 7B Q6_K                  |     6.74 B |       64 |       1137.25 ± 2.91 |
| llama 7B Q6_K                  |     6.74 B |      128 |       1392.07 ± 0.57 |
| llama 7B Q6_K                  |     6.74 B |      256 |       1568.60 ± 0.38 |
| llama 7B Q6_K                  |     6.74 B |      512 |       1597.53 ± 0.40 |
| llama 7B Q6_K                  |     6.74 B |     1024 |       1620.83 ± 3.65 |
| llama 13B Q8_0                 |    23.57 B |       32 |        453.19 ± 0.44 |
| llama 13B Q8_0                 |    23.57 B |       64 |        674.65 ± 0.18 |
| llama 13B Q8_0                 |    23.57 B |      128 |        663.48 ± 3.70 |
| llama 13B Q8_0                 |    23.57 B |      256 |        739.34 ± 0.95 |
| llama 13B Q8_0                 |    23.57 B |      512 |        770.74 ± 1.21 |
| llama 13B Q8_0                 |    23.57 B |     1024 |        779.23 ± 1.25 |

github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Jul 29, 2025
IMbackK (Collaborator, Author) commented on Jul 30, 2025

Side note: I think it might also be worth trying stream-k on GCN.
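
Not part of this PR, but to illustrate the stream-k idea mentioned above: instead of assigning whole output tiles to workgroups, the combined K-iteration space of all tiles is split evenly across the available compute units, so load stays balanced even when the tile count does not divide the CU count; split tiles then need a partial-result fixup. A rough host-side sketch, with all numbers and names made up for illustration:

```cpp
// Rough illustration of stream-k style work partitioning (not llama.cpp code):
// the global iteration space (tiles * K-iterations) is divided evenly among
// the compute units instead of handing each CU whole tiles.
#include <cstdio>

int main() {
    const int tiles_m = 7, tiles_n = 5;  // output tile grid (arbitrary example)
    const int k_iters_per_tile = 16;     // K-loop iterations needed per tile
    const int num_cus = 60;              // e.g. a GCN-class CU count

    const int total_iters = tiles_m * tiles_n * k_iters_per_tile;

    for (int cu = 0; cu < num_cus; ++cu) {
        // contiguous slice of the global iteration space handled by this CU
        const int begin = total_iters * cu / num_cus;
        const int end   = total_iters * (cu + 1) / num_cus;
        const int first_tile = begin / k_iters_per_tile;
        const int last_tile  = (end - 1) / k_iters_per_tile;
        printf("CU %2d: iters [%4d, %4d) spans tiles %2d..%2d\n",
               cu, begin, end, first_tile, last_tile);
        // tiles whose K-range is split across CUs need a partial-result
        // reduction afterwards; that fixup is the cost stream-k trades for
        // better load balance on awkward tile counts.
    }
    return 0;
}
```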

@IMbackK IMbackK merged commit ad4a700 into ggml-org:master Jul 30, 2025
87 of 88 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026