
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes #14949

Merged
IMbackK merged 2 commits into ggml-org:master from IMbackK:mi100mfma on Jul 30, 2025

Conversation

IMbackK (Collaborator) commented on Jul 29, 2025

This PR enables the MFMA path graciously provided by @deepsek on CDNA1 (gfx908) and CDNA2 (gfx90a) devices, where it is more performant than the current BLAS path.

This PR is fairly conservative and only enables the path for the datatypes and batch sizes where it was tested to be more performant. It is likely that other datatypes would also benefit, and I will follow up with more at a later time.
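
For illustration, here is a minimal, self-contained sketch of the kind of gating described above. The names (`use_mfma_mmq`, `device_info`, `quant_type`) and the thresholds are hypothetical, loosely read off the measurement table below, and are not the actual llama.cpp code:

```cpp
// Minimal sketch of the gating idea: pick the MFMA MMQ kernel only on
// CDNA1/CDNA2 and only for the type/batch-size combinations that measured
// faster than the existing BLAS path. All names and thresholds here are
// illustrative, not the actual llama.cpp implementation.
#include <cstdint>

enum class quant_type { Q4_0, Q5_0, Q6_K, Q8_0, Q4_K, Q5_K, OTHER };

struct device_info {
    bool is_cdna1; // gfx908
    bool is_cdna2; // gfx90a
};

// hypothetical helper: should this mat-mul take the MFMA MMQ path?
static bool use_mfma_mmq(const device_info & dev, quant_type type, int64_t n_ubatch) {
    if (!(dev.is_cdna1 || dev.is_cdna2)) {
        return false; // other architectures keep their current dispatch
    }
    switch (type) {
        case quant_type::Q4_0:
        case quant_type::Q5_0:
            return true;              // faster at every micro-batch size measured
        case quant_type::Q8_0:
        case quant_type::Q4_K:
        case quant_type::Q5_K:
            return n_ubatch < 512;    // roughly break-even at the largest sizes
        case quant_type::Q6_K:
            return n_ubatch < 256;    // small-batch win only
        default:
            return false;             // untested types stay on the BLAS path for now
    }
}
```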

Measurements

All measurements on gfx908, pp1024.

Performance change:

| Model            | Microbatch size | Test   | t/s master | t/s mi100mfma | Speedup |
| ---------------- | --------------: | ------ | ---------: | ------------: | ------: |
| llama 13B Q5_K_M |              32 | pp1024 |     141.71 |        385.04 |    2.72 |
| llama 13B Q5_K_M |              64 | pp1024 |     214.31 |        575.49 |    2.69 |
| llama 13B Q5_K_M |             128 | pp1024 |     388.50 |        707.50 |    1.82 |
| llama 13B Q5_K_M |             256 | pp1024 |     653.29 |        777.11 |    1.19 |
| llama 13B Q5_K_M |             512 | pp1024 |     850.12 |        847.13 |    1.00 |
| llama 13B Q5_K_M |            1024 | pp1024 |    1137.55 |       1136.94 |    1.00 |
| llama 13B Q8_0   |              32 | pp1024 |     159.84 |        441.99 |    2.77 |
| llama 13B Q8_0   |              64 | pp1024 |     263.17 |        627.86 |    2.39 |
| llama 13B Q8_0   |             128 | pp1024 |     485.83 |        637.20 |    1.31 |
| llama 13B Q8_0   |             256 | pp1024 |     749.96 |        763.62 |    1.02 |
| llama 13B Q8_0   |             512 | pp1024 |     915.36 |        934.04 |    1.02 |
| llama 13B Q8_0   |            1024 | pp1024 |    1232.74 |       1216.02 |    0.99 |
| llama 7B Q4_0    |              32 | pp1024 |     465.36 |       1227.27 |    2.64 |
| llama 7B Q4_0    |              64 | pp1024 |     728.69 |       1615.49 |    2.22 |
| llama 7B Q4_0    |             128 | pp1024 |    1105.51 |       1850.48 |    1.67 |
| llama 7B Q4_0    |             256 | pp1024 |    1553.94 |       2158.73 |    1.39 |
| llama 7B Q4_0    |             512 | pp1024 |    1890.85 |       2220.39 |    1.17 |
| llama 7B Q4_0    |            1024 | pp1024 |    2115.34 |       2291.17 |    1.08 |
| llama 7B Q5_0    |              32 | pp1024 |     416.92 |       1016.73 |    2.44 |
| llama 7B Q5_0    |              64 | pp1024 |     757.01 |       1449.28 |    1.91 |
| llama 7B Q5_0    |             128 | pp1024 |    1125.59 |       1738.29 |    1.54 |
| llama 7B Q5_0    |             256 | pp1024 |    1567.06 |       2025.97 |    1.29 |
| llama 7B Q5_0    |             512 | pp1024 |    1875.81 |       2094.18 |    1.12 |
| llama 7B Q5_0    |            1024 | pp1024 |    1978.71 |       2153.81 |    1.09 |
| llama 7B Q6_K    |              32 | pp1024 |     312.87 |        775.24 |    2.48 |
| llama 7B Q6_K    |              64 | pp1024 |     688.96 |       1026.81 |    1.49 |
| llama 7B Q6_K    |             128 | pp1024 |    1064.16 |       1248.84 |    1.17 |
| llama 7B Q6_K    |             256 | pp1024 |    1484.92 |       1470.56 |    0.99 |
| llama 7B Q6_K    |             512 | pp1024 |    1731.99 |       1719.56 |    0.99 |
| llama 7B Q6_K    |            1024 | pp1024 |    1836.89 |       1835.06 |    1.00 |
| llama 8B Q4_K_M  |              32 | pp1024 |     301.40 |       1023.17 |    3.39 |
| llama 8B Q4_K_M  |              64 | pp1024 |     628.02 |       1426.31 |    2.27 |
| llama 8B Q4_K_M  |             128 | pp1024 |    1197.32 |       1814.15 |    1.52 |
| llama 8B Q4_K_M  |             256 | pp1024 |    1986.34 |       2217.39 |    1.12 |
| llama 8B Q4_K_M  |             512 | pp1024 |    2498.98 |       2490.22 |    1.00 |
| llama 8B Q4_K_M  |            1024 | pp1024 |    2726.83 |       2718.23 |    1.00 |

MFMA forced for all datatypes and batch sizes:

| model                          |     params | n_ubatch |                  t/s |
| ------------------------------ | ---------: | -------: | -------------------: |
| llama 7B Q4_0                  |     6.74 B |       32 |       1278.43 ± 3.66 |
| llama 7B Q4_0                  |     6.74 B |       64 |       1722.63 ± 4.00 |
| llama 7B Q4_0                  |     6.74 B |      128 |       2009.38 ± 5.06 |
| llama 7B Q4_0                  |     6.74 B |      256 |       2401.78 ± 2.37 |
| llama 7B Q4_0                  |     6.74 B |      512 |       2470.53 ± 7.83 |
| llama 7B Q4_0                  |     6.74 B |     1024 |       2571.22 ± 2.30 |
| llama 8B Q4_K - Medium         |     8.03 B |       32 |       1080.02 ± 2.33 |
| llama 8B Q4_K - Medium         |     8.03 B |       64 |       1553.58 ± 1.27 |
| llama 8B Q4_K - Medium         |     8.03 B |      128 |       2012.27 ± 0.61 |
| llama 8B Q4_K - Medium         |     8.03 B |      256 |       2286.34 ± 0.87 |
| llama 8B Q4_K - Medium         |     8.03 B |      512 |       2358.34 ± 1.46 |
| llama 8B Q4_K - Medium         |     8.03 B |     1024 |       2376.71 ± 0.72 |
| llama 7B Q5_0                  |     6.74 B |       32 |       1080.44 ± 1.48 |
| llama 7B Q5_0                  |     6.74 B |       64 |       1589.45 ± 0.79 |
| llama 7B Q5_0                  |     6.74 B |      128 |       1897.23 ± 4.16 |
| llama 7B Q5_0                  |     6.74 B |      256 |       2250.23 ± 1.01 |
| llama 7B Q5_0                  |     6.74 B |      512 |       2320.69 ± 1.50 |
| llama 7B Q5_0                  |     6.74 B |     1024 |       2342.06 ± 6.30 |
| llama 7B Q6_K                  |     6.74 B |       32 |        843.44 ± 2.82 |
| llama 7B Q6_K                  |     6.74 B |       64 |       1137.25 ± 2.91 |
| llama 7B Q6_K                  |     6.74 B |      128 |       1392.07 ± 0.57 |
| llama 7B Q6_K                  |     6.74 B |      256 |       1568.60 ± 0.38 |
| llama 7B Q6_K                  |     6.74 B |      512 |       1597.53 ± 0.40 |
| llama 7B Q6_K                  |     6.74 B |     1024 |       1620.83 ± 3.65 |
| llama 13B Q8_0                 |    23.57 B |       32 |        453.19 ± 0.44 |
| llama 13B Q8_0                 |    23.57 B |       64 |        674.65 ± 0.18 |
| llama 13B Q8_0                 |    23.57 B |      128 |        663.48 ± 3.70 |
| llama 13B Q8_0                 |    23.57 B |      256 |        739.34 ± 0.95 |
| llama 13B Q8_0                 |    23.57 B |      512 |        770.74 ± 1.21 |
| llama 13B Q8_0                 |    23.57 B |     1024 |        779.23 ± 1.25 |

github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Jul 29, 2025
IMbackK (Collaborator, Author) commented on Jul 30, 2025

Side note: I think it might also be worth trying stream-k on GCN.
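
Not part of this PR, but to illustrate the stream-k idea mentioned above: instead of assigning whole output tiles to workgroups, the combined K-iteration space of all tiles is split evenly across the available compute units, so load stays balanced even when the tile count does not divide the CU count; split tiles then need a partial-result fixup. A rough host-side sketch, with all numbers and names made up for illustration:

```cpp
// Rough illustration of stream-k style work partitioning (not llama.cpp code):
// the global iteration space (tiles * K-iterations) is divided evenly among
// the compute units instead of handing each CU whole tiles.
#include <cstdio>

int main() {
    const int tiles_m = 7, tiles_n = 5;  // output tile grid (arbitrary example)
    const int k_iters_per_tile = 16;     // K-loop iterations needed per tile
    const int num_cus = 60;              // e.g. a GCN-class CU count

    const int total_iters = tiles_m * tiles_n * k_iters_per_tile;

    for (int cu = 0; cu < num_cus; ++cu) {
        // contiguous slice of the global iteration space handled by this CU
        const int begin = total_iters * cu / num_cus;
        const int end   = total_iters * (cu + 1) / num_cus;
        const int first_tile = begin / k_iters_per_tile;
        const int last_tile  = (end - 1) / k_iters_per_tile;
        printf("CU %2d: iters [%4d, %4d) spans tiles %2d..%2d\n",
               cu, begin, end, first_tile, last_tile);
        // tiles whose K-range is split across CUs need a partial-result
        // reduction afterwards; that fixup is the cost stream-k trades for
        // better load balance on awkward tile counts.
    }
    return 0;
}
```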

@IMbackK IMbackK merged commit ad4a700 into ggml-org:master Jul 30, 2025
87 of 88 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026