Conversation

@danielholanda (Contributor) commented Aug 13, 2025

Description

rocWMMA often yields better prompt-processing performance across models.

This PR compiles the ROCm builds with rocWMMA enabled (-DGGML_HIP_ROCWMMA_FATTN=ON).
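
For reference, here is a minimal sketch of such a build, assuming upstream llama.cpp's documented HIP CMake options. Only GGML_HIP_ROCWMMA_FATTN is taken from this repo's linked issue; the target and remaining flags are illustrative, not this repo's exact build script:

```sh
# Minimal sketch (assumed options, not this repo's exact build script):
# GGML_HIP enables the HIP/ROCm backend, AMDGPU_TARGETS picks the GPU family,
# and GGML_HIP_ROCWMMA_FATTN turns on rocWMMA for FlashAttention.
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```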

Results

gpt-oss-20b

As shown below, the rocWMMA-enabled build (first run) gives a large speed boost for prompt processing (pp512): 1423.31 t/s vs. 969.03 t/s without it (1423.31 / 969.03 ≈ 1.47, roughly 47% faster).

With rocWMMA

```
C:\Users\daniel-halo\Downloads\llama-windows-rocm-gfx1151-x64-fa>llama-bench.exe -m "C:\Users\daniel-halo\.cache\huggingface\hub\models--unsloth--gpt-oss-20b-GGUF\snapshots\4953a383fb4ca25caeb9bc0e366c441cbb219afe\gpt-oss-20b-Q4_K_M.gguf" -ngl 99 -fa 1
HIP Library Path: C:\Users\daniel-halo\Downloads\llama-windows-rocm-gfx1151-x64-fa\amdhip64_7.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           pp512 |       1423.31 ± 7.65 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           tg128 |         71.35 ± 0.28 |
```

Without rocWMMA

```
C:\Users\daniel-halo\Downloads\llama-b1028-windows-rocm-gfx1151-x64-main>llama-bench.exe -m "C:\Users\daniel-halo\.cache\huggingface\hub\models--unsloth--gpt-oss-20b-GGUF\snapshots\4953a383fb4ca25caeb9bc0e366c441cbb219afe\gpt-oss-20b-Q4_K_M.gguf" -ngl 99 -fa 1
HIP Library Path: C:\Users\daniel-halo\Downloads\llama-b1028-windows-rocm-gfx1151-x64-main\amdhip64_7.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           pp512 |       969.03 ± 12.72 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           tg128 |         69.89 ± 0.27 |

build: c24f4e26 (6149)
```

Target Devices

Compilation works across all target families as shown here:
https://github.com/lemonade-sdk/llamacpp-rocm/actions/runs/16947880973
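
For intuition, here is a hypothetical sketch of what such a per-family build matrix boils down to; the three targets below are examples, not the workflow's exact list:

```sh
# Illustrative only: one ROCm build per AMD GPU target family.
# The real family list lives in the linked GitHub Actions workflow.
for target in gfx1030 gfx1100 gfx1151; do
  cmake -B "build-$target" \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS="$target" \
    -DGGML_HIP_ROCWMMA_FATTN=ON
  cmake --build "build-$target" --config Release -j
done
```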

Closes #7

@kpoineal commented Sep 4, 2025

@danielholanda Is this going to be merged at some point? My testing shows a fairly significant performance boost.

@danielholanda (Contributor, Author)

Yes. I'm actively working on #9 and adding tests to make sure everything works properly, so we can merge this confidently.

@fassn commented Sep 9, 2025

> @danielholanda Is this going to be merged at some point? My testing shows a fairly significant performance boost.

To support this comment, here are my results.

@danielholanda (Contributor, Author)

@fassn Thanks for your comments. Planning on merging this today if this passes: https://github.com/lemonade-sdk/llamacpp-rocm/actions/runs/17588392336

@danielholanda (Contributor, Author)

All tests passing. Merging!

@danielholanda merged commit e9af9dc into main on Sep 9, 2025 (24 of 26 checks passed).
@danielholanda deleted the dholanda/flash_attention branch on October 3, 2025.

Development

Successfully merging this pull request may close these issues.

For gfx1151, llama.cpp should be built with -DGGML_HIP_ROCWMMA_FATTN=ON for a big performance boost
