Conversation

@danielholanda (Contributor) commented Aug 13, 2025

Description

rocWMMA often yields better prompt-processing performance across models.

This PR compiles the ROCm builds with rocWMMA enabled (-DGGML_HIP_ROCWMMA_FATTN=ON).
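
For reference, here is a minimal sketch of such a build, assuming upstream llama.cpp's documented HIP CMake options. Only GGML_HIP_ROCWMMA_FATTN is taken from this repo's linked issue; the target and remaining flags are illustrative, not this repo's exact build script:

```sh
# Minimal sketch (assumed options, not this repo's exact build script):
# GGML_HIP enables the HIP/ROCm backend, AMDGPU_TARGETS picks the GPU family,
# and GGML_HIP_ROCWMMA_FATTN turns on rocWMMA for FlashAttention.
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```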

Results

gpt-oss-20b

As shown below, the rocWMMA-enabled build (first run) gives a large speed boost for prompt processing (pp512): 1423.31 t/s vs. 969.03 t/s without it (1423.31 / 969.03 ≈ 1.47, roughly 47% faster).

With rocWMMA

```
C:\Users\daniel-halo\Downloads\llama-windows-rocm-gfx1151-x64-fa>llama-bench.exe -m "C:\Users\daniel-halo\.cache\huggingface\hub\models--unsloth--gpt-oss-20b-GGUF\snapshots\4953a383fb4ca25caeb9bc0e366c441cbb219afe\gpt-oss-20b-Q4_K_M.gguf" -ngl 99 -fa 1
HIP Library Path: C:\Users\daniel-halo\Downloads\llama-windows-rocm-gfx1151-x64-fa\amdhip64_7.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           pp512 |       1423.31 ± 7.65 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           tg128 |         71.35 ± 0.28 |
```

Without rocWMMA

```
C:\Users\daniel-halo\Downloads\llama-b1028-windows-rocm-gfx1151-x64-main>llama-bench.exe -m "C:\Users\daniel-halo\.cache\huggingface\hub\models--unsloth--gpt-oss-20b-GGUF\snapshots\4953a383fb4ca25caeb9bc0e366c441cbb219afe\gpt-oss-20b-Q4_K_M.gguf" -ngl 99 -fa 1
HIP Library Path: C:\Users\daniel-halo\Downloads\llama-b1028-windows-rocm-gfx1151-x64-main\amdhip64_7.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           pp512 |       969.03 ± 12.72 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | ROCm       |  99 |  1 |           tg128 |         69.89 ± 0.27 |

build: c24f4e26 (6149)
```

Target Devices

Compilation works across all target families as shown here:
https://github.com/lemonade-sdk/llamacpp-rocm/actions/runs/16947880973
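
For intuition, here is a hypothetical sketch of what such a per-family build matrix boils down to; the three targets below are examples, not the workflow's exact list:

```sh
# Illustrative only: one ROCm build per AMD GPU target family.
# The real family list lives in the linked GitHub Actions workflow.
for target in gfx1030 gfx1100 gfx1151; do
  cmake -B "build-$target" \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS="$target" \
    -DGGML_HIP_ROCWMMA_FATTN=ON
  cmake --build "build-$target" --config Release -j
done
```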

Closes #7

@kpoineal commented Sep 4, 2025

@danielholanda Is this going to be merged at some point? My testing shows a fairly significant performance boost.

@danielholanda (Contributor, Author)

Yes. I'm actively working on #9 and adding tests to make sure everything works properly, so we can merge this confidently.

@fassn commented Sep 9, 2025

> @danielholanda Is this going to be merged at some point? My testing shows a fairly significant performance boost.

To support this comment, here are my results.

@danielholanda (Contributor, Author)

@fassn Thanks for your comments. Planning on merging this today if this passes: https://github.com/lemonade-sdk/llamacpp-rocm/actions/runs/17588392336

@danielholanda (Contributor, Author)

All tests passing. Merging!

@danielholanda merged commit e9af9dc into main on Sep 9, 2025 (24 of 26 checks passed).
@danielholanda deleted the dholanda/flash_attention branch on October 3, 2025.

Development

Successfully merging this pull request may close these issues.

For gfx1151, llama.cpp should be built with -DGGML_HIP_ROCWMMA_FATTN=ON for a big performance boost
