
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) #16977

Merged
0cc4m merged 1 commit into ggml-org:master from jeffbolznv:rmsnorm_rope_fusion on Nov 8, 2025
Conversation

@jeffbolznv
Contributor

This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.

It helps by a couple of percent on models where it applies.
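For reference, here is a minimal Python sketch of the math being fused, not the actual Vulkan shader: an RMS norm followed by an elementwise multiply with the norm weight, then a rotary position embedding, with the result written straight into a destination row (the view + set_rows part). The function names and the simple pairwise RoPE variant are illustrative assumptions, not llama.cpp API.

```python
import math

def rms_norm_mul(x, w, eps=1e-6):
    # RMSNorm followed by an elementwise multiply with the weight w
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * wi for v, wi in zip(x, w)]

def rope(x, pos, theta_base=10000.0):
    # Simple pairwise RoPE: rotate (x[2i], x[2i+1]) by a position-dependent angle
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = pos * theta_base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def fused_rms_norm_mul_rope_set_rows(x, w, pos, cache, row):
    # Unfused, this is several ops with intermediate tensors; fused, it is
    # one pass whose output lands directly in the destination row.
    cache[row] = rope(rms_norm_mul(x, w), pos)
    return cache[row]
```

The win from fusing is avoiding the intermediate tensor writes and kernel launches between these steps, which matters most for small token-generation batches.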

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.45 ± 11.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        271.92 ± 3.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        273.37 ± 1.46 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.13 ± 1.33 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.37 ± 1.21 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        134.55 ± 3.63 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.55 ± 0.32 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.51 ± 0.51 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.49 ± 0.43 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.57 ± 0.33 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       472.71 ± 38.19 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       478.29 ± 10.31 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       468.20 ± 16.25 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        483.64 ± 2.21 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        484.80 ± 2.20 |

build: 1ae74882f (6913)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.94 ± 17.36 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        275.08 ± 2.82 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.42 ± 1.60 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.48 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        277.64 ± 0.90 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.12 ± 3.41 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.25 ± 3.01 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.57 ± 0.47 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.33 ± 0.56 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.56 ± 0.53 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       482.35 ± 47.67 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       488.51 ± 11.51 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        487.90 ± 6.93 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        494.89 ± 4.11 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        496.09 ± 5.04 |

build: b74de9b7b (6915)
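Conceptually, the fusion described above is a pattern match over the op sequence in the compute graph. The following is a hypothetical sketch of such detection (not the actual ggml-vulkan fusion logic, which also has to verify shapes, types, and that intermediates have no other consumers):

```python
# Hypothetical sketch: find runs of ops that could be fused into one shader,
# assuming each op's output feeds only the next op in the chain.
FUSABLE_CHAIN = ["RMS_NORM", "MUL", "ROPE", "VIEW", "SET_ROWS"]

def find_fusable_chains(ops):
    """ops: list of op-name strings in graph order.
    Returns start indices where the whole chain matches."""
    hits = []
    i = 0
    while i + len(FUSABLE_CHAIN) <= len(ops):
        if ops[i:i + len(FUSABLE_CHAIN)] == FUSABLE_CHAIN:
            hits.append(i)
            i += len(FUSABLE_CHAIN)  # consume the fused run
        else:
            i += 1
    return hits
```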

@github-actions bot added the testing, Vulkan, and ggml labels on Nov 3, 2025
@0cc4m left a comment

LGTM

@0cc4m merged commit b4e335d into ggml-org:master on Nov 8, 2025
66 of 71 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request on Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp on Feb 5, 2026