UPSTREAM PR #16977: vulkan: fuse rms_norm + mul + rope (+ view + set_rows) by DajanaV · Pull Request #98 · auroralabs-loci/llama.cpp

DajanaV · 2025-11-05T20:37:51Z

This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.

Improves token-generation throughput by a couple of percent on models where the fusion applies.
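
For readers unfamiliar with the op chain, the following is a minimal CPU-side sketch of what the fused sequence computes. It is not the Vulkan shader from this PR. The function names, the `eps` and `theta_base` defaults, the adjacent-pair RoPE layout, and the list-of-rows "cache" standing in for view + set_rows are all assumptions made for illustration.

```python
import math

def rms_norm(x, eps=1e-6):
    # Scale x by the reciprocal of its root-mean-square (eps is an assumed default).
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def mul(x, w):
    # Elementwise multiply by the norm weight.
    return [a * b for a, b in zip(x, w)]

def rope(x, pos, theta_base=10000.0):
    # Rotate adjacent pairs (x[i], x[i+1]); other pair layouts exist.
    n = len(x)
    out = [0.0] * n
    for i in range(0, n, 2):
        angle = pos * theta_base ** (-i / n)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def unfused(x, w, pos, cache, row):
    # Three separate passes with intermediate buffers, then a row write.
    t = rms_norm(x)
    t = mul(t, w)
    t = rope(t, pos)
    cache[row][:] = t  # stands in for view + set_rows

def fused(x, w, pos, cache, row, eps=1e-6, theta_base=10000.0):
    # One pass: normalize, weight, rotate, and write straight to the destination row.
    n = len(x)
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    dst = cache[row]
    for i in range(0, n, 2):
        a = x[i] * scale * w[i]
        b = x[i + 1] * scale * w[i + 1]
        angle = pos * theta_base ** (-i / n)
        c, s = math.cos(angle), math.sin(angle)
        dst[i] = a * c - b * s
        dst[i + 1] = a * s + b * c
```

In this sketch the fused version produces the same row as the unfused one while skipping the intermediate lists, which is the general motivation for operator fusion: fewer dispatches and less round-tripping of intermediates through memory.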

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.45 ± 11.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        271.92 ± 3.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        273.37 ± 1.46 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.13 ± 1.33 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.37 ± 1.21 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        134.55 ± 3.63 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.55 ± 0.32 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.51 ± 0.51 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.49 ± 0.43 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.57 ± 0.33 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       472.71 ± 38.19 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       478.29 ± 10.31 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       468.20 ± 16.25 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        483.64 ± 2.21 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        484.80 ± 2.20 |

build: 1ae74882f (6913)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.94 ± 17.36 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        275.08 ± 2.82 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.42 ± 1.60 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.48 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        277.64 ± 0.90 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.12 ± 3.41 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.25 ± 3.01 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.57 ± 0.47 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.33 ± 0.56 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.56 ± 0.53 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       482.35 ± 47.67 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       488.51 ± 11.51 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        487.90 ± 6.93 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        494.89 ± 4.11 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        496.09 ± 5.04 |

build: b74de9b7b (6915)
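
To put a number on the before/after difference, one option is to average the five tg128 results per model and compare the means. The sketch below does that with the t/s values copied from the tables above; the dictionary keys and the `gain_percent` helper are hypothetical names, and averaging the five rows is just one reasonable way to summarize them.

```python
from statistics import mean

# tg128 tokens/s copied from the llama-bench tables above (five runs each).
before = {
    "qwen3moe 30B.A3B": [269.45, 271.92, 273.37, 274.13, 274.37],
    "qwen3 14B": [134.55, 133.55, 133.51, 133.49, 133.57],
    "bailingmoe2 16B.A1B": [472.71, 478.29, 468.20, 483.64, 484.80],
}
after = {
    "qwen3moe 30B.A3B": [269.94, 275.08, 276.42, 276.48, 277.64],
    "qwen3 14B": [137.12, 137.25, 138.57, 138.33, 138.56],
    "bailingmoe2 16B.A1B": [482.35, 488.51, 487.90, 494.89, 496.09],
}

def gain_percent(old, new):
    # Relative change of the mean throughput, in percent.
    return (mean(new) / mean(old) - 1.0) * 100.0

gains = {name: gain_percent(before[name], after[name]) for name in before}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

Computed this way, the gains land at roughly 1% to 3% depending on the model, consistent with "a couple of percent". The first run of each set has a noticeably larger standard deviation, so dropping it would shift the figures slightly.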

loci-review · 2025-11-05T21:10:03Z

Performance Analysis Summary

Overview

Analysis of version 33bbf0f4-26d4-4602-97dd-38ed3f5d6d85 compared to baseline b43f2432-b966-4c75-8c68-cb69d4ca588c reveals minimal performance impact with changes isolated to non-critical auxiliary functions.

Key Findings

Performance Metrics

• Highest Response Time Change: linenoiseBeep function (+0.17%, 76 ns → 76 ns)
• Highest Throughput Change: linenoiseBeep function (+0.21%, 61 ns → 61 ns)
• Core Function Impact: No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)

Inference Performance Impact

• Tokens per Second: No measurable impact expected as core tokenization/inference functions remain unchanged
• Critical Path Analysis: Performance changes are isolated to terminal interface operations, not model processing pipelines
• Function Scope: Changes affect only auxiliary UI functions, not the core LLM inference engine

Power Consumption Analysis

• Overall Change: Negligible across all binaries (< 0.001%)
• Total Estimated Power: ~1.77 millijoules across all binaries
• Impacted Binaries: Minor fluctuations in llama-cvector-generator and llama-tts within measurement precision
• Core Binaries: libllama.so, libggml-cpu.so, libggml-base.so show zero measurable power consumption change

Technical Analysis

• Flame Graph Insights: linenoiseBeep shows 81.3% internal processing overhead with consistent 7 ns system call durations, indicating stable kernel interface performance
• CFG Comparison: Identical assembly code generation between versions, suggesting micro-architectural timing effects rather than algorithmic changes
• Root Cause: Performance variation likely stems from binary layout or instruction cache alignment differences

GitHub Code Review

• PR #98 Focus: Vulkan GPU acceleration optimizations for neural network operations (RMS norm + multiplication + ROPE fusion)
• Scope Disconnect: The measured performance changes in linenoiseBeep are unrelated to the PR's GPU compute optimizations
• Implementation Quality: PR demonstrates sophisticated GPU optimization; the PR's own llama-bench runs show roughly 1-3% higher tg128 throughput on the tested models

The analysis indicates stable performance with no impact on core inference capabilities.

vulkan: fuse rms_norm + mul + rope (+ view + set_rows)

16b7301

DajanaV temporarily deployed to PROD__AL_DEMO with GitHub Actions (inactive) · November 5, 2025 20:37

DajanaV force-pushed the main branch from 6f3825c to 60ff545 · November 5, 2025 21:07

DajanaV force-pushed the main branch 26 times, most recently from aa2fc28 to 0ad40ce · November 9, 2025 17:06

DajanaV force-pushed the main branch 30 times, most recently from e97d4a6 to 29827de · November 15, 2025 10:08

DajanaV wants to merge 1 commit into main from upstream-PR16977-branch_jeffbolznv-rmsnorm_rope_fusion
