
llama: fuse QKV weights into single matmul for LLaMA models (ROCm/HIP)#19481

Closed
JoursBleu wants to merge 1 commit into ggml-org:master from JoursBleu:pr/qkv-fusion

Conversation

@JoursBleu
Contributor

Summary

Fuse Q/K/V weight matrices into a single concatenated QKV tensor at model load time for LLaMA architectures on ROCm/HIP, reducing 3 mul_mat kernel launches to 1 per transformer layer during inference.

Motivation

On NVIDIA GPUs, PR #16991 uses CUDA stream concurrency to overlap Q/K/V matmuls via CUDA Graphs with multi-stream event synchronization. However, on AMD GPUs (ROCm/HIP), HIP Graph replay with multi-stream event synchronization incurs 14× overhead on hipGraphLaunch (0.178ms → 2.498ms), making stream-based concurrency counterproductive.

QKV weight fusion provides an alternative approach for ROCm: instead of parallelizing 3 kernel launches, it eliminates 2 of them entirely by concatenating the weight matrices and using a single larger mul_mat.
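The equivalence this relies on can be checked in plain C++ (this is an illustrative sketch, not ggml code; the `matvec` helper is hypothetical): stacking the Q/K/V weight rows and running one matrix-vector product yields exactly the concatenation of the three separate products, so fusion changes only the kernel launch count, not the results.

```cpp
#include <cassert>
#include <vector>

// Row-major matrix-vector product: y[r] = W[r] . x
// (stand-in for one mul_mat kernel launch)
static std::vector<float> matvec(const std::vector<float> &W, int rows, int cols,
                                 const std::vector<float> &x) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            y[r] += W[r * cols + c] * x[c];
        }
    }
    return y;
}
```

With `Wqkv = [Wq; Wk; Wv]` stacked row-wise, `matvec(Wqkv, ...)` produces `[q; k; v]` in one call instead of three, and the Q/K/V slices are recovered by offset.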

Changes

src/llama-model.cpp

  • After tensor loading, if GGML_USE_HIP is defined and the architecture is LLM_ARCH_LLAMA or LLM_ARCH_LLAMA_EMBED:
    • Concatenates wq, wk, wv row-wise into a single wqkv tensor ([n_embd, n_out_q + n_out_k + n_out_v])
    • Copies the fused tensor to device memory via a new ggml_backend_buffer
    • Stores the result in the existing layer.wqkv field
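The load-time concatenation step can be sketched as follows (a plain C++ sketch, not the actual ggml tensor code; the `fuse_qkv` helper and its float-buffer interface are assumptions for illustration). Row-wise concatenation of contiguous row-major weights is three block copies into one buffer:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Fuse three row-major weight matrices
//   Wq [nq x n_embd], Wk [nk x n_embd], Wv [nv x n_embd]
// into one contiguous Wqkv [(nq + nk + nv) x n_embd] buffer,
// done once at model load time.
static std::vector<float> fuse_qkv(const std::vector<float> &wq,
                                   const std::vector<float> &wk,
                                   const std::vector<float> &wv) {
    std::vector<float> wqkv(wq.size() + wk.size() + wv.size());
    float *dst = wqkv.data();
    std::memcpy(dst, wq.data(), wq.size() * sizeof(float)); dst += wq.size();
    std::memcpy(dst, wk.data(), wk.size() * sizeof(float)); dst += wk.size();
    std::memcpy(dst, wv.data(), wv.size() * sizeof(float));
    return wqkv;
}
```

In the PR the fused buffer is then uploaded to device memory via a new `ggml_backend_buffer` and stored in the existing `layer.wqkv` field.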

src/models/llama.cpp

  • Guarded by #if defined(GGML_USE_HIP):
    • If layer.wqkv exists, performs a single ggml_mul_mat followed by ggml_view_3d to split Q/K/V outputs
    • Otherwise falls through to the original separate build_lora_mm(wq/wk/wv) path
  • The original path is always compiled and serves as the fallback on all platforms
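The dispatch structure described above might look roughly like this (a pseudocode sketch, not the actual patch; the view arguments are elided, and `build_lora_mm`, `ggml_mul_mat`, and `ggml_view_3d` are the names used in the PR description):

```cpp
#if defined(GGML_USE_HIP)
    if (layer.wqkv != nullptr) {
        // fused path: one matmul, then Q/K/V recovered as views into the result
        ggml_tensor * qkv = ggml_mul_mat(ctx0, layer.wqkv, cur);
        Qcur = ggml_view_3d(ctx0, qkv, /* Q slice ... */);
        Kcur = ggml_view_3d(ctx0, qkv, /* K slice ... */);
        Vcur = ggml_view_3d(ctx0, qkv, /* V slice ... */);
    } else
#endif
    {
        // original path: always compiled, serves as fallback on all platforms
        Qcur = build_lora_mm(layer.wq, cur);
        Kcur = build_lora_mm(layer.wk, cur);
        Vcur = build_lora_mm(layer.wv, cur);
    }
```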

Scope

  • ROCm/HIP only: All new code is wrapped in #if defined(GGML_USE_HIP). Zero impact on CUDA, Metal, CPU, or other backends.
  • Architecture: Currently limited to LLaMA / LLaMA-Embed models.
  • No new struct fields: Reuses the existing wqkv pointer in llama_layer.

Benchmark

Hardware: AMD Radeon AI PRO R9700 (gfx1201, RDNA4), ROCm 6.3
Model: Llama 2 7B Q4_0, single GPU, tg128, HIP Graphs enabled

| Configuration        | tok/s  | vs Master |
|----------------------|--------|-----------|
| Master baseline      | 95.00  |           |
| QKV Fusion (this PR) | 100.85 | +6.2%     |

Concatenate wq/wk/wv weight matrices into a single wqkv tensor at model
load time. During inference, perform one MUL_MAT instead of three separate
Q/K/V matmuls per layer, reducing kernel launch overhead.

Only enabled for LLM_ARCH_LLAMA and LLM_ARCH_LLAMA_EMBED. Falls back
gracefully to the original separate Q/K/V path if fusion fails or weights
are missing. Supports QKV bias (e.g. for Llama variants with bias).

Changes:
- src/llama-model.cpp: post-load QKV weight concatenation (Q+K+V -> QKV)
- src/models/llama.cpp: fused QKV inference path with view_3d split

Benchmark (Llama 2 7B Q4_0, AMD Radeon AI PRO R9700 gfx1201):
  Master:     95.05 tok/s (tg128)
  QKV fusion: 99.31 tok/s (tg128) -> +4.5%
  pp512: no regression
@JoursBleu JoursBleu requested a review from CISC as a code owner February 10, 2026 08:35
@CISC CISC closed this Feb 10, 2026
@JoursBleu
Contributor Author

Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!

@CISC
Member

CISC commented Feb 10, 2026

> Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!

As such your previous PR made more sense (we already have plans in that direction for all models, though may not be at load), but this PR gated for HIP only makes no sense.

@JoursBleu
Contributor Author

> > Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!
>
> As such your previous PR made more sense (we already have plans in that direction for all models, though may not be at load), but this PR gated for HIP only makes no sense.

Hi @CISC, thanks for the explanation!

The reason I proposed the HIP-specific approach was that #16991 (stream-based concurrency) superseded #16813 (fused QKV), which led me to believe the project preferred stream concurrency over weight fusion for CUDA. However, HIP Graph replay with multi-stream event synchronization has a 14× overhead on hipGraphLaunch, making stream concurrency counterproductive on AMD GPUs. So I gated the fusion for HIP only as an alternative.

Given your feedback, should I reopen #19477 (the non-gated version) instead?

@CISC
Member

CISC commented Feb 10, 2026

> Given your feedback, should I reopen #19477 (the non-gated version) instead?

I don't think that is necessary at this point, as mentioned an alternative approach is currently planned, read the thread in #19139
