
llama: fuse QKV weights into single matmul for LLaMA models (ROCm/HIP)#19481

Closed
JoursBleu wants to merge 1 commit into ggml-org:master from JoursBleu:pr/qkv-fusion

Conversation

@JoursBleu
Contributor

Summary

Fuse Q/K/V weight matrices into a single concatenated QKV tensor at model load time for LLaMA architectures on ROCm/HIP, reducing 3 mul_mat kernel launches to 1 per transformer layer during inference.

Motivation

On NVIDIA GPUs, PR #16991 uses CUDA stream concurrency to overlap Q/K/V matmuls via CUDA Graphs with multi-stream event synchronization. However, on AMD GPUs (ROCm/HIP), HIP Graph replay with multi-stream event synchronization incurs 14× overhead on hipGraphLaunch (0.178ms → 2.498ms), making stream-based concurrency counterproductive.

QKV weight fusion provides an alternative approach for ROCm: instead of parallelizing 3 kernel launches, it eliminates 2 of them entirely by concatenating the weight matrices and using a single larger mul_mat.
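The equivalence this relies on can be checked in plain C++ (this is an illustrative sketch, not ggml code; the `matvec` helper is hypothetical): stacking the Q/K/V weight rows and running one matrix-vector product yields exactly the concatenation of the three separate products, so fusion changes only the kernel launch count, not the results.

```cpp
#include <cassert>
#include <vector>

// Row-major matrix-vector product: y[r] = W[r] . x
// (stand-in for one mul_mat kernel launch)
static std::vector<float> matvec(const std::vector<float> &W, int rows, int cols,
                                 const std::vector<float> &x) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            y[r] += W[r * cols + c] * x[c];
        }
    }
    return y;
}
```

With `Wqkv = [Wq; Wk; Wv]` stacked row-wise, `matvec(Wqkv, ...)` produces `[q; k; v]` in one call instead of three, and the Q/K/V slices are recovered by offset.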

Changes

src/llama-model.cpp

  • After tensor loading, if GGML_USE_HIP is defined and the architecture is LLM_ARCH_LLAMA or LLM_ARCH_LLAMA_EMBED:
    • Concatenates wq, wk, wv row-wise into a single wqkv tensor ([n_embd, n_out_q + n_out_k + n_out_v])
    • Copies the fused tensor to device memory via a new ggml_backend_buffer
    • Stores the result in the existing layer.wqkv field
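The load-time concatenation step can be sketched as follows (a plain C++ sketch, not the actual ggml tensor code; the `fuse_qkv` helper and its float-buffer interface are assumptions for illustration). Row-wise concatenation of contiguous row-major weights is three block copies into one buffer:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Fuse three row-major weight matrices
//   Wq [nq x n_embd], Wk [nk x n_embd], Wv [nv x n_embd]
// into one contiguous Wqkv [(nq + nk + nv) x n_embd] buffer,
// done once at model load time.
static std::vector<float> fuse_qkv(const std::vector<float> &wq,
                                   const std::vector<float> &wk,
                                   const std::vector<float> &wv) {
    std::vector<float> wqkv(wq.size() + wk.size() + wv.size());
    float *dst = wqkv.data();
    std::memcpy(dst, wq.data(), wq.size() * sizeof(float)); dst += wq.size();
    std::memcpy(dst, wk.data(), wk.size() * sizeof(float)); dst += wk.size();
    std::memcpy(dst, wv.data(), wv.size() * sizeof(float));
    return wqkv;
}
```

In the PR the fused buffer is then uploaded to device memory via a new `ggml_backend_buffer` and stored in the existing `layer.wqkv` field.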

src/models/llama.cpp

  • Guarded by #if defined(GGML_USE_HIP):
    • If layer.wqkv exists, performs a single ggml_mul_mat followed by ggml_view_3d to split Q/K/V outputs
    • Otherwise falls through to the original separate build_lora_mm(wq/wk/wv) path
  • The original path is always compiled and serves as the fallback on all platforms
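The dispatch structure described above might look roughly like this (a pseudocode sketch, not the actual patch; the view arguments are elided, and `build_lora_mm`, `ggml_mul_mat`, and `ggml_view_3d` are the names used in the PR description):

```cpp
#if defined(GGML_USE_HIP)
    if (layer.wqkv != nullptr) {
        // fused path: one matmul, then Q/K/V recovered as views into the result
        ggml_tensor * qkv = ggml_mul_mat(ctx0, layer.wqkv, cur);
        Qcur = ggml_view_3d(ctx0, qkv, /* Q slice ... */);
        Kcur = ggml_view_3d(ctx0, qkv, /* K slice ... */);
        Vcur = ggml_view_3d(ctx0, qkv, /* V slice ... */);
    } else
#endif
    {
        // original path: always compiled, serves as fallback on all platforms
        Qcur = build_lora_mm(layer.wq, cur);
        Kcur = build_lora_mm(layer.wk, cur);
        Vcur = build_lora_mm(layer.wv, cur);
    }
```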

Scope

  • ROCm/HIP only: All new code is wrapped in #if defined(GGML_USE_HIP). Zero impact on CUDA, Metal, CPU, or other backends.
  • Architecture: Currently limited to LLaMA / LLaMA-Embed models.
  • No new struct fields: Reuses the existing wqkv pointer in llama_layer.

Benchmark

Hardware: AMD Radeon AI PRO R9700 (gfx1201, RDNA4), ROCm 6.3
Model: Llama 2 7B Q4_0, single GPU, tg128, HIP Graphs enabled

| Configuration        | tok/s  | vs Master |
|----------------------|--------|-----------|
| Master baseline      | 95.00  |           |
| QKV Fusion (this PR) | 100.85 | +6.2%     |

Concatenate wq/wk/wv weight matrices into a single wqkv tensor at model
load time. During inference, perform one MUL_MAT instead of three separate
Q/K/V matmuls per layer, reducing kernel launch overhead.

Only enabled for LLM_ARCH_LLAMA and LLM_ARCH_LLAMA_EMBED. Falls back
gracefully to the original separate Q/K/V path if fusion fails or weights
are missing. Supports QKV bias (e.g. for Llama variants with bias).

Changes:
- src/llama-model.cpp: post-load QKV weight concatenation (Q+K+V -> QKV)
- src/models/llama.cpp: fused QKV inference path with view_3d split

Benchmark (Llama 2 7B Q4_0, AMD Radeon AI PRO R9700 gfx1201):
  Master:     95.05 tok/s (tg128)
  QKV fusion: 99.31 tok/s (tg128) -> +4.5%
  pp512: no regression
@JoursBleu JoursBleu requested a review from CISC as a code owner February 10, 2026 08:35
@CISC CISC closed this Feb 10, 2026
@JoursBleu
Contributor Author

Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!

@CISC
Member

CISC commented Feb 10, 2026

> Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!

As such your previous PR made more sense (we already have plans in that direction for all models, though may not be at load), but this PR gated for HIP only makes no sense.

@JoursBleu
Contributor Author

> > Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!
>
> As such your previous PR made more sense (we already have plans in that direction for all models, though may not be at load), but this PR gated for HIP only makes no sense.

Hi @CISC, thanks for the explanation!

The reason I proposed the HIP-specific approach was that #16991 (stream-based concurrency) superseded #16813 (fused QKV), which led me to believe the project preferred stream concurrency over weight fusion for CUDA. However, HIP Graph replay with multi-stream event synchronization has a 14× overhead on hipGraphLaunch, making stream concurrency counterproductive on AMD GPUs. So I gated the fusion for HIP only as an alternative.

Given your feedback, should I reopen #19477 (the non-gated version) instead?

@CISC
Member

CISC commented Feb 10, 2026

> Given your feedback, should I reopen #19477 (the non-gated version) instead?

I don't think that is necessary at this point, as mentioned an alternative approach is currently planned, read the thread in #19139
