llama: fuse QKV weights into single matmul for LLaMA models (ROCm/HIP)#19481
JoursBleu wants to merge 1 commit into ggml-org:master
Conversation
Concatenate the `wq`/`wk`/`wv` weight matrices into a single `wqkv` tensor at model load time. During inference, perform one `MUL_MAT` instead of three separate Q/K/V matmuls per layer, reducing kernel launch overhead. Only enabled for `LLM_ARCH_LLAMA` and `LLM_ARCH_LLAMA_EMBED`. Falls back gracefully to the original separate Q/K/V path if fusion fails or weights are missing. Supports QKV bias (e.g. for Llama variants with bias).

Changes:
- `src/llama-model.cpp`: post-load QKV weight concatenation (Q+K+V -> QKV)
- `src/models/llama.cpp`: fused QKV inference path with `view_3d` split

Benchmark (Llama 2 7B Q4_0, AMD Radeon AI PRO R9700 gfx1201):
- Master: 95.05 tok/s (tg128)
- QKV fusion: 99.31 tok/s (tg128) -> +4.5%
- pp512: no regression
Hi @CISC, could you share the reason for closing this PR? I'd like to understand if there are specific concerns so I can address them or take a different direction. Thanks!
As such, your previous PR made more sense (we already have plans in that direction for all models, though maybe not at load time), but this PR, gated for HIP only, makes no sense.
Hi @CISC, thanks for the explanation! The reason I proposed the HIP-specific approach was that #16991 (stream-based concurrency) superseded #16813 (fused QKV), which led me to believe the project preferred stream concurrency over weight fusion for CUDA. However, HIP Graph replay with multi-stream event synchronization has a 14× overhead on `hipGraphLaunch`. Given your feedback, should I reopen #19477 (the non-gated version) instead?
Summary
Fuse Q/K/V weight matrices into a single concatenated QKV tensor at model load time for LLaMA architectures on ROCm/HIP, reducing 3 `mul_mat` kernel launches to 1 per transformer layer during inference.

Motivation
On NVIDIA GPUs, PR #16991 uses CUDA stream concurrency to overlap Q/K/V matmuls via CUDA Graphs with multi-stream event synchronization. However, on AMD GPUs (ROCm/HIP), HIP Graph replay with multi-stream event synchronization incurs a 14× overhead on `hipGraphLaunch` (0.178 ms → 2.498 ms), making stream-based concurrency counterproductive.

QKV weight fusion provides an alternative approach for ROCm: instead of parallelizing 3 kernel launches, it eliminates 2 of them entirely by concatenating the weight matrices and using a single larger `mul_mat`.

Changes
`src/llama-model.cpp`: when `GGML_USE_HIP` is defined and the architecture is `LLM_ARCH_LLAMA` or `LLM_ARCH_LLAMA_EMBED`:
- concatenates `wq`, `wk`, `wv` row-wise into a single `wqkv` tensor (`[n_embd, n_out_q + n_out_k + n_out_v]`)
- allocates the fused tensor in a new `ggml_backend_buffer` and stores it in the `layer.wqkv` field

`src/models/llama.cpp`: under `#if defined(GGML_USE_HIP)`:
- if `layer.wqkv` exists, performs a single `ggml_mul_mat` followed by `ggml_view_3d` to split the Q/K/V outputs
- otherwise falls back to the original `build_lora_mm(wq/wk/wv)` path

Scope
All changes are gated behind `#if defined(GGML_USE_HIP)`; zero impact on CUDA, Metal, CPU, or other backends. Adds a `wqkv` pointer in `llama_layer`.

Benchmark
Hardware: AMD Radeon AI PRO R9700 (gfx1201, RDNA4), ROCm 6.3
Model: Llama 2 7B Q4_0, single GPU, `tg128`, HIP Graphs enabled