vulkan: add Flash Attention support for BFloat16 KV cache. by 0cc4m · Pull Request #23420 · ggml-org/llama.cpp

0cc4m · 2026-05-20T13:39:21Z

Overview

This PR adds Flash Attention (FA) support to the Vulkan backend for a symmetric bfloat16 KV cache: FA is supported only when both K and V are stored in bfloat16. Because there is no general arithmetic support for bfloat16, the non-coopmat path falls back to the scalar float32 path.
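To illustrate the idea behind a float32 fallback, here is a minimal sketch, not the shader code from this PR. It relies only on the fact that bfloat16 is the upper 16 bits of an IEEE-754 float32. The function names `bf16_bits_to_f32`, `f32_to_bf16_bits` and `dot_q_k` are hypothetical and exist only for this example. Python floats are double precision, so the accumulation here is wider than a real float32 path would be.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Widen a raw 16-bit bfloat16 pattern to a float value.

    bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa
    bits of a float32, so widening is a 16-bit left shift followed by a
    reinterpretation of the resulting 32-bit pattern.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16_bits(value: float) -> int:
    """Narrow a float32 value to a bfloat16 bit pattern by truncation.

    Assumption for illustration: the low 16 bits are simply dropped.
    Production converters typically round instead of truncating.
    """
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


def dot_q_k(q: list[float], k_bits: list[int]) -> float:
    """Hypothetical scalar fallback: widen each bf16 K element, then
    multiply-accumulate in ordinary floating point."""
    return sum(qi * bf16_bits_to_f32(kb) for qi, kb in zip(q, k_bits))
```

For example, `dot_q_k([1.0, 2.0], [0x3F80, 0x4000])` widens the two K elements to 1.0 and 2.0 before accumulating. Widening is exact, so the only precision loss is the one already incurred when the cache was written in bfloat16.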

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, Claude wrote the changes, I reviewed and tested them

…3420)
* vulkan: add flash attention bf16 kv support
* vulkan: bf16 FA coopmat1 support
* vulkan: bf16 FA coopmat2 support
* fix FA bf16 f32 fallback
* fix FA bf16 coopmat1 shader
* fix FA bf16 coopmat2 shader
* code cleanup
* cleanup comment change
* address feedback
* add O_TYPE for cm2 FA
* use O_TYPE for gqaStore function
* reduce BFLOAT16 ifdefs

0cc4m requested a review from a team as a code owner May 20, 2026 13:39

0cc4m changed the title ~~vulkan: add support for BFloat16 KV cache.~~ vulkan: add Flash Attention support for BFloat16 KV cache. May 20, 2026

jeffbolznv reviewed May 20, 2026

github-actions Bot added the labels Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) May 20, 2026

jeffbolznv reviewed May 21, 2026

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp

Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm1.comp Outdated

Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp

jeffbolznv approved these changes May 26, 2026

0cc4m added 12 commits May 28, 2026 13:44

vulkan: add flash attention bf16 kv support

b97a6c4

vulkan: bf16 FA coopmat1 support

8f486e9

vulkan: bf16 FA coopmat2 support

bce8d9e

fix FA bf16 f32 fallback

58d2e83

fix FA bf16 coopmat1 shader

53356b4

fix FA bf16 coopmat2 shader

65a14bf

code cleanup

1383a83

cleanup comment change

65e493e

address feedback

141c739

add O_TYPE for cm2 FA

51063b1

use O_TYPE for gqaStore function

1b71ecb

reduce BFLOAT16 ifdefs

48f0c0a

0cc4m force-pushed the 0cc4m/vulkan-fa-bf16 branch from a8595b4 to 48f0c0a Compare May 28, 2026 12:05

CISC approved these changes May 28, 2026

0cc4m merged commit 6e093b8 into master May 30, 2026
34 checks passed

0cc4m deleted the 0cc4m/vulkan-fa-bf16 branch May 30, 2026 08:39
