
vulkan: add Flash Attention support for BFloat16 KV cache. #23420

Merged: 0cc4m merged 12 commits into master from 0cc4m/vulkan-fa-bf16 on May 30, 2026

Conversation

0cc4m (Contributor) commented May 20, 2026

Overview

This PR adds Flash Attention (FA) support for a symmetric bfloat16 KV cache in the Vulkan backend: K and V must both be in bfloat16 format. Because there is no general arithmetic support for bfloat16, the non-coopmat path uses the scalar float32 fallback path.
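
To illustrate what a scalar float32 fallback for bfloat16 data can look like, here is a minimal CPU-side sketch in C++. It is not the shader code from this PR; the helper names (`bf16_to_f32`, `f32_to_bf16_trunc`, `dot_bf16_f32`) are hypothetical and chosen only for this example. It relies on one fact about the format: a bfloat16 value is the upper 16 bits of an IEEE-754 float32, so widening is a 16-bit shift, and all arithmetic can then be done in float32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// bfloat16 is the upper 16 bits of a float32, so widening is a left shift.
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);  // bit-cast without aliasing issues
    return f;
}

// Narrowing by truncation (drops the low 16 bits, no rounding).
// Illustration only; production code usually rounds to nearest even.
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical scalar fallback: widen each bfloat16 element of k to
// float32 and accumulate the dot product with q in float32.
static float dot_bf16_f32(const float *q, const uint16_t *k, size_t n) {
    assert(q != nullptr && k != nullptr);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += q[i] * bf16_to_f32(k[i]);
    }
    return acc;
}
```

The point of the sketch is that no bfloat16 arithmetic is needed on this path: values are only stored as bfloat16 and are widened to float32 before any multiply or add.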

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, Claude wrote the changes, I reviewed and tested them

0cc4m requested a review from a team as a code owner May 20, 2026 13:39
0cc4m changed the title from "vulkan: add support for BFloat16 KV cache." to "vulkan: add Flash Attention support for BFloat16 KV cache." May 20, 2026
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp
Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp
Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp Outdated
Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl Outdated
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels May 20, 2026
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp
Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm1.comp Outdated
Comment thread ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp
0cc4m force-pushed the 0cc4m/vulkan-fa-bf16 branch from a8595b4 to 48f0c0a May 28, 2026 12:05
0cc4m merged commit 6e093b8 into master May 30, 2026
34 checks passed
0cc4m deleted the 0cc4m/vulkan-fa-bf16 branch May 30, 2026 08:39
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
…3420)

* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
o7si added a commit to o7si/llama.cpp that referenced this pull request May 31, 2026
…wercase

* upstream/master: (27 commits)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ui: fix ETag truncation with MSVC compiler (ggml-org#23917)
  docs : update ZenDNN docs for Q8 support (ggml-org#23791)
  llama: only use one iGPU device by default (ggml-org#23897)
  webui: add custom CSS injection via config (ggml-org#23904)
  Support `-fa auto` in llama-bench (ggml-org#23714)
  opencl: support bf16 by converting to f16 (ggml-org#23839)
  ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)
  TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)
  metal : restore im2col implementation for large kernels (ggml-org#23901)
  test: (test-llama-archs) log the config name first (ggml-org#23885)
  ci : update ios-xcode release job to macos-26 (ggml-org#23906)
  ggml : add some lsx support (ggml-org#23798)
  vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420)
  ci : fix s390x release job (ggml-org#23898)
  ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895)
  llama : do not skip iGPU when only RPC devices are present (ggml-org#23868)
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  ...

# Conflicts:
#	gguf-py/gguf/vocab.py
#	src/llama-vocab.cpp
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

Labels

  • ggml: changes relating to the ggml tensor library for machine learning
  • Vulkan: issues specific to the Vulkan backend


3 participants