UPSTREAM PR #19281: vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. #1145
Conversation
Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing).
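The preprocessing idea can be sketched on the host side as a single scan that reduces each mask block to a 2-bit code. This is an illustrative C++ sketch, not the PR's actual Vulkan/GLSL code; the enum values, function name, and block layout are assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical 2-bit codes, one per mask block. The FA kernel can branch on
// the code instead of re-reading the mask for the two common cases.
enum MaskCode : uint8_t {
    MASK_ALL_ZERO    = 0, // mask adds nothing; skip loading the block
    MASK_ALL_NEG_INF = 1, // block is fully masked out; skip the whole tile
    MASK_MIXED       = 2, // must load and apply the mask element-wise
};

// Scan the mask once, emitting one code per block of `block_size` elements.
std::vector<uint8_t> classify_mask_blocks(const std::vector<float> &mask,
                                          size_t block_size) {
    std::vector<uint8_t> codes;
    for (size_t i = 0; i < mask.size(); i += block_size) {
        bool all_zero = true, all_neg_inf = true;
        const size_t end = std::min(i + block_size, mask.size());
        for (size_t j = i; j < end; ++j) {
            all_zero    &= (mask[j] == 0.0f);
            all_neg_inf &= (std::isinf(mask[j]) && mask[j] < 0.0f);
        }
        codes.push_back(all_zero    ? MASK_ALL_ZERO
                      : all_neg_inf ? MASK_ALL_NEG_INF
                      :               MASK_MIXED);
    }
    return codes;
}
```

The one-time cost of this scan is amortized over prompt processing, where the same large mask would otherwise be loaded repeatedly by the flash-attention shader.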
No meaningful performance changes were detected across 115426 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-bench, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-gemma3-cli, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so. 🔎 Full breakdown: Loci Inspector.
Force-pushed from 048ad94 to 6c1fde6
@loci-dev how many files were changed in this PR?
Based on the repository analysis, 8 files were changed in this PR:

Changed Files Breakdown

Modified Files (7):

Added Files (1):

What This PR Does

This is a focused optimization for the Vulkan backend's flash attention implementation. The changes add a mask preprocessing optimization that:

This optimization is particularly beneficial for long-context inference scenarios where attention masks are large and may contain patterns that can be handled more efficiently.
Force-pushed from 823244c to bab7d39
Force-pushed from c001e9f to 4a5a4c2
Force-pushed from 13648e6 to 1d064d0
Force-pushed from 551dfb5 to 55a969e
Force-pushed from 5ac00d6 to 998dd7a
Note
Source pull request: ggml-org/llama.cpp#19281
Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases.
Apply this optimization when the mask is relatively large (i.e. prompt processing).
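On the consumer side, the kernel can branch on the precomputed per-block code before touching the mask. A hedged sketch of that branch, with plain C++ standing in for the actual shader logic and all names hypothetical:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical per-element score adjustment driven by the 2-bit block code.
// code 0: block is all zero  -> mask contributes nothing, skip the load
// code 1: block is all -inf  -> the tile is fully masked out
// code 2: mixed              -> load the mask block and apply it per element
float score_with_mask(uint8_t code, float qk_score,
                      const float *mask_block, size_t idx) {
    switch (code) {
        case 0:  return qk_score;               // no mask load needed
        case 1:  return -INFINITY;              // exp(-inf) == 0: tile skipped
        default: return qk_score + mask_block[idx]; // fall back to the slow path
    }
}
```

In the real shader the all-neg-inf case would skip the whole tile's work rather than compute per-element scores; the sketch only shows why the 2-bit code lets the two common cases avoid any mask memory traffic.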