Vulkan: flash attention for DeepSeek models #584
Merged
This PR is a cherry-pick of PR 14509 in mainline `llama.cpp` with minor adaptations, and adds FA for the DeepSeek models to the Vulkan back-end.

Caveats:
* When running `perplexity` with default parameters, where context is set to 512 tokens while batch size is 2048 tokens, one gets NaNs after the first context chunk. I have spent the better part of the day trying to understand the reason, and just don't see it. I am almost prepared to give a bounty to the person who finds the bug.
* The KV cache must be `fp16`, as I have not implemented the various additions required to make quantized cache work with DeepSeek models in the Vulkan back-end (quantized KV cache can of course be used with models that do not use MLA).
I have tested with DeepSeek-V2-Lite on an RTX-4080 GPU with coopmat2 enabled. We are starting to see more significant performance gains compared to mainline `llama.cpp`, as illustrated in the following two graphs. The first graph shows PP-2048 performance as a function of the number of tokens in the KV cache, `N_KV`. Surprisingly, we don't see significant performance gains from `mla = 3` compared to `mla = 1` as we do with CUDA (see below). Nevertheless, at 32k tokens `ik_llama.cpp` is about 40% faster than `llama.cpp`.

The next graph compares TG performance as a function of `N_KV`. Here the performance gains compared to mainline are even greater, with `ik_llama.cpp` nearly 2X faster than `llama.cpp` for a context of 32k tokens.
Before you get too excited about these results, a reminder that the Vulkan back-end does not yet implement the fused MoE `ffn_up+ffn_gate` op, so it is still far behind CUDA. The next two graphs compare PP and TG performance as a function of `N_KV` on the same RTX-4080 GPU.
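For readers not familiar with the op being referenced: below is a minimal CPU-side sketch of what fusing `ffn_up` and `ffn_gate` amounts to, i.e. computing the gate and up projections and applying the gating activation in a single pass rather than as three separate ops. The SiLU activation and the plain-C++ formulation are illustrative assumptions; the actual fused op is a back-end GPU kernel, and this sketch only shows what it computes.

```cpp
// Illustrative sketch (not the actual kernel) of a fused ffn_up+ffn_gate op
// for one token row x of an FFN (or one MoE expert). The unfused path runs
// two separate mat-muls and then an element-wise multiply; the fused version
// produces silu(x * W_gate) * (x * W_up) in a single pass over the weights.
// SiLU as the activation is an assumption for illustration.
#include <cmath>
#include <cstddef>
#include <vector>

static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// x: [n_embd]; w_gate, w_up: [n_ff x n_embd], row-major; out: [n_ff]
void fused_up_gate(const std::vector<float> & x,
                   const std::vector<float> & w_gate,
                   const std::vector<float> & w_up,
                   std::vector<float> & out,
                   int n_embd, int n_ff) {
    out.resize(n_ff);
    for (int i = 0; i < n_ff; ++i) {
        float g = 0.0f, u = 0.0f;
        for (int j = 0; j < n_embd; ++j) {
            g += w_gate[(size_t)i*n_embd + j] * x[j]; // gate projection
            u += w_up  [(size_t)i*n_embd + j] * x[j]; // up projection
        }
        out[i] = silu(g) * u; // gating applied in the same pass, no intermediate tensors
    }
}
```

The point of the fusion is to avoid materializing the two intermediate `[n_ff]` tensors and the extra kernel launches of the unfused graph; since the Vulkan back-end still runs the unfused path, it remains behind CUDA here.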