
Conversation

@ikawrakow
Owner

This PR is a cherry-pick of PR 14509 in mainline llama.cpp, with minor adaptations, and adds flash attention (FA) for the DeepSeek models to the Vulkan back-end.

Caveats

  • The batch size cannot be greater than the maximum context length. Under normal usage this is never the case, but if one runs perplexity with its default parameters, where the context is set to 512 tokens while the batch size is 2048 tokens, one gets NaNs after the first context chunk (see the sketch after this list). I have spent the better part of the day trying to understand the reason, and just don't see it. I'm almost prepared to give a bounty to the person who finds the bug.
  • For now the KV cache can only be fp16, as I have not implemented the various additions required to make a quantized cache work with DeepSeek models in the Vulkan back-end (a quantized KV cache can of course still be used with models that do not use MLA).
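To make the first caveat concrete, here is a minimal sketch of a perplexity run that stays within the constraints: the batch size is capped at the context size and the KV cache is left at its default fp16 type. The binary name, flag spellings, and file paths below are my assumptions based on the usual llama.cpp conventions, not something prescribed by this PR.

```sh
# Sketch only: model and data paths are placeholders.
# Keep the batch size (-b) no larger than the context size (-c) so the DeepSeek FA path
# does not produce NaNs, and leave the KV cache at its default fp16 type
# (i.e. do not request a quantized type via -ctk/-ctv).
./bin/llama-perplexity \
    -m /models/DeepSeek-V2-Lite-Q4_K_M.gguf \
    -f wiki.test.raw \
    -c 2048 -b 2048 \
    -fa -mla 3 \
    -ngl 100
```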

I have tested with DeepSeek-V2-Lite on an RTX-4080 GPU with coopmat2 enabled. We are starting to see more significant performance gains compared to mainline llama.cpp, as illustrated in the following two graphs. The first graph shows PP-2048 performance as a function of the number of tokens in the KV cache, N_KV. Surprisingly, we don't see significant performance gains from mla = 3 compared to mla = 1 as we do with CUDA (see below). Nevertheless, at 32k tokens ik_llama.cpp is about 40% faster than llama.cpp.

vulkan_dsl2_pp

The next graph compares TG performance as a function of N_KV. Here the performance gains compared to mainline are even greater, with ik_llama.cpp nearly 2X faster than llama.cpp at a context of 32k tokens.

vulkan_dsl2_tg
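For readers who want to reproduce sweeps like the two above, the sketch below shows the kind of invocation I would expect to produce PP/TG numbers as a function of N_KV. The llama-sweep-bench tool and the exact flags are assumptions about the typical ik_llama.cpp benchmarking workflow; this PR does not spell out the command that was used.

```sh
# Assumed benchmark sketch (model path is a placeholder): sweep the KV cache up to 32k tokens,
# processing 2048-token chunks (PP-2048), with flash attention enabled and MLA mode 3.
./bin/llama-sweep-bench \
    -m /models/DeepSeek-V2-Lite-Q4_K_M.gguf \
    -c 32768 -ub 2048 \
    -fa -mla 3 \
    -ngl 100
```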

Before you get too excited about these results, a reminder that the Vulkan back-end does not yet implement the fused MoE ffn_up+ffn_gate op, so it is still far behind CUDA. The next two graphs compare Vulkan and CUDA PP and TG performance as a function of N_KV on the same RTX-4080 GPU.

vulkan_dsl2_vs_cuda_pp
vulkan_dsl2_vs_cuda_tg

jeffbolznv and others added 2 commits July 4, 2025 11:11
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes