Enable faster prompt processing with mainline llama.cpp GGUFs #409
Conversation
Else they don't get run-time repacked.
Testing this PR (on top of the #405 and #408 PRs), here's a complete log when loading DeepSeek V3 0324 Q2_K_XL. Notably, I had to reduce CUDA 2 by one layer (compared to #405 (comment)), as CUDA 2 was now hitting OOM. I noticed the compute buffers are ~3.3 GB each, instead of 2 GB and 400 MB respectively, despite using the `-fa` flag with `-mla 3`. I measured about a 15% improvement in PP t/s over the #405 PR, which means about 21% faster PP than mainline llama.cpp (and roughly a 400% improvement (no joke lol) over ik_llama.cpp without the #405 PR). Testing with `-mla 2`, the compute buffers are ~3.4 GB as well vs `-mla 3` with `-fa`. Here it got a small perf improvement (109 t/s PP vs 106 t/s PP).

EDIT: I noticed that with this PR we have to specify `-mla 1` to make the compute buffers smaller, as it doesn't automatically change it from 0 to 1.
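A minimal sketch of the kind of invocation being discussed here (the binary name and model path are placeholders, not the exact settings behind the numbers above):

```bash
# Placeholder invocation: binary name and model path are illustrative only.
./bin/llama-server -m DeepSeek-V3-0324-Q2_K_XL.gguf -fa -mla 3

# With this PR, pass -mla 1 explicitly to get the smaller compute buffers;
# it is no longer switched from 0 to 1 automatically.
./bin/llama-server -m DeepSeek-V3-0324-Q2_K_XL.gguf -fa -mla 1
```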
The compute buffers become larger because one needs extra buffers for the transformed cache. If you are running out of VRAM, you can reduce the compute buffer size via the usual command-line options. The extra ~1 GiB in model size is for the newly created `attn_wkv_b` tensors.
Mainline llama.cpp PR 12901, which added MLA support for DeepSeek models 2.5 months after MLA was available here, broke backwards compatibility. As a result, the new DeepSeek GGUFs that started appearing on HF became incompatible with ik_llama.cpp, so I added support for the incompatible GGUFs in #394. But using such a crippled DeepSeek GGUF results in much lower prompt processing performance, because the `attn_wkv_b` tensor is missing, so one cannot use `mla = 3`.

This PR removes that limitation. When `-mla 0`, `-mla 2`, or `-mla 3` is specified on the command line, the missing `attn_wkv_b` tensors are created on-the-fly while loading the model. This is basically the reverse of #259, where the `attn_wk_b` and `attn_wv_b` tensors necessary for MLA were computed from the `attn_wkv_b` tensors in the original DeepSeek GGUFs.

To show why this is useful, the following graph compares PP performance between the main branch and this PR, measured with the `sweep-bench` command. The model is a mainline llama.cpp DeepSeek-Lite GGUF with the `attn_wkv_b` tensors missing. In that case the `mla = 3` parameter will be converted to `mla = 1` on the main branch, but will trigger the generation of the `attn_wkv_b` tensors with this PR (so `mla = 3` can be used). The model is quantized with `Q4_0`; the GPU is an RTX-4080. The x-axis is `N_KV/1000`, where `N_KV` is the number of tokens in the KV cache. I have used a logarithmic scale for the y-axis to better show the growing difference in performance with increasing `N_KV`.
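The exact `sweep-bench` invocation is not reproduced above. As a rough sketch only (the binary path, model file name, context size, layer count, and thread count below are assumptions, not the actual benchmark settings):

```bash
# Illustrative sweep-bench run with flash attention and mla = 3;
# all paths and numeric values are placeholders.
./bin/llama-sweep-bench -m deepseek-lite-q4_0.gguf -mla 3 -fa -c 16384 -ngl 100 -t 16
```

On the main branch the same command falls back to `mla = 1` because `attn_wkv_b` is missing; with this PR the tensors are created at load time and `mla = 3` is actually used.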