Deepseek MLA Optimizations #180
Co-authored-by: Stanisław Szymczyk <[email protected]>
This hurts prompt processing (a.k.a. prefill) speed very significantly. Here is what I get for Deepseek2-Lite:
TG is indeed better, and the advantage increases with KV cache size. But if we have to wait twice as long to process the prompt, it will take quite a few generated tokens to recover the time lost in the prefill step. I think we need to either try to understand why the attention part is so much slower when processing batches of tokens and fix it, or simply wait for @fairydreaming to fix their PR.
Here is how much time is being spent in the various matrix multiplications in the attention part when processing a prompt of 8192 tokens:
And here is with this PR:
I.e., attention is 2.5X slower with the PR. In addition, I'm finding that on the main branch … Maybe this can be useful when trying to optimize.
Changed to draft. PP does seem to have regressions; I'll have direct comparisons against the old version soon (generating an iq4_k_r4 quant now). PP on main for me was 11.5 t/s for iq4_k and 9.8 t/s for iq4_k_r4 at PP512, and 9.22 t/s at PP1024 for IQ4_K.
Thank you for the op time breakdown. I was drawn to this PR for the TG benefits. It should also have been a draft because it means GGUFs wouldn't be cross-compatible, and it is still a draft in llama.cpp as well. I just want to have it here because it does optimize for a workload where TG dominates, which for R1, being a reasoning model, is often the case.
@saood06 Perhaps a good way to move forward is to add an additional architecture (…).
I'll do that. I'll still leave it as a draft while I wait to see how it progresses in llama.cpp, and until I more thoroughly evaluate how it performs at long prompt lengths vs. main.
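For illustration only (this is not the actual patch, and the name `deepseek2-mla` is hypothetical): in llama.cpp-style code, adding a separate architecture mostly comes down to extending the architecture enum and its name table, so that MLA-converted GGUFs load under a distinct architecture instead of colliding with regular deepseek2 files. A minimal sketch:

```cpp
#include <map>

// Hypothetical sketch of registering a second architecture so that
// MLA-converted GGUFs are not mistaken for regular deepseek2 ones.
enum llm_arch {
    LLM_ARCH_DEEPSEEK2,      // existing architecture
    LLM_ARCH_DEEPSEEK2_MLA,  // hypothetical new entry for MLA-converted GGUFs
    LLM_ARCH_UNKNOWN,
};

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_DEEPSEEK2,     "deepseek2"     },
    { LLM_ARCH_DEEPSEEK2_MLA, "deepseek2-mla" },  // hypothetical name
};
```

The conversion script would then also need to write the new architecture name into the GGUF metadata so that loaders can tell the two formats apart.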
So, as far as I can tell, the attention implementation in this PR leads to ~3X more multiply-adds (madds) when performing matrix multiplications. For prompt processing here we need … These figures are of course specific to the Deepseek2-Lite model. It may be different for a much larger model where rank-512 decomposition may really be "low-rank". It isn't for this model relative to the head sizes, number of heads, and hidden dimension.
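A rough reconstruction of that counting (my arithmetic, not the elided figures, and assuming Deepseek2-Lite's attention shapes: 16 heads, 128 "nope" plus 64 rope query/key dims, 128 value dim, latent rank r = 512), looking only at the per-token-pair work in the attention matmuls:

$$
\begin{aligned}
\text{main (decompressed K/V), per pair per head:}\quad & (d_{\mathrm{nope}} + d_{\mathrm{rope}}) + d_v = 192 + 128 = 320 \\
\text{this PR (latent-space attention), per pair per head:}\quad & (r + d_{\mathrm{rope}}) + r = 576 + 512 = 1088 \\
\text{ratio:}\quad & 1088 / 320 \approx 3.4
\end{aligned}
$$

which is roughly consistent with the ~3X figure above: r = 512 is simply not small compared to the per-head K/V sizes of 192 and 128.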
@ikawrakow I think applying the trick with "absorbing" matrices mentioned in the DeepSeek V2 paper should fix this; I'm working on that.
Great! Btw, I observe that …
No, it's not needed for the current version of the code; I will remove it later once I settle on a final set of weights.
@ikawrakow Unfortunately the idea of speeding things up via matrix absorption turns out to be wrong: ggml-org/llama.cpp#11446 (comment). I'm not sure why they mentioned it in the DeepSeek paper.

Regarding other possible optimizations, do you know how much work is needed to add support for multiplication of transposed matrices to ggml_mul_mat()? The problem is that I use the kv cache for multiplication first directly and then in transposed form. I got around this by storing the kv cache in both regular and transposed forms, but that doubles the amount of required memory.
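A minimal, self-contained toy example of the constraint being described here (my sketch, not code from either PR): ggml_mul_mat() takes the rows (dimension ne[0]) of both operands as the shared inner dimension, so using the same cache tensor "transposed" means either keeping a second, transposed copy of it (the doubled-memory workaround above) or materializing the transpose in the graph with ggml_cont(ggml_transpose(...)), which costs an extra copy per evaluation:

```cpp
#include "ggml.h"
#include <cstddef>

int main() {
    // small scratch context; tensor data is left uninitialized (toy example)
    ggml_init_params params = { /*mem_size =*/ (size_t) 64*1024*1024, /*mem_buffer =*/ NULL, /*no_alloc =*/ false };
    ggml_context * ctx = ggml_init(params);

    // toy shapes: latent rank r = 512, 8 cached tokens, head dim 128
    ggml_tensor * kv = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 512, 8);   // [r, n_tokens]
    ggml_tensor * q  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 512, 128); // [r, d_head]

    // direct use: both operands share ne[0] = r, no copies needed
    ggml_tensor * scores = ggml_mul_mat(ctx, kv, q);                     // [n_tokens, d_head]

    // "transposed" use: kv^T must be made contiguous first, i.e. an extra
    // copy of the cache, which is exactly the overhead discussed above
    ggml_tensor * kv_t = ggml_cont(ctx, ggml_transpose(ctx, kv));        // [n_tokens, r]
    ggml_tensor * out  = ggml_mul_mat(ctx, kv_t, scores);                // [r, d_head]

    ggml_cgraph * graph = ggml_new_graph(ctx);
    ggml_build_forward_expand(graph, out);
    ggml_graph_compute_with_ctx(ctx, graph, /*n_threads =*/ 4);

    ggml_free(ctx);
    return 0;
}
```

Native support for a transposed operand in ggml_mul_mat() would avoid both the extra copy and the duplicated cache, which is what the question above is asking about.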
I took a look at the Deepseek-R1 model (out of my league memory- and disk-space-wise), so even there rank-512 cannot really be considered "low-rank" (and it seems it is rank-1536 for …).

Concerning multiplication of transposed matrices in …: I think one should make Flash Attention work with different K and V head sizes. I did a quick attempt, but it doesn't look like I found all the places where ….

Out of curiosity, did you ever try this repository with your Epyc CPU?
Sure, I checked it a while ago (before the optimization work). Regular llama.cpp:
ik_llama.cpp:
Generation was ~4.6% faster, while prompt processing was ~90% faster. Impressive!
10 t/s TG for Deepseek-R1 - wow! PP should be ~50% faster now for ….

I'm playing with Deepseek-Lite and I'm finding that the CUDA performance is pretty bad - 3500 t/s for PP-512 and 142 t/s for TG-128 on an RTX-4080. This is for …
I ran batched-bench at batch size 1 with TG at 32 and various PP values to show PP and TG performance at different context lengths. Batched-bench numbers are noisy because it does not use repetitions the way llama-bench does, and this model on this machine seems to have some variance, but all data is shown after dropping the caches and running the model until it is fully in the page cache. IQ4_K_R4 with this PR:
IQ4_K_R4 on main:
Looking at the 8K context results, PP does drop from 5.89 to 4.05 t/s, but TG jumps from 0.74 to 2.00 t/s. At q8_0 (results below) PP again drops, from 6.06 to 4.03 t/s, but TG benefits, going from 0.99 to 1.94 t/s. I would test/run this model at even higher context, but I would either need a smaller quant or to use RPC (for reference, the F16/F16 KV cache at n_ctx 8224 is 40,233.55 MiB).

More runs, with q8_0 and q6_0 K cache tested as well:

PR with q6_0 K cache:
PR with q8_0 K cache:
Second run of PR without K cache quantization:
main with q6_0 K cache:
main with q8_0 K cache:
If that happened, it would also have the benefit of allowing V cache quantization (not sure why FA is needed for that), which this model could really benefit from: its current implementation uses as much cache space as MHA (see the rough numbers after this comment). A proper MLA implementation would take up far less space.
Other people have reported poor performance for the larger Deepseek models as well, with TG at 10-14 t/s (although with an IQ1-based quant) even when fully offloaded to datacenter GPUs, and around the same performance on a 192GB Mac.
Partial offload is reported to benefit from this (ggml-org/llama.cpp#11397), and it is something I plan to test/use.
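For a rough sense of the cache-size point in this comment (my arithmetic, assuming Deepseek-R1's published attention shapes: 128 heads, K/V head sizes of 192/128, latent rank 512 plus 64 rope dims), the per-token, per-layer cache footprint works out to:

$$
\begin{aligned}
\text{MHA-style cache (current implementation):}\quad & n_{\mathrm{head}} \, (d_k + d_v) = 128 \cdot (192 + 128) = 40960 \text{ values} \\
\text{MLA latent cache:}\quad & r + d_{\mathrm{rope}} = 512 + 64 = 576 \text{ values}
\end{aligned}
$$

i.e. roughly a 70X difference, which is why a proper MLA cache would take so much less space.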
Because without FA the V cache is stored transposed, which does not work with quantized types.
I just made Deepseek-Lite also work on my Mac (M2-Max). I get TG-128 = 70 t/s on the CPU using …
This is something that I kind of intuitively expected. The whole point of DeepSeek MLA is to reduce KV cache memory size by storing the "compressed" latent representation of the KV vectors, but we still have to perform additional calculations to "decompress" them and use them to calculate attention scores and the attention output.
This is superseded by #188. Closing.
Just saw your linked post. I see you have a slightly faster prompt processing speed, but what I'm confused about is why, when I have everything on the GPU apart from the 3 sets of non-shared experts' tensors, batch processing is hardly gaining anything, e.g.:
Can you try this fork without MLA, together with this PR: #200, which adds FA support? This should be the fastest prompt processing you can do. Fairydreaming reported 50 tok/s on his system with this fork, without MLA and without FA, plus more optimizations: #180 (comment). If you want to try MLA, just use the -mla flag, which will turn MLA on.
Thanks - I will do, but it will probably be a couple of days due to running another experiment. |
Very direct port of ggml-org/llama.cpp#11446
Tested working with Q4_K_S on a dual-socket Xeon E5-2690 v3; performance compared with llama.cpp below.
Tests in: #180 (comment)
This PR also contains things I missed in convert_hf_to_gguf.py in my last PR.
@ikawrakow
Is there any chance to convert old imatrix files (such as this) to include the components you get from splitting kv_b? I'm not sure how impactful missing them would be; right now it obviously prints "did not find weights for attn_k_b.weight/attn_v_b.weight". I do not have the capability to generate new imatrix.dat files, and it would be nice if that weren't needed, as it is quite resource-intensive to do.