
Conversation


@saood06 (Collaborator) commented Feb 8, 2025

@ikawrakow

This PR contains the following:

  • A commit from fairydreaming that is intended to improve PP (prompt processing) performance
  • A change that avoids allocating the MHA KV cache when running in MLA mode
  • A gguf-py change I originally missed

I will follow up with:

  • Loading all of the MoE experts during warmup. This can go into this PR if you want, or into a separate one. It is a very large QoL feature for large MoE models: without it the model is loaded in slowly as it is used; with it, the model is loaded immediately and at a faster rate.
  • The mmap-based KV cache buffer. It is functional, but I have yet to expose it as a CLI option.

@saood06 saood06 mentioned this pull request Feb 8, 2025
@ikawrakow ikawrakow self-requested a review February 9, 2025 07:33

@ikawrakow (Owner) left a comment


Looks good. I added a minor change to check that wk_b and wv_b are available before turning on MLA (so we don't crash if someone uses an old model and asks for MLA).

PP-4096 for Q8_0_R8 quantized DeepSeek-Lite with -mla goes up to 292 t/s from 275 t/s with this change.
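
A minimal sketch of the kind of guard described above (the struct and function names are illustrative assumptions, not the actual ik_llama.cpp code):

```cpp
#include <cstdio>
#include <vector>

// Sketch only: enable MLA just when the decomposed attention tensors
// (wk_b/wv_b) are present in the loaded model; otherwise fall back to
// standard attention instead of dereferencing a missing tensor later.
struct ggml_tensor;  // opaque, from ggml

struct sketch_layer { ggml_tensor * wk_b; ggml_tensor * wv_b; };  // assumed layout
struct sketch_model { std::vector<sketch_layer> layers; };        // assumed layout

static bool mla_supported(const sketch_model & model) {
    for (const auto & layer : model.layers) {
        if (!layer.wk_b || !layer.wv_b) {
            return false;  // old GGUF conversion: tensors missing
        }
    }
    return true;
}

static bool decide_mla(bool mla_requested, const sketch_model & model) {
    if (mla_requested && !mla_supported(model)) {
        std::fprintf(stderr, "MLA requested but wk_b/wv_b not found in the model; using standard attention\n");
        return false;
    }
    return mla_requested;
}
```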

@ikawrakow ikawrakow merged commit d58dee8 into ik/mla Feb 9, 2025
ikawrakow added a commit that referenced this pull request Feb 9, 2025
* Deepseek MLA Optimizations

Co-authored-by: Stanisław Szymczyk <[email protected]>

* Make MLA optional

* Remove some unnecessary copies in the MLA attention

* Deepseek MLA Optimizations V2 (#195)

* Avoid allocating MHA KV cache when MLA is turned on

* Added missing gguf-py file

* Added final optimizations

Co-authored-by: Stanisław Szymczyk <[email protected]>

* Make sure we do have wk_b and wv_b before enabling MLA

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Iwan Kawrakow <[email protected]>

* Use type_k and type_v to set the types of the MLA caches

They were hard-coded to f16. On my Ryzen-7950X, which has native
bf16 support, I get a fairly significant PP performance boost with
a bf16 KV cache: PP-4096 = 320 t/s, up from 292 t/s with the f16
KV cache.
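
A minimal sketch of what honoring type_k/type_v for the MLA caches might look like (the cache struct, field names, and dimension names are assumptions for illustration; ggml_new_tensor_2d is the regular ggml tensor allocator):

```cpp
#include <cstdint>
#include <vector>
#include "ggml.h"

// Sketch only: the cache struct and field names below are assumptions used
// to illustrate the change, not the actual ik_llama.cpp definitions.
struct mla_kv_cache {
    std::vector<ggml_tensor *> kv_l;   // per layer: compressed latent KV + RoPE'd key part
    std::vector<ggml_tensor *> kvt_l;  // per layer: transposed latent cache for the value path
};

// Allocate the MLA caches with the user-selected cache types (type_k/type_v,
// e.g. GGML_TYPE_BF16) instead of hard-coding GGML_TYPE_F16.
static void mla_cache_init(mla_kv_cache & cache, ggml_context * ctx,
        ggml_type type_k, ggml_type type_v, int n_layer,
        int64_t kv_lora_rank, int64_t n_embd_head_qk_rope, int64_t kv_size) {
    for (int il = 0; il < n_layer; ++il) {
        cache.kv_l .push_back(ggml_new_tensor_2d(ctx, type_k, kv_lora_rank + n_embd_head_qk_rope, kv_size));
        cache.kvt_l.push_back(ggml_new_tensor_2d(ctx, type_v, kv_lora_rank, kv_size));
    }
}
```

With something like -ctk bf16 -ctv bf16 on a CPU with native bf16 support, a change along these lines is what the bf16 numbers above correspond to.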

* Better gemm strategy when nth > nhead

It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead, the heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we find the gcd of nhead and
nth and split the threads into groups of nth/gcd threads,
with each group processing nhead/gcd heads.
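
A small sketch of the grouping arithmetic described above (variable and struct names are illustrative, not the actual ggml compute code):

```cpp
#include <numeric>  // std::gcd (C++17)

// Sketch of the thread-grouping idea: with nth threads and nhead attention
// heads, split the threads into gcd(nhead, nth) groups of nth/gcd threads;
// each group works through nhead/gcd heads, so different groups process
// different heads in parallel instead of all nth threads sharing one head.
struct head_assignment {
    int ith_g;  // this thread's index within its group
    int nth_g;  // number of threads in the group
    int first;  // first head assigned to the group
    int count;  // how many heads the group processes
};

static head_assignment assign_heads(int ith, int nth, int nhead) {
    const int g     = std::gcd(nhead, nth);
    const int nth_g = nth   / g;    // threads per group
    const int per_g = nhead / g;    // heads per group
    const int group = ith / nth_g;  // which group this thread belongs to
    return { ith % nth_g, nth_g, group * per_g, per_g };
}
```

For example, with nth = 32 and nhead = 16 the gcd is 16, so 16 groups of 2 threads each take one head, instead of all 32 threads working through the 16 heads one at a time.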

---------

Co-authored-by: Saood Karim <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Iwan Kawrakow <[email protected]>
