
Conversation


@saood06 (Collaborator) commented Feb 8, 2025

@ikawrakow

This PR contains the following:

  • A commit from fairydreaming that is intended to improve PP (prompt processing) performance
  • A change that avoids allocating the MHA KV cache when running in MLA mode
  • A gguf-py change I originally missed

I will follow up with:

  • Loading all of the MoE experts during warmup. This can go into this PR if you want, or into a separate one. It is a very large QoL feature for large MoE models: without it the model is loaded in slowly as it is used; with it, the model is loaded immediately and at a faster rate.
  • The mmap-based KV cache buffer. It is functional, but I have yet to expose it as a CLI option.

@saood06 saood06 mentioned this pull request Feb 8, 2025
@ikawrakow ikawrakow self-requested a review February 9, 2025 07:33

@ikawrakow (Owner) left a comment


Looks good. I added a minor change to check that wk_b and wv_b are available before turning on MLA (so we don't crash if someone uses an old model and asks for MLA).

PP-4096 for Q8_0_R8 quantized DeepSeek-Lite with -mla goes up to 292 t/s from 275 t/s with this change.
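
A minimal sketch of the kind of guard described above (the struct and function names are illustrative assumptions, not the actual ik_llama.cpp code):

```cpp
#include <cstdio>
#include <vector>

// Sketch only: enable MLA just when the decomposed attention tensors
// (wk_b/wv_b) are present in the loaded model; otherwise fall back to
// standard attention instead of dereferencing a missing tensor later.
struct ggml_tensor;  // opaque, from ggml

struct sketch_layer { ggml_tensor * wk_b; ggml_tensor * wv_b; };  // assumed layout
struct sketch_model { std::vector<sketch_layer> layers; };        // assumed layout

static bool mla_supported(const sketch_model & model) {
    for (const auto & layer : model.layers) {
        if (!layer.wk_b || !layer.wv_b) {
            return false;  // old GGUF conversion: tensors missing
        }
    }
    return true;
}

static bool decide_mla(bool mla_requested, const sketch_model & model) {
    if (mla_requested && !mla_supported(model)) {
        std::fprintf(stderr, "MLA requested but wk_b/wv_b not found in the model; using standard attention\n");
        return false;
    }
    return mla_requested;
}
```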

@ikawrakow ikawrakow merged commit d58dee8 into ik/mla Feb 9, 2025
ikawrakow added a commit that referenced this pull request Feb 9, 2025
* Deepseek MLA Optimizations

Co-authored-by: Stanisław Szymczyk <[email protected]>

* Make MLA optional

* Remove some unnecessary copies in the MLA attention

* Deepseek MLA Optimizations V2 (#195)

* Avoid allocating MHA KV cache when MLA is turned on

* Added missing gguf-py file

* Added final optimizations

Co-authored-by: Stanisław Szymczyk <[email protected]>

* Make sure we do have wk_b and wv_b before enabling MLA

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Iwan Kawrakow <[email protected]>

* Use type_k and type_v to set the types of the MLA caches

They were hard-coded to f16. On my Ryzen-7950X, which has native
bf16 support, I get a fairly significant PP performance boost with
a bf16 KV cache: PP-4096 = 320 t/s, up from 292 t/s with the f16
KV cache.
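
A minimal sketch of what honoring type_k/type_v for the MLA caches might look like (the cache struct, field names, and dimension names are assumptions for illustration; ggml_new_tensor_2d is the regular ggml tensor allocator):

```cpp
#include <cstdint>
#include <vector>
#include "ggml.h"

// Sketch only: the cache struct and field names below are assumptions used
// to illustrate the change, not the actual ik_llama.cpp definitions.
struct mla_kv_cache {
    std::vector<ggml_tensor *> kv_l;   // per layer: compressed latent KV + RoPE'd key part
    std::vector<ggml_tensor *> kvt_l;  // per layer: transposed latent cache for the value path
};

// Allocate the MLA caches with the user-selected cache types (type_k/type_v,
// e.g. GGML_TYPE_BF16) instead of hard-coding GGML_TYPE_F16.
static void mla_cache_init(mla_kv_cache & cache, ggml_context * ctx,
        ggml_type type_k, ggml_type type_v, int n_layer,
        int64_t kv_lora_rank, int64_t n_embd_head_qk_rope, int64_t kv_size) {
    for (int il = 0; il < n_layer; ++il) {
        cache.kv_l .push_back(ggml_new_tensor_2d(ctx, type_k, kv_lora_rank + n_embd_head_qk_rope, kv_size));
        cache.kvt_l.push_back(ggml_new_tensor_2d(ctx, type_v, kv_lora_rank, kv_size));
    }
}
```

With something like -ctk bf16 -ctv bf16 on a CPU with native bf16 support, a change along these lines is what the bf16 numbers above correspond to.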

* Better gemm strategy when nth > nhead

It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead, the heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we find the gcd of nhead and
nth and split the threads into groups of nth/gcd threads,
with each group processing nhead/gcd heads.
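
A small sketch of the grouping arithmetic described above (variable and struct names are illustrative, not the actual ggml compute code):

```cpp
#include <numeric>  // std::gcd (C++17)

// Sketch of the thread-grouping idea: with nth threads and nhead attention
// heads, split the threads into gcd(nhead, nth) groups of nth/gcd threads;
// each group works through nhead/gcd heads, so different groups process
// different heads in parallel instead of all nth threads sharing one head.
struct head_assignment {
    int ith_g;  // this thread's index within its group
    int nth_g;  // number of threads in the group
    int first;  // first head assigned to the group
    int count;  // how many heads the group processes
};

static head_assignment assign_heads(int ith, int nth, int nhead) {
    const int g     = std::gcd(nhead, nth);
    const int nth_g = nth   / g;    // threads per group
    const int per_g = nhead / g;    // heads per group
    const int group = ith / nth_g;  // which group this thread belongs to
    return { ith % nth_g, nth_g, group * per_g, per_g };
}
```

For example, with nth = 32 and nhead = 16 the gcd is 16, so 16 groups of 2 threads each take one head, instead of all 32 threads working through the 16 heads one at a time.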

---------

Co-authored-by: Saood Karim <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Iwan Kawrakow <[email protected]>
