MLA: allow Q8_0 K-cache for MLA #206
Merged
After PR #205 we have two KV caches left when using MLA:

- `kv_l` - contiguous, not transposed
- `kvt_l` - a transposed version of `kv_l`

`kv_l` can be quantized, and this PR adds the necessary changes. `kvt_l`, being a transposed version of `kv_l`, cannot be quantized. It can be eliminated by setting `MLA_USE_TRANSPOSED_CACHE` to 0 in `llama.cpp` (but then `kv_l` cannot be quantized either, as making a contiguous transposed tensor out of a quantized tensor, which is needed during inference, does not work at this point).
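To see why only the contiguous cache can be quantized, here is a minimal, self-contained sketch of the Q8_0 layout. This is not the actual ggml/llama.cpp code (ggml's `block_q8_0` stores the scale as fp16; the names and float scale here are simplified), but it shows that quantization works on contiguous 32-element row blocks, while a transposed view would need one element from each 34-byte block:

```cpp
// Illustrative sketch of Q8_0 row-block quantization (simplified, not ggml code).
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <vector>

constexpr int QK8_0 = 32;            // elements per Q8_0 block

struct BlockQ8_0 {
    float  d;                        // scale (fp16 in ggml, float here for simplicity)
    int8_t qs[QK8_0];                // quantized values
};

// Quantize one contiguous row of n floats (n a multiple of 32).
static std::vector<BlockQ8_0> quantize_row_q8_0(const float * x, int n) {
    std::vector<BlockQ8_0> out(n / QK8_0);
    for (int ib = 0; ib < n / QK8_0; ++ib) {
        float amax = 0.f;
        for (int j = 0; j < QK8_0; ++j) amax = std::fmax(amax, std::fabs(x[ib*QK8_0 + j]));
        const float d  = amax / 127.f;
        const float id = d ? 1.f/d : 0.f;
        out[ib].d = d;
        for (int j = 0; j < QK8_0; ++j) out[ib].qs[j] = (int8_t)std::lround(x[ib*QK8_0 + j]*id);
    }
    return out;
}

// Dequantize back to floats - only meaningful for a contiguous row.
static void dequantize_row_q8_0(const std::vector<BlockQ8_0> & blocks, float * y) {
    for (size_t ib = 0; ib < blocks.size(); ++ib)
        for (int j = 0; j < QK8_0; ++j)
            y[ib*QK8_0 + j] = blocks[ib].d * blocks[ib].qs[j];
}

int main() {
    const int n = 64;
    std::vector<float> row(n), back(n);
    for (int i = 0; i < n; ++i) row[i] = std::sin(0.1f*i);

    const auto q = quantize_row_q8_0(row.data(), n);
    dequantize_row_q8_0(q, back.data());

    double max_err = 0;
    for (int i = 0; i < n; ++i) max_err = std::fmax(max_err, std::fabs(row[i] - back[i]));
    printf("max abs error after Q8_0 round trip: %g\n", max_err);
    // Reading "column i of every row" out of such blocks touches one element per
    // 34-byte block, so a quantized transposed cache is not a simple view - it
    // would need a dequantize + repack step, which is not implemented yet.
    return 0;
}
```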
Apart from reducing the required KV cache memory, a quantized `kv_l` cache can also slightly improve TG performance after a long prompt. Here is a comparison between the main branch and this PR for `tg64@ppN` at different prompt lengths `N`. The model is `IQ4_XS`-quantized DeepSeek-Lite. The main branch results use `fp16` for the `kv_l` and `kvt_l` caches; the PR uses `Q8_0` for `kv_l` and `bf16` for `kvt_l` (`bf16` only makes sense on a CPU with native support, such as the Ryzen-7950X used to run the benchmark).
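For a rough sense of the memory side, Q8_0 stores 32 int8 values plus one fp16 scale per block (34 bytes per 32 elements, i.e. 1.0625 bytes/element) versus 2 bytes/element for `fp16`. The element count in the sketch below is a made-up placeholder, not DeepSeek-Lite's actual `kv_l` size:

```cpp
// Back-of-the-envelope saving from a Q8_0 kv_l cache (placeholder numbers).
#include <cstdio>

int main() {
    const double fp16_bytes_per_elem = 2.0;
    const double q8_0_bytes_per_elem = 34.0 / 32.0;   // 1.0625

    const double n_elems = 1e9;                       // hypothetical kv_l element count
    printf("fp16 : %.2f GiB\n", n_elems*fp16_bytes_per_elem/(1024.0*1024*1024));
    printf("Q8_0 : %.2f GiB\n", n_elems*q8_0_bytes_per_elem/(1024.0*1024*1024));
    printf("saving on kv_l: %.1f%%\n", 100.0*(1.0 - q8_0_bytes_per_elem/fp16_bytes_per_elem));
    return 0;
}
```

This works out to roughly a 47% reduction for the `kv_l` part of the cache; the overall saving depends on whether `kvt_l` is kept.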