MLA: allow Q8_0 K-cache for MLA #206
Merged
After PR #205 we have two KV caches left when using MLA:

- `kv_l` - contiguous, not transposed
- `kvt_l` - a transposed version of `kv_l`

`kv_l` can be quantized, and this PR adds the necessary changes. `kvt_l`, being a transposed version of `kv_l`, cannot be quantized. It can be eliminated by setting `MLA_USE_TRANSPOSED_CACHE` to 0 in `llama.cpp` (but then `kv_l` cannot be quantized either, as making a contiguous transposed tensor out of a quantized tensor, which is needed during inference, does not work at this point).
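To see why only the contiguous cache can be quantized, here is a minimal, self-contained sketch of the Q8_0 layout. This is not the actual ggml/llama.cpp code (ggml's `block_q8_0` stores the scale as fp16; the names and float scale here are simplified), but it shows that quantization works on contiguous 32-element row blocks, while a transposed view would need one element from each 34-byte block:

```cpp
// Illustrative sketch of Q8_0 row-block quantization (simplified, not ggml code).
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <vector>

constexpr int QK8_0 = 32;            // elements per Q8_0 block

struct BlockQ8_0 {
    float  d;                        // scale (fp16 in ggml, float here for simplicity)
    int8_t qs[QK8_0];                // quantized values
};

// Quantize one contiguous row of n floats (n a multiple of 32).
static std::vector<BlockQ8_0> quantize_row_q8_0(const float * x, int n) {
    std::vector<BlockQ8_0> out(n / QK8_0);
    for (int ib = 0; ib < n / QK8_0; ++ib) {
        float amax = 0.f;
        for (int j = 0; j < QK8_0; ++j) amax = std::fmax(amax, std::fabs(x[ib*QK8_0 + j]));
        const float d  = amax / 127.f;
        const float id = d ? 1.f/d : 0.f;
        out[ib].d = d;
        for (int j = 0; j < QK8_0; ++j) out[ib].qs[j] = (int8_t)std::lround(x[ib*QK8_0 + j]*id);
    }
    return out;
}

// Dequantize back to floats - only meaningful for a contiguous row.
static void dequantize_row_q8_0(const std::vector<BlockQ8_0> & blocks, float * y) {
    for (size_t ib = 0; ib < blocks.size(); ++ib)
        for (int j = 0; j < QK8_0; ++j)
            y[ib*QK8_0 + j] = blocks[ib].d * blocks[ib].qs[j];
}

int main() {
    const int n = 64;
    std::vector<float> row(n), back(n);
    for (int i = 0; i < n; ++i) row[i] = std::sin(0.1f*i);

    const auto q = quantize_row_q8_0(row.data(), n);
    dequantize_row_q8_0(q, back.data());

    double max_err = 0;
    for (int i = 0; i < n; ++i) max_err = std::fmax(max_err, std::fabs(row[i] - back[i]));
    printf("max abs error after Q8_0 round trip: %g\n", max_err);
    // Reading "column i of every row" out of such blocks touches one element per
    // 34-byte block, so a quantized transposed cache is not a simple view - it
    // would need a dequantize + repack step, which is not implemented yet.
    return 0;
}
```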
Apart from reducing the required KV cache memory, a quantized `kv_l` cache can also slightly improve TG performance after a long prompt. Here is a comparison between the main branch and this PR for `tg64@ppN` at different prompt lengths `N`. The model is `IQ4_XS`-quantized DeepSeek-Lite. The main branch results use `fp16` for the `kv_l` and `kvt_l` caches; the PR uses `Q8_0` for `kv_l` and `bf16` for `kvt_l` (`bf16` only makes sense on a CPU with native support, such as the Ryzen-7950X used to run the benchmark).
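For a rough sense of the memory side, Q8_0 stores 32 int8 values plus one fp16 scale per block (34 bytes per 32 elements, i.e. 1.0625 bytes/element) versus 2 bytes/element for `fp16`. The element count in the sketch below is a made-up placeholder, not DeepSeek-Lite's actual `kv_l` size:

```cpp
// Back-of-the-envelope saving from a Q8_0 kv_l cache (placeholder numbers).
#include <cstdio>

int main() {
    const double fp16_bytes_per_elem = 2.0;
    const double q8_0_bytes_per_elem = 34.0 / 32.0;   // 1.0625

    const double n_elems = 1e9;                       // hypothetical kv_l element count
    printf("fp16 : %.2f GiB\n", n_elems*fp16_bytes_per_elem/(1024.0*1024*1024));
    printf("Q8_0 : %.2f GiB\n", n_elems*q8_0_bytes_per_elem/(1024.0*1024*1024));
    printf("saving on kv_l: %.1f%%\n", 100.0*(1.0 - q8_0_bytes_per_elem/fp16_bytes_per_elem));
    return 0;
}
```

This works out to roughly a 47% reduction for the `kv_l` part of the cache; the overall saving depends on whether `kvt_l` is kept.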