
@ikawrakow (Owner)

This FA tweak improves DeepSeek-Lite CPU TG performance with Q8_0 KV cache.

Not sure if it will have a positive impact for the large DeepSeek models. To optimize the FA strategy for those I need to be able to test, which I cannot do atm.

The graph shows a comparison between the main branch and this PR for a Q4_0-quantized DeepSeek-Lite model. The CPU is a Ryzen-7950X. The x-axis is N_KV/1000, where N_KV is the number of tokens in the K cache, which is quantized with Q8_0. The sweep-bench command was

```
./bin/llama-sweep-bench -m $model -c 16384 -ub 1024 -t 16 -mla 3 -fmoe -fa -rtr
```

[Graph: S_TG (t/s) vs N_KV/1000, main branch vs this PR]

Main branch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 1.488 | 688.02 | 7.112 | 35.99 |
| 1024 | 256 | 1024 | 1.674 | 611.73 | 7.361 | 34.78 |
| 1024 | 256 | 2048 | 1.788 | 572.75 | 7.524 | 34.02 |
| 1024 | 256 | 3072 | 1.951 | 524.97 | 7.728 | 33.13 |
| 1024 | 256 | 4096 | 2.104 | 486.65 | 7.927 | 32.29 |
| 1024 | 256 | 5120 | 2.276 | 449.93 | 8.152 | 31.40 |
| 1024 | 256 | 6144 | 2.483 | 412.40 | 8.441 | 30.33 |
| 1024 | 256 | 7168 | 2.841 | 360.45 | 8.795 | 29.11 |
| 1024 | 256 | 8192 | 2.794 | 366.55 | 9.294 | 27.54 |
| 1024 | 256 | 9216 | 2.974 | 344.36 | 9.142 | 28.00 |
| 1024 | 256 | 10240 | 3.130 | 327.15 | 9.404 | 27.22 |
| 1024 | 256 | 11264 | 3.328 | 307.69 | 9.654 | 26.52 |
| 1024 | 256 | 12288 | 3.499 | 292.67 | 10.078 | 25.40 |
| 1024 | 256 | 13312 | 3.840 | 266.70 | 10.536 | 24.30 |
| 1024 | 256 | 14336 | 3.886 | 263.53 | 10.969 | 23.34 |
| 1024 | 256 | 15360 | 4.055 | 252.52 | 11.430 | 22.40 |
PR

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 1.469 | 696.86 | 7.126 | 35.93 |
| 1024 | 256 | 1024 | 1.601 | 639.65 | 7.322 | 34.96 |
| 1024 | 256 | 2048 | 1.759 | 582.03 | 7.446 | 34.38 |
| 1024 | 256 | 3072 | 1.920 | 533.47 | 7.673 | 33.36 |
| 1024 | 256 | 4096 | 2.081 | 491.98 | 7.728 | 33.13 |
| 1024 | 256 | 5120 | 2.282 | 448.64 | 7.852 | 32.60 |
| 1024 | 256 | 6144 | 2.413 | 424.33 | 7.991 | 32.04 |
| 1024 | 256 | 7168 | 2.626 | 389.95 | 8.122 | 31.52 |
| 1024 | 256 | 8192 | 2.753 | 372.02 | 8.238 | 31.08 |
| 1024 | 256 | 9216 | 2.934 | 348.97 | 8.394 | 30.50 |
| 1024 | 256 | 10240 | 3.159 | 324.17 | 8.538 | 29.98 |
| 1024 | 256 | 11264 | 3.299 | 310.44 | 8.668 | 29.53 |
| 1024 | 256 | 12288 | 3.501 | 292.47 | 8.818 | 29.03 |
| 1024 | 256 | 13312 | 3.684 | 277.98 | 8.969 | 28.54 |
| 1024 | 256 | 14336 | 4.074 | 251.37 | 9.089 | 28.16 |
| 1024 | 256 | 15360 | 4.086 | 250.63 | 9.167 | 27.93 |
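
For reference, here is a small sketch (not part of the PR; the S_TG values are simply copied from the two tables above) that computes the TG speedup of this PR over the main branch at each context depth:

```python
# Sketch: TG speedup of this PR over main, from the sweep-bench tables above.
# N_KV grows in steps of 1024; values are S_TG in t/s.
main_tg = [35.99, 34.78, 34.02, 33.13, 32.29, 31.40, 30.33, 29.11,
           27.54, 28.00, 27.22, 26.52, 25.40, 24.30, 23.34, 22.40]
pr_tg   = [35.93, 34.96, 34.38, 33.36, 33.13, 32.60, 32.04, 31.52,
           31.08, 30.50, 29.98, 29.53, 29.03, 28.54, 28.16, 27.93]

for i, (main, pr) in enumerate(zip(main_tg, pr_tg)):
    print(f"N_KV = {i * 1024:5d}: {pr / main:.3f}x")
```

At zero context the two branches are within noise of each other, but the gap grows steadily with context: by N_KV = 15360 the PR is about 1.25X faster for TG.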

ikawrakow merged commit 553c08b into main on May 13, 2025
@saood06 (Collaborator) commented May 20, 2025

I did end up doing a fresh build, dropping caches, and launching the server, and I have used it up to 32K tokens (double the context where I normally test sweep-bench). My informal results are that it is about the same, maybe a little better. I don't see the same large improvement that seems to scale with context size that you do.

I may run a full sweep-bench later to get a better comparison. I only ran it at very low context just to validate that the model was warmed up and running at normal speeds (I usually do this before launching the server), and it performed about the same.

@ikawrakow (Owner, Author)

> I don't see the same large improvement that seems to scale with context size that you do.

There is something different about the big siblings of DeepSeek-Lite that I haven't understood yet. For one, IIRC your TG performance drops 3X when you go to 16k tokens, while in my case it is still at ~60% of its zero-context value even before this PR. The self-attention part per layer in DeepSeek-V3/R1 is 8X that of DeepSeek-Lite (128 instead of 16 heads, for otherwise identical tensor dimensions). The FFN part is about 7X per layer (7168 x 2048 vs 2048 x 1408, and 8 active experts instead of 6), so I don't really see a reason why it should behave differently. If anything, with -mla 3 -fa my expectation would be that the big model's TG performance decreases less with context size, as the K cache is smaller relative to the amount of computation that needs to get done. So I guess it is somehow related to NUMA, and the big model is bottlenecked on that when computing self-attention. If so, then yes, you probably will not see a (significant) performance improvement.
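
For what it's worth, a quick back-of-the-envelope check of these ratios (a sketch using only the tensor dimensions quoted above, not re-derived from the model configs):

```python
# Per-layer cost ratios of DeepSeek-V3/R1 vs DeepSeek-Lite, using the
# figures quoted above (assumed, not re-checked against the model configs).
attn_ratio = 128 / 16                        # heads; tensor dims otherwise identical -> 8x

ffn_big    = 7168 * 2048                     # per-expert FFN weight shape, V3/R1
ffn_lite   = 2048 * 1408                     # per-expert FFN weight shape, Lite
ffn_ratio  = (ffn_big / ffn_lite) * (8 / 6)  # 8 vs 6 active experts -> ~6.8x

print(f"self-attention: {attn_ratio:.1f}x, FFN: {ffn_ratio:.1f}x per layer")
```

Both parts scale by a similar factor, which is why one would expect the Lite model's behavior with context size to carry over to the big models.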

@saood06 (Collaborator) commented May 20, 2025

> > I don't see the same large improvement that seems to scale with context size that you do.
>
> So I guess it is somehow related to NUMA, and the big model is bottlenecked on that when computing self-attention. If so, then yes, you probably will not see a (significant) performance improvement.

I'm not sure, because it has a good local hit rate on TG; see #201 (comment)

@ikawrakow (Owner, Author)

> I'm not sure, because it has a good local hit rate on TG; see #201 (comment)

The high local TG hit rate is measured at what context?

@saood06 (Collaborator) commented May 20, 2025

> > I'm not sure, because it has a good local hit rate on TG; see #201 (comment)
>
> The high local TG hit rate is measured at what context?

32k
