Conversation

@ikawrakow (Owner)

As in PRs #515, #516, #517.

Here is a sweep-bench with this PR for LLaMA-3.1-8B on a Ryzen-7950X CPU:

| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|----:|----:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 |    0 |  1.733 |   295.36 |  8.239 |    15.54 |
| 512 | 128 |  512 |  1.805 |   283.62 |  8.398 |    15.24 |
| 512 | 128 | 1024 |  1.857 |   275.73 |  8.561 |    14.95 |
| 512 | 128 | 1536 |  1.905 |   268.74 |  8.430 |    15.18 |
| 512 | 128 | 2048 |  1.954 |   261.97 |  8.563 |    14.95 |
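
For reference, a minimal sketch of how such numbers are collected with the sweep-bench tool; the binary path, model file name, context size, and thread count below are placeholders, not the exact command used here:

```bash
# Hypothetical invocation (flags are assumptions): sweeps the KV cache depth
# in steps, measuring prompt processing (PP) and token generation (TG) speed
# at each depth, which produces the PP/TG/N_KV table above.
./bin/llama-sweep-bench -m llama-3.1-8b.gguf -c 2560 -t 16
```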

I haven't done this for a while, but I think for this one it is worth comparing against mainline llama.cpp (build: 5635 (3069e3169)):

| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|----:|----:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 |    0 | 18.261 |    28.04 |  7.933 |    16.14 |
| 512 | 128 |  512 | 18.708 |    27.37 |  8.335 |    15.36 |
| 512 | 128 | 1024 | 19.048 |    26.88 |  8.547 |    14.98 |
| 512 | 128 | 1536 | 19.480 |    26.28 |  8.739 |    14.65 |
| 512 | 128 | 2048 | 19.670 |    26.03 |  8.912 |    14.36 |

10X faster PP here (295.36 vs. 28.04 t/s at zero KV cache depth, i.e., ~10.5X)!
