@ikawrakow (Owner) commented Jun 11, 2025

This PR is a follow-up to #515 and applies the same technique to IQ3_XXS. We see a nearly 3X increase in prompt processing (PP) speed compared to IQ3_XXS on the main branch, and about 2X compared to IQ3_XXS_R4 (2.2X at zero context, just under 2X at the largest N_KV).

Sweep-bench for pure IQ3_XXS quantization of Llama-3.1-8B on a Ryzen-7950X CPU:

IQ3_XXS, main branch

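| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 512 | 128 | 0 | 5.023 | 101.94 | 7.365 | 17.38 |
| 512 | 128 | 512 | 5.281 | 96.96 | 8.088 | 15.83 |
| 512 | 128 | 1024 | 5.170 | 99.03 | 7.977 | 16.05 |
| 512 | 128 | 1536 | 5.324 | 96.16 | 7.942 | 16.12 |
| 512 | 128 | 2048 | 5.389 | 95.02 | 8.043 | 15.91 |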

IQ3_XXS_R4, main branch

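| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 512 | 128 | 0 | 3.836 | 133.47 | 7.675 | 16.68 |
| 512 | 128 | 512 | 3.687 | 138.87 | 8.279 | 15.46 |
| 512 | 128 | 1024 | 3.805 | 134.57 | 8.245 | 15.53 |
| 512 | 128 | 1536 | 3.906 | 131.08 | 8.252 | 15.51 |
| 512 | 128 | 2048 | 4.076 | 125.61 | 8.545 | 14.98 |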

IQ3_XXS, PR

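| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 512 | 128 | 0 | 1.730 | 296.01 | 7.641 | 16.75 |
| 512 | 128 | 512 | 1.807 | 283.30 | 8.333 | 15.36 |
| 512 | 128 | 1024 | 1.896 | 269.98 | 8.070 | 15.86 |
| 512 | 128 | 1536 | 1.978 | 258.78 | 8.481 | 15.09 |
| 512 | 128 | 2048 | 2.062 | 248.32 | 8.514 | 15.03 |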
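
For readers who want to check the speedup claims against the three tables, here is a minimal Python sketch that computes the prompt-processing speedup of the PR at each N_KV. It is not part of the PR. The S_PP values are copied from the tables above; the variable and function names (`s_pp`, `speedups`) are illustrative.

```python
# Prompt-processing speed S_PP (t/s) at each N_KV, copied from the sweep-bench tables.
n_kv = [0, 512, 1024, 1536, 2048]
s_pp = {
    "IQ3_XXS, main":    [101.94, 96.96, 99.03, 96.16, 95.02],
    "IQ3_XXS_R4, main": [133.47, 138.87, 134.57, 131.08, 125.61],
    "IQ3_XXS, PR":      [296.01, 283.30, 269.98, 258.78, 248.32],
}


def speedups(new, old):
    """Element-wise ratio new/old, i.e. how many times faster `new` is."""
    return [n / o for n, o in zip(new, old)]


vs_main = speedups(s_pp["IQ3_XXS, PR"], s_pp["IQ3_XXS, main"])
vs_r4 = speedups(s_pp["IQ3_XXS, PR"], s_pp["IQ3_XXS_R4, main"])

print(f"{'N_KV':>6} {'vs IQ3_XXS':>12} {'vs IQ3_XXS_R4':>14}")
for kv, a, b in zip(n_kv, vs_main, vs_r4):
    print(f"{kv:>6} {a:>11.2f}x {b:>13.2f}x")
```

On these numbers the PR is roughly 2.6X to 2.9X faster than IQ3_XXS on main and roughly 2.0X to 2.2X faster than IQ3_XXS_R4, with the gain shrinking as N_KV grows. Token generation speed (S_TG) is about the same in all three tables.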
