@ikawrakow
Owner

IQ4_KS_R4 and IQ5_KS_R4 can use the more efficient block-of-32 MMQ GEMM kernels, so it makes sense to have an MMQ implementation for them.

Sweep bench for LLaMA-3-8B on RTX-4080

Main branch IQ4_KS_R4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 2048 | 512 | 0 | 0.347 | 5910.02 | 4.052 | 126.34 |
| 2048 | 512 | 2048 | 0.325 | 6301.19 | 4.172 | 122.74 |
| 2048 | 512 | 4096 | 0.350 | 5848.94 | 4.417 | 115.92 |
| 2048 | 512 | 6144 | 0.378 | 5421.23 | 4.641 | 110.33 |
| 2048 | 512 | 8192 | 0.405 | 5052.95 | 4.863 | 105.28 |
| 2048 | 512 | 10240 | 0.432 | 4742.63 | 5.116 | 100.08 |
| 2048 | 512 | 12288 | 0.459 | 4459.86 | 5.302 | 96.57 |
| 2048 | 512 | 14336 | 0.486 | 4212.60 | 5.562 | 92.05 |

PR IQ4_KS_R4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 2048 | 512 | 0 | 0.281 | 7277.25 | 3.943 | 129.86 |
| 2048 | 512 | 2048 | 0.307 | 6674.86 | 4.159 | 123.12 |
| 2048 | 512 | 4096 | 0.335 | 6119.79 | 4.419 | 115.86 |
| 2048 | 512 | 6144 | 0.360 | 5681.17 | 4.648 | 110.16 |
| 2048 | 512 | 8192 | 0.389 | 5263.35 | 4.865 | 105.23 |
| 2048 | 512 | 10240 | 0.416 | 4927.52 | 5.118 | 100.05 |
| 2048 | 512 | 12288 | 0.443 | 4620.54 | 5.302 | 96.57 |
| 2048 | 512 | 14336 | 0.473 | 4330.80 | 5.557 | 92.14 |

Main branch IQ5_KS_R4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 2048 | 512 | 0 | 0.338 | 6052.15 | 4.674 | 109.55 |
| 2048 | 512 | 2048 | 0.326 | 6272.90 | 4.892 | 104.66 |
| 2048 | 512 | 4096 | 0.353 | 5800.11 | 5.149 | 99.43 |
| 2048 | 512 | 6144 | 0.380 | 5387.80 | 5.379 | 95.18 |
| 2048 | 512 | 8192 | 0.406 | 5041.40 | 5.597 | 91.48 |
| 2048 | 512 | 10240 | 0.434 | 4720.43 | 5.854 | 87.47 |
| 2048 | 512 | 12288 | 0.460 | 4451.96 | 6.037 | 84.81 |
| 2048 | 512 | 14336 | 0.489 | 4188.39 | 6.289 | 81.41 |

PR IQ5_KS_R4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 2048 | 512 | 0 | 0.288 | 7118.13 | 4.669 | 109.66 |
| 2048 | 512 | 2048 | 0.313 | 6538.56 | 4.890 | 104.71 |
| 2048 | 512 | 4096 | 0.339 | 6034.98 | 5.149 | 99.44 |
| 2048 | 512 | 6144 | 0.368 | 5570.61 | 5.389 | 95.01 |
| 2048 | 512 | 8192 | 0.394 | 5193.25 | 5.619 | 91.12 |
| 2048 | 512 | 10240 | 0.422 | 4848.53 | 5.862 | 87.35 |
| 2048 | 512 | 12288 | 0.449 | 4562.94 | 6.045 | 84.70 |
| 2048 | 512 | 14336 | 0.479 | 4271.15 | 6.297 | 81.30 |
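
For readers who want the prompt-processing gain as a single ratio per context length, here is a minimal Python sketch that divides the PR's S_PP by the main branch's S_PP. The numbers are copied from the four tables above; the helper name `pp_speedup` and the dictionary layout are illustrative only and are not part of the repository.

```python
# S_PP (prompt-processing tokens/s) copied from the sweep-bench tables above.
# Each list is ordered by N_KV, matching the n_kv list below.
n_kv = [0, 2048, 4096, 6144, 8192, 10240, 12288, 14336]

s_pp = {
    "IQ4_KS_R4": {
        "main": [5910.02, 6301.19, 5848.94, 5421.23, 5052.95, 4742.63, 4459.86, 4212.60],
        "pr":   [7277.25, 6674.86, 6119.79, 5681.17, 5263.35, 4927.52, 4620.54, 4330.80],
    },
    "IQ5_KS_R4": {
        "main": [6052.15, 6272.90, 5800.11, 5387.80, 5041.40, 4720.43, 4451.96, 4188.39],
        "pr":   [7118.13, 6538.56, 6034.98, 5570.61, 5193.25, 4848.53, 4562.94, 4271.15],
    },
}


def pp_speedup(quant):
    """Return the PR/main S_PP ratio for each N_KV value (hypothetical helper)."""
    rows = s_pp[quant]
    return [pr / main for pr, main in zip(rows["pr"], rows["main"])]


for quant in s_pp:
    for kv, ratio in zip(n_kv, pp_speedup(quant)):
        print(f"{quant}  N_KV={kv:5d}  PP speedup = {ratio:.3f}x")
```

At N_KV = 0 this works out to roughly 1.23x for IQ4_KS_R4 and 1.18x for IQ5_KS_R4, and the ratio stays above 1 at every measured context length.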

@ikawrakow ikawrakow merged commit 8ffad18 into main Jun 5, 2025