Following in the footsteps of #185, this PR adds `IQ1_M_R4`, a 4-row interleaved version of `IQ1_M`.

I have removed the `f16` super-block scale (replaced with an `f16` per-row scale) and changed the 3-bit `IQ1_M` block scales to 4 bits. Hence, we end up using the same 1.75 bpw as `IQ1_M`. `IQ1_M_R4` uses a block size of 32. I wanted to have this because DeepSeek-Lite, the model I'm testing with, has a lot of tensors with row sizes not divisible by 256, so a significant fraction of tensors gets quantized to `IQ4_NL` when using `IQ1_M`.
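To make the "same 1.75 bpw" claim concrete, here is a rough bit-budget check. This is only a sketch of the arithmetic, not the actual `IQ1_M_R4` block layout in this PR; the 12 bits of index data per group of 8 and the per-16-weight scale granularity are assumptions carried over from mainline `IQ1_M`.

```cpp
// Back-of-the-envelope bit budget for the claim that IQ1_M_R4 stays at 1.75 bpw.
// Sketch only: the per-16-weight scale granularity and the 12 bits of grid
// index/shift data per group of 8 weights are assumed from mainline IQ1_M.
#include <cstdio>

int main() {
    // Mainline IQ1_M: super-block of 256 weights stored in 56 bytes
    // (32 bytes qs + 16 bytes qh + 8 bytes scales, with the f16 super-block
    // scale packed into spare bits of the scales).
    constexpr int    iq1m_block_bits = 56 * 8;                        // 448 bits
    constexpr double bpw_iq1m        = double(iq1m_block_bits) / 256; // 1.75 bpw

    // Assumed IQ1_M_R4 block of 32 weights:
    //   4 groups of 8 weights at 12 bits each (as in IQ1_M)
    //   + two 4-bit scales, no f16 super-block scale.
    constexpr int    r4_block_bits = 4 * 12 + 2 * 4;              // 56 bits
    constexpr double bpw_r4        = double(r4_block_bits) / 32;  // 1.75 bpw

    // The f16 per-row scale adds 16 bits per row, which is negligible:
    // for a row of 4096 weights it is 16/4096 ~ 0.004 bpw.
    printf("IQ1_M: %.3f bpw, IQ1_M_R4 blocks: %.3f bpw\n", bpw_iq1m, bpw_r4);
    return 0;
}
```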
`llama.cpp` arrives at a context-512 perplexity (`PPL(512)` in what follows) of 20.75 for DeepSeek-Lite using 2.74 bpw with `IQ1_M`. The `IQ1_M_R4` quantization in this PR gets `PPL(512) = 8.85` with 1.966 bpw for the repeating layers.

`IQ1_M_R4` is much faster on the CPU compared to `IQ1_M` (see tables below). I never implemented iqk-style GEMM for `IQ1_S`/`IQ1_M`, so these quantization types run at the snail speed of mainline `llama.cpp`.
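For readers not familiar with the `_R4` layouts: the point of interleaving 4 rows is that a single pass over the activations can update 4 dot products at once. The snippet below is a minimal scalar sketch of that idea with plain floats; it is not the actual packed `IQ1_M_R4` kernel, and `gemv_r4` is just an illustrative name.

```cpp
// Scalar sketch of why 4-row interleaving helps CPU GEMV/GEMM.
// Illustrative only: blocks here are plain floats, not packed 1-bit data.
#include <cstddef>
#include <vector>

// With an _R4 layout, the blocks of 4 consecutive rows that share the same
// block index are stored contiguously, so one pass over an activation block
// updates 4 row sums: activations are loaded once instead of 4 times, and the
// 4 accumulators map naturally onto SIMD lanes.
void gemv_r4(const std::vector<float>& interleaved, // [n_blocks][4 rows][block_size]
             const std::vector<float>& x,           // [n_blocks][block_size]
             float* y,                               // 4 output values
             std::size_t n_blocks, std::size_t block_size) {
    float sum[4] = {0, 0, 0, 0};
    for (std::size_t ib = 0; ib < n_blocks; ++ib) {
        const float* w = interleaved.data() + ib * 4 * block_size;
        const float* q = x.data() + ib * block_size;
        for (std::size_t j = 0; j < block_size; ++j) {
            // 4 rows reuse the same activation q[j]; this 4-wide inner update
            // is what a SIMD implementation would vectorize.
            for (int r = 0; r < 4; ++r) sum[r] += w[r * block_size + j] * q[j];
        }
    }
    for (int r = 0; r < 4; ++r) y[r] = sum[r];
}
```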
The following table compares prompt processing (pp512) and token generation (tg128) speed for LLaMA-3.1-8B on `AVX2` (Ryzen-5975WX), `Zen4` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). I didn't use DeepSeek-Lite for this comparison to avoid the difference in quantization types one ends up with due to not all tensors having row sizes that are a multiple of 256.