Following in the footsteps of #185, this PR adds `IQ1_M_R4`, a 4-row interleaved version of `IQ1_M`.

I have removed the `f16` super-block scale (replaced with an `f16` per-row scale) and changed the 3-bit `IQ1_M` block scales to 4 bits. Hence, we end up using the same 1.75 bpw as `IQ1_M`. `IQ1_M_R4` uses a block size of 32. I wanted to have this because DeepSeek-Lite, the model I'm testing with, has a lot of tensors with row sizes not divisible by 256, so a significant fraction of tensors gets quantized to `IQ4_NL` when using `IQ1_M`.
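To make the "same 1.75 bpw" claim concrete, here is a rough bit-budget check. This is only a sketch of the arithmetic, not the actual `IQ1_M_R4` block layout in this PR; the 12 bits of index data per group of 8 and the per-16-weight scale granularity are assumptions carried over from mainline `IQ1_M`.

```cpp
// Back-of-the-envelope bit budget for the claim that IQ1_M_R4 stays at 1.75 bpw.
// Sketch only: the per-16-weight scale granularity and the 12 bits of grid
// index/shift data per group of 8 weights are assumed from mainline IQ1_M.
#include <cstdio>

int main() {
    // Mainline IQ1_M: super-block of 256 weights stored in 56 bytes
    // (32 bytes qs + 16 bytes qh + 8 bytes scales, with the f16 super-block
    // scale packed into spare bits of the scales).
    constexpr int    iq1m_block_bits = 56 * 8;                        // 448 bits
    constexpr double bpw_iq1m        = double(iq1m_block_bits) / 256; // 1.75 bpw

    // Assumed IQ1_M_R4 block of 32 weights:
    //   4 groups of 8 weights at 12 bits each (as in IQ1_M)
    //   + two 4-bit scales, no f16 super-block scale.
    constexpr int    r4_block_bits = 4 * 12 + 2 * 4;              // 56 bits
    constexpr double bpw_r4        = double(r4_block_bits) / 32;  // 1.75 bpw

    // The f16 per-row scale adds 16 bits per row, which is negligible:
    // for a row of 4096 weights it is 16/4096 ~ 0.004 bpw.
    printf("IQ1_M: %.3f bpw, IQ1_M_R4 blocks: %.3f bpw\n", bpw_iq1m, bpw_r4);
    return 0;
}
```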
`llama.cpp` arrives at a context-512 perplexity (`PPL(512)` in what follows) of 20.75 for DeepSeek-Lite using 2.74 bpw with `IQ1_M`. The `IQ1_M_R4` quantization in this PR gets `PPL(512) = 8.85` with 1.966 bpw for the repeating layers.

`IQ1_M_R4` is much faster on the CPU compared to `IQ1_M` (see tables below). I never implemented iqk-style GEMM for `IQ1_S`/`IQ1_M`, so these quantization types run at the snail speed of mainline `llama.cpp`.
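For readers not familiar with the `_R4` layouts: the point of interleaving 4 rows is that a single pass over the activations can update 4 dot products at once. The snippet below is a minimal scalar sketch of that idea with plain floats; it is not the actual packed `IQ1_M_R4` kernel, and `gemv_r4` is just an illustrative name.

```cpp
// Scalar sketch of why 4-row interleaving helps CPU GEMV/GEMM.
// Illustrative only: blocks here are plain floats, not packed 1-bit data.
#include <cstddef>
#include <vector>

// With an _R4 layout, the blocks of 4 consecutive rows that share the same
// block index are stored contiguously, so one pass over an activation block
// updates 4 row sums: activations are loaded once instead of 4 times, and the
// 4 accumulators map naturally onto SIMD lanes.
void gemv_r4(const std::vector<float>& interleaved, // [n_blocks][4 rows][block_size]
             const std::vector<float>& x,           // [n_blocks][block_size]
             float* y,                               // 4 output values
             std::size_t n_blocks, std::size_t block_size) {
    float sum[4] = {0, 0, 0, 0};
    for (std::size_t ib = 0; ib < n_blocks; ++ib) {
        const float* w = interleaved.data() + ib * 4 * block_size;
        const float* q = x.data() + ib * block_size;
        for (std::size_t j = 0; j < block_size; ++j) {
            // 4 rows reuse the same activation q[j]; this 4-wide inner update
            // is what a SIMD implementation would vectorize.
            for (int r = 0; r < 4; ++r) sum[r] += w[r * block_size + j] * q[j];
        }
    }
    for (int r = 0; r < 4; ++r) y[r] = sum[r];
}
```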
The following table compares prompt processing (pp512) and token generation (tg128) speed for LLaMA-3.1-8B on `AVX2` (Ryzen-5975WX), `Zen4` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). I didn't use DeepSeek-Lite for this comparison to avoid the difference in quantization types one ends up with due to not all tensors having row sizes that are a multiple of 256.