
Conversation

@ikawrakow (Owner)

  • Same 4.5 bpw as Q4_K.
  • Significantly reduces the quantization error of LLaMA-3.1 (and also 3.0): e.g., 1.77% vs 2.9% for Q4_K_S on LLaMA-3.1-8B (with quantization error defined as PPL(Q)/PPL(fp16) - 1)
  • Non-linear quantization similar to IQ4_XS and IQ4_NL, with the following differences:
    • Blocks of 16 instead of blocks of 32
    • Non-linear values in each block of 16 can be on the original non-linear grid, or on a shifted grid. This is indicated by one bit per block, so we need 16 extra bits per super-block of 256
    • So we need 256 * 4 bits for the quants, 16 * 6 bits for the 6-bit block scales, 16 bits for the super-block float scale, and 16 bits for the shift bits, ending up with (1024 + 96 + 16 + 16)/256 = exactly 4.5 bpw (see the packing sketch after this list)
  • Performance is on par with Q4_K on AVX2 and CUDA, and slightly lower on ARM_NEON and Metal
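To make the bit accounting concrete, here is a minimal sketch in plain C of how one 256-weight super-block could be packed. The struct and field names are made up for illustration and are not the actual layout used in the repo; they only show that the sizes listed above add up to exactly 4.5 bpw.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packing of one 256-weight super-block at 4.5 bpw.
 * Names and field order are illustrative only, not the repo's actual struct. */
typedef struct {
    uint16_t d;           /* fp16 super-block scale                  :   16 bits */
    uint16_t shifts;      /* 1 grid-shift bit per block of 16        :   16 bits */
    uint8_t  scales[12];  /* 16 six-bit block scales, packed         :   96 bits */
    uint8_t  qs[128];     /* 256 four-bit indices into the NL grid   : 1024 bits */
} block_nl4_sketch;       /* total: 1152 bits / 256 weights = 4.5 bpw */

int main(void) {
    assert(sizeof(block_nl4_sketch) == 144);   /* 144 bytes = 1152 bits */
    assert(1152.0 / 256.0 == 4.5);             /* exactly 4.5 bpw       */
    return 0;
}
```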

Iwan Kawrakow added 7 commits July 27, 2024 17:05
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so I need to sort out what he had in mind.
* iqk_mul_mat is not implemented.
For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
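As a quick sanity check of the figures quoted in the commits above, the sketch below recomputes the quantization error with the definition from the PR description, err = PPL(Q)/PPL(fp16) - 1. The fp16 perplexity is not quoted in this thread, so the value used here is back-derived from 6.5258 being 1.77% above fp16; it is an assumption for illustration only.

```c
#include <stdio.h>

/* Quantization error as defined in the PR description. */
static double quant_error(double ppl_q, double ppl_fp16) {
    return ppl_q / ppl_fp16 - 1.0;
}

int main(void) {
    /* fp16 PPL is not stated above; back-derived assumption: 6.5258 / 1.0177 */
    const double ppl_fp16 = 6.5258 / 1.0177;   /* ~6.412 */
    const double ppl_new  = 6.5258;            /* new 4.5 bpw type, from the commit log */

    printf("new type: %.2f%% above fp16\n", 100.0 * quant_error(ppl_new, ppl_fp16));
    /* Should print ~1.77%; q4_K_S at the same size is reported as 2.88%. */
    return 0;
}
```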