
Conversation

@ikawrakow
Owner

I was experimenting with LLaMA-4-Scout quantization and was bothered by the extremely long quantization time of IQ1_M, so I looked into speeding things up.

This PR improves IQ1_M quantization speed by a huge margin. There is also a minor improvement in quantization accuracy.

The table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B¹ (L1-7B in the table), LLaMA-v2-7B¹ (L2-7B), Mistral-7B¹ (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time in the table) in seconds on a Ryzen-7950X CPU. Unlike earlier quantization improvement PRs, which used "pure" quantization (the --pure command line option in llama-quantize), here the default IQ1_M quantization mix is tested.

| Model | Quantization | PPL (main) | PPL (this PR) | Q-time (main), s | Q-time (this PR), s |
|-------|--------------|------------|---------------|------------------|---------------------|
| L1-7B | IQ1_M | 10.9274 | 10.8046 | N/A² | N/A² |
| L2-7B | IQ1_M | 10.7642 | 10.6809 | 129.4 | 52.8 |
| M-7B  | IQ1_M | 9.6336  | 9.6236  | 146.1 | 58.4 |
| L3-8B | IQ1_M | 22.7422 | 21.9715 | 148.1 | 60.0 |
| DSL   | IQ1_M | 9.2758  | 9.1137  | 267.4 | 109.2 |

Speedup for the default IQ1_M quantization mix is in the range of 2.5X. When quantizing pure IQ1_M, the speedup is about 3X.
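For anyone wanting to reproduce these numbers, the following is a minimal sketch using the stock llama-quantize and llama-perplexity tools. The model, imatrix, and test-set file names are placeholders, and IQ1_M quantization is assumed to be run with an importance matrix as usual; only the --pure flag distinguishes the pure-IQ1_M case from the default mix timed in the table.

```bash
# Default IQ1_M quantization mix (what is timed in the table above).
# IQ1_M needs an importance matrix; "imatrix.dat" is a placeholder name.
./bin/llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq1_m.gguf IQ1_M

# Pure IQ1_M (all quantizable tensors forced to IQ1_M), as used in earlier PRs.
./bin/llama-quantize --imatrix imatrix.dat --pure model-f16.gguf model-iq1_m-pure.gguf IQ1_M

# PPL with a context of 512 tokens, as used for the table.
./bin/llama-perplexity -m model-iq1_m.gguf -f wiki.test.raw -c 512
```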


¹ Why use such ancient models? The LLaMA-v1 models were the basis for k-quants development. I-quants were developed using LLaMA-v1, LLaMA-v2 and Mistral-7B. In my experience, if a quantization technique does well on all 3 of these, it is (almost) guaranteed to do well on any other model out there.

² I have this model on an old HDD. In this case quantization time is dominated by the time needed to read the data from the HDD. I could have copied the model to the SSD drive, but I think the timings for the other models give enough indication of the relative performance.

@ikawrakow ikawrakow merged commit d210661 into main Apr 13, 2025
