Adding IQ3_KS quants #566
This gives us PP-512 = 360 t/s.
This gives us PP-512 = 164 t/s.
|
Let's merge this so people don't get crashes when trying to run |
|
On a SicariusSicariiStuff_Nano_Imp_1B-bf16 Llama 3.2 1B model I had on my drive. PPL 512 wikitest eng. IQ3_KT FTYPE Also, merged successfully on Croco.cpp, and it infers properly. |
|
@ikawrakow: You brought us SOTA quants in the 2.1-2.2x bpw and 3.1x-3.2 bpw ranges with the KS and KT quants (so IQ2_XXS, IQ2_XS, and IQ3_XXS are now close to obsolescence), and IQ2_K/IQ2_S remain on duty, but there is now a gap in SOTA quants in the 2.4-3.1 bpw range. Would it be mathematically possible, and interesting for you, to develop a new IQ2_KL quant (in the 2.6875-2.75 bpw range?) that offers a much more performant alternative to IQ2_S and IQ2_K, in line with what you developed recently? |
|
I have been thinking about this, but don't have a good idea how to spend the extra bits (extra compared to, e.g., |
|
Thanks for the explanation. I understand that the alternatives you have atm are quite impractical. In any case, thank you for IQ3_KS (and the CUDA MMQ kernels you kindly provided for most quants); it completes the KS quant set, which is more practical to quantize than the already very demanding Trellis set. I'm very happy with all of this, compared to what mainline limits itself to atm. |
This PR adds IQ3_KS - 3.1875 bpw quants with a block size of 32. This makes the IQX_KS quant series complete.
CUDA and CPU performance are very good, Metal is not so great.
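As a quick sanity check on the 3.1875 bpw figure, the sketch below works out how many bits per 32-weight block are left over once each weight has its base quantized value. It assumes 3 base bits per weight (suggested by the name, but not stated in the PR text) and ignores any per-row scale. The function names are hypothetical and are not taken from the code base. How the leftover bits are split between scales and other metadata is not covered here.

```python
from fractions import Fraction

def bits_per_weight(block_size, quant_bits, overhead_bits_per_block):
    """Average bits per weight for a block-quantized format (exact rational)."""
    total_bits = block_size * quant_bits + overhead_bits_per_block
    return Fraction(total_bits, block_size)

def overhead_bits(block_size, bpw, quant_bits):
    """Bits per block not spent on the per-weight quantized values."""
    return bpw * block_size - quant_bits * block_size

# Figures from the PR description: 3.1875 bpw, block size 32.
# ASSUMPTION: 3 base bits per weight; per-row scales are ignored.
bpw = Fraction("3.1875")
extra = overhead_bits(32, bpw, 3)

print(f"bits per block of 32: {bpw * 32}")
print(f"overhead bits per block: {extra}")
print(f"round trip bpw: {float(bits_per_weight(32, 3, extra))}")
```

Under these assumptions, a block of 32 weights takes 102 bits: 96 for the 3-bit values plus 6 bits of per-block metadata.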
Here are a few sweep-benches for Llama-3.1-8B-Instruct
RTX-4080
Ryzen-7950X (Zen4)
Ryzen-5975WX
M2-Max CPU
M2-Max 30-core GPU