
Port IQ*_K quants from ik_llama.cpp#19726

Closed

AesSedai wants to merge 8 commits into ggml-org:master from AesSedai:iq-k-ks-quants

Conversation

@AesSedai
Contributor

@AesSedai AesSedai commented Feb 19, 2026

This PR is an initial effort at porting @ikawrakow's IQ*_K quants from ik_llama.cpp to mainline llama.cpp. Attribution has been provided for the quantization code, and if additional attribution work is required please let me know.

This branch implements the CPU backend for the following quantization types:

  • IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K

Models quantized with these types in ik_llama.cpp should load correctly with this PR, and llama.cpp can now be used to produce those quantizations.

Since this implementation currently covers only the CPU backend, do not expect excellent performance. Further backends (CUDA, Vulkan, etc.) will be added in future PRs.

I made a few small quality-of-life tweaks to test-quantize-fns and it passes for these new quantization types:

$ ./build/bin/test-quantize-fns
...
Testing iq2_k
Testing iq3_k
Testing iq4_k
Testing iq5_k
Testing iq6_k

I have also done some initial KLD testing by quantizing a model with ik_llama.cpp, collecting logits for it, then loading that model in llama.cpp and comparing its KLD:

# Qwen3-4B-Instruct-2507-IQ6_K.gguf
====== Perplexity statistics ======
Mean PPL(Q)                   :  10.910113 ±   1.495592
Mean PPL(base)                :  10.803911 ±   1.477219
Cor(ln(PPL(Q)), ln(PPL(base))):  99.79%
Mean ln(PPL(Q)/PPL(base))     :   0.009782 ±   0.008832
Mean PPL(Q)/PPL(base)         :   1.009830 ±   0.008919
Mean PPL(Q)-PPL(base)         :   0.106202 ±   0.097559

====== KL divergence statistics ======
Mean    KLD:   0.011825 ±   0.000975
Maximum KLD:   0.270169
99.9%   KLD:   0.243841
99.0%   KLD:   0.099079
95.0%   KLD:   0.032769
90.0%   KLD:   0.026773
Median  KLD:   0.005709
10.0%   KLD:   0.000064
 5.0%   KLD:   0.000014
 1.0%   KLD:   0.000000
 0.1%   KLD:  -0.000004
Minimum KLD:  -0.000004

Don't expect the PPL and KLD to match up 100% due to backend differences and general noise. However, the results should be reasonable and in the right ballpark.
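As an aside, the KLD statistics above compare the per-token probability distributions of the quantized and base models. A minimal sketch of that computation on toy logits (illustrative only, not the actual llama-perplexity implementation; the noise model standing in for quantization error is made up):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kld(base_logits: np.ndarray, quant_logits: np.ndarray) -> np.ndarray:
    """Per-token KL(base || quant) over the vocabulary axis."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

# Toy example: 4 token positions, 8-entry vocabulary; quantization error is
# modeled here as small Gaussian noise on the logits (an assumption for
# illustration only).
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
quant = base + rng.normal(scale=0.05, size=(4, 8))

kld = token_kld(base, quant)
print("Mean KLD:", kld.mean())  # small positive value
print("Max  KLD:", kld.max())
```

KL divergence is nonnegative and is zero only when the two distributions match, so a mean KLD near zero (as in the table above) indicates the quantization barely perturbs the model's output distribution.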

Disclaimer: AI was used to translate much of the implementation from ik_llama.cpp, especially the quantizations in ggml/src/ggml-quants.c and ggml/src/ggml-cpu/quants.c

I will follow up with further KLD and PPL testing tomorrow to cover models quanted with each of the newly ported types.

@AesSedai AesSedai marked this pull request as draft February 19, 2026 06:42
@github-actions github-actions bot added the labels testing (Everything test related), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) on Feb 19, 2026
@EAddario
Contributor

Noice!

@CISC
Member

CISC commented Feb 19, 2026

While I would love nothing more, this is a prickly subject...

@JohannesGaessler
Contributor

I cannot review, let alone merge any code written by Iwan Kawrakow unless and until the conflict between him and Georgi Gerganov has been resolved.

@EAddario
Contributor

I do not know the full story that led to the project being forked, so please take the following just as an observation. I do not mean to start a crusade 😁

Both projects, by virtue of being licensed under the MIT license, have given consent to anyone (individuals or entities) to use the code in any way they desire and/or to create derivative works, provided that the original copyright and license notices are clearly included in all copies. That is already the case in both projects.

Owners will of course decide what they accept and need not give any reasons for their decisions, but from a licensing perspective @AesSedai is well within the scope of what's permissible, and many, many people, including yours truly, would ABSOLUTELY LOVE to have IK's quants in llama.cpp, as his work is nothing short of exceptional, just as GG's is.

In the end, one can only hope ☮️

@pwilkin
Contributor

pwilkin commented Feb 19, 2026

I cannot review, let alone merge any code written by Iwan Kawrakow unless and until the conflict between him and Georgi Gerganov has been resolved.

I think @AesSedai actually rewrote the code himself to avoid that exact problem :)

@ikawrakow
Contributor

It seems someone propagated the urban legend around the Internet that I was going to sue llama.cpp developers if they copied code that I wrote. Or at least so am I told as I don't hang around the channels where this is being claimed. Given @pwilkin's comment, I'm going to add my 2 cents before this gets even more out of hand.

First: in its current form, the PR is perfectly fine with me.

Second: no, I'm not going to sue llama.cpp contributors (or anyone else for that matter). I have better things to do. If anything, I should go and publicly shame those who have lifted code and ideas out of ik_llama.cpp, and added them here without acknowledging the origin. But I'm not doing even that, other than the occasional sarcastic comment in my repository about the fully independent llama.cpp discoveries, which, by some miracle, tend to occur hours or days or weeks after being published in ik_llama.cpp.

Third:
@pwilkin In what sense has the code been rewritten? Here is an example screenshot of vimdiff of the IQ2_KS quantization function for you.

[Screenshot: vimdiff of the IQ2_KS quantization function in both projects]

This is a copy, and not a rewrite. In the current state of the PR, where the origin of this code and the copyright is being acknowledged, this is perfectly fine and in the spirit of the MIT license under which the original code has been published:

[Screenshot: attribution notice in the PR's code]

If that were removed because some people believe it is inconvenient to maintain copyright notices (as has already happened), that would not be fine. Am I going to go and sue llama.cpp developers in that case? No, of course not. I should publicly shame them instead. But I'm not doing even that, see above.

If you want to see "rewritten" or "ported" code, please take a look at the Qwen3-Next implementation in ik_llama.cpp, which did indeed start from your Qwen3-Next implementation here, but ended up being a "port" (or "rewrite", depending on your preference), and not a copy.

@pwilkin
Contributor

pwilkin commented Feb 19, 2026

@ikawrakow okay, sorry, my bad, I was under the impression that it was a rewrite based on previous conversations, I shouldn't have jumped to conclusions before reading the code :)

@AesSedai
Contributor Author

I had previously discussed doing a clean-room implementation of the trellis quants based on the QTIP paper, and that's what @pwilkin was referring to, but for this PR I opted to copy/port IK's quantization code directly, as it would be nice to have interoperability for the model quants.

My Kimi-K2.5 vision PR for llama.cpp has been mostly lifted and shifted into ik_llama.cpp as well (ikawrakow/ik_llama.cpp#1280), and there's a pretty robust history of PRs sharing code between these two projects.

It looks like the CI runner isn't happy anyway, so there's some more work I'll need to do to clean that up later today.

I appreciate everyone who has weighed in on this issue and hope this effort can continue forward.

@ikawrakow
Contributor

@AesSedai

I had previously discussed doing a clean room implementation of the trellis quants based on the QTIP paper

The Trellis used in ik_llama.cpp is different from the Trellises in the QTIP paper. It uses integer math only, which allows for a vastly more efficient implementation on the CPU (IIRC, 3-4X compared to the QTIP paper Trellis as initially implemented in ik_llama.cpp). CUDA is faster too, but to a much lesser extent than the CPU. Hence, if you did a clean room implementation based on the QTIP paper, that would not be compatible with the IQ1_KT, IQ2_KT, IQ3_KT and IQ4_KT quants in ik_llama.cpp (and so, the @ubergarm models couldn't be used here). Just pointing out in case this is not already known.
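To make the integer-math idea above concrete, here is a purely illustrative toy: a pseudo-random "trellis" value generator driven entirely by integer state transitions. The constants and the byte-to-value mapping are made up for this sketch and do not match ik_llama.cpp's actual scheme:

```python
def toy_trellis_values(state: int, n: int) -> list[int]:
    """Generate n small signed values from an integer-only state sequence.

    Toy LCG constants; ik_llama.cpp's real trellis uses a different scheme.
    """
    MULT, ADD, MASK = 0x343FD, 0x269EC3, 0xFFFFFFFF
    out = []
    for _ in range(n):
        state = (state * MULT + ADD) & MASK   # 32-bit integer update, no floats
        out.append(((state >> 24) & 0xFF) - 128)  # map high byte to [-128, 127]
    return out

# The same seed always reproduces the same sequence, so a quantizer only
# needs to store the seeds to reconstruct the weight values.
print(toy_trellis_values(12345, 8))
```

The appeal of an integer-only update like this is that it vectorizes well with plain integer SIMD on CPUs, which is the efficiency point made above.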

@AesSedai
Contributor Author

Yep, I initially wanted to include just the IQ_KS quants here but stumbled into including the IQ_Ks too due to some llama_ftype and other llama-quantize dependencies. I probably should have backed out the IQ_KS quants and focused on just the IQ_K quants for this first round, but I already had the inference code for IQ_KS working. In the spirit of not going too overboard, I also skipped the AVX-based implementations, but I'd love to dig into those since there are some substantial perf gains to be had there.

The IQ_KT quants would be a future project, and yes any of @ubergarm's quants using KT wouldn't work here.

Thanks for the insight about the Trellis differences too, I hadn't reviewed the ik_llama implementation for those but forewarned is forearmed! As a dev with a server mostly consisting of RAM instead of VRAM, I appreciate the CPU-based performance considerations you've developed.

@AesSedai
Contributor Author

Removed the KS quants from this PR (will add them in a later PR) to simplify review, and fixed a const issue with IQ6_K.

These are the results from some quantization and llama-perplexity testing on Qwen3-4B-Instruct-2507

ik_llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity

| Quant Type | Size | BPW | PPL(Q)/PPL(base) | Mean KLD | Same Top P |
|---|---|---|---|---|---|
| IQ2_K | 1.28 GiB | 2.73 | 0.999823 ± 0.005020 | 0.012506 ± 0.000749 | 93.41% |
| IQ3_K | 1.67 GiB | 3.57 | 0.994594 ± 0.003895 | 0.005542 ± 0.000260 | 96.08% |
| IQ4_K | 2.21 GiB | 4.72 | 1.004556 ± 0.003485 | 0.004944 ± 0.000198 | 96.16% |
| IQ5_K | 2.62 GiB | 5.60 | 0.991185 ± 0.003721 | 0.004930 ± 0.000239 | 94.90% |
| IQ6_K | 3.1 GiB | 6.63 | 1.001598 ± 0.003613 | 0.005122 ± 0.000366 | 94.98% |

The IQ2_K KLD looks a little high in comparison to the others, and when I used llama.cpp to quantize the model and compared that against the ik_llama.cpp logits, it does tell me something is still fishy:

llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity

| Quant Type | Size | BPW | PPL(Q)/PPL(base) | Mean KLD | Same Top P |
|---|---|---|---|---|---|
| IQ2_K | 1.3 GiB | 2.78 | 1.485431 ± 0.045757 | 0.435849 ± 0.012438 | 62.90% |
| IQ3_K | 1.75 GiB | 3.74 | 0.965421 ± 0.010749 | 0.049772 ± 0.002152 | 87.92% |
| IQ4_K | 2.2 GiB | 4.70 | 1.011881 ± 0.006546 | 0.011057 ± 0.000506 | 94.75% |
| IQ5_K | 2.62 GiB | 5.60 | 0.999889 ± 0.003887 | 0.005916 ± 0.000342 | 94.82% |
| IQ6_K | 3.1 GiB | 6.62 | 1.005003 ± 0.004177 | 0.007996 ± 0.000293 | 93.57% |

@AesSedai AesSedai changed the title Port IQ*_K and IQ*_KS quants from ik_llama.cpp Port IQ*_K quants from ik_llama.cpp Feb 19, 2026
@AesSedai
Contributor Author

I forgot to handle the new quantization types in llama_tensor_get_type, which was causing the llama.cpp-quantized models to look weird.

llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity

| Quant Type | Size | BPW | PPL(Q)/PPL(base) | Mean KLD | Same Top P |
|---|---|---|---|---|---|
| IQ2_K | 1.28 GiB | 2.73 | 1.002670 ± 0.005058 | 0.012151 ± 0.000767 | 94.35% |
| IQ3_K | 1.67 GiB | 3.57 | 0.996797 ± 0.003826 | 0.005451 ± 0.000205 | 95.77% |
| IQ4_K | 2.16 GiB | 4.62 | 1.009513 ± 0.005691 | 0.013857 ± 0.000477 | 93.33% |
| IQ5_K | 2.62 GiB | 5.60 | 0.999889 ± 0.003887 | 0.005916 ± 0.000342 | 94.82% |
| IQ6_K | 3.1 GiB | 6.62 | 1.005003 ± 0.004177 | 0.007996 ± 0.000293 | 93.57% |

@CISC
Member

CISC commented Feb 23, 2026

@AesSedai @ikawrakow Looks like IQ4_K is slightly broken on big endian:
https://github.com/ggml-org/llama.cpp/actions/runs/22297154038/job/64496145701?pr=19726#step:8:84739

@AesSedai
Contributor Author

@CISC Yeah, I saw that failed CI run. I got the Windows compilation issues fixed, but I'll need to do some investigation into the s390x / big-endian quantization later today after work.

After doing some more digging last night, I think the iqk computation path would be needed to get the KLD closer, but that is a LOT of code to add and way beyond the scope I'm attempting in this PR. I did a bit of toying around with ik_llama.cpp to disable the internal conversion with iqk_convert_iq2_k_q8_k_r8, and that brought the IQ2_K down from ~0.0125 to ~0.00105, which is a little better. I don't think this affects model creation with llama-quantize, so adding iqk_mul_mat in the future would be an inference-time performance / accuracy upgrade.

@ggerganov what are your thoughts about this PR outside of that failing CI that I need to fix? Johannes has stated that he won't consider reviewing or merging this until you weigh in on it, and IK has said that he's fine with the PR in its current form. I understand if you don't want to merge this in, but I'd like to ask you to kindly consider it.

@JohannesGaessler
Contributor

To clarify my position, regardless of whatever Georgi's position is I would still be opposed to merging any of Iwan's code due to the following points:

@ggerganov
Member

@JohannesGaessler Thank you for bringing this latest discussion to my attention.

@AesSedai I apologize, but I will have to close this PR. Thank you for your effort.

@ggerganov ggerganov closed this Feb 23, 2026
@ggml-org ggml-org locked and limited conversation to collaborators Feb 23, 2026
