
Port IQ*_K quants from ik_llama.cpp#19726

Closed

AesSedai wants to merge 8 commits into ggml-org:master from AesSedai:iq-k-ks-quants

Conversation

@AesSedai
Contributor

@AesSedai AesSedai commented Feb 19, 2026

This PR is an initial effort at porting @ikawrakow's IQ*_K quants from ik_llama.cpp to mainline llama.cpp. Attribution has been provided for the quantization code, and if additional attribution work is required please let me know.

This branch implements the CPU backend for the following quantization types:

  • IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K

Models quantized with these types in ik_llama.cpp should load correctly with this PR, and llama.cpp can now be used to produce those quantizations.

Since this implementation currently covers only the CPU backend, do not expect excellent performance. Further backends (CUDA, Vulkan, etc.) will be added in future PRs.

I made a few small quality-of-life tweaks to test-quantize-fns and it passes for these new quantization types:

$ ./build/bin/test-quantize-fns
...
Testing iq2_k
Testing iq3_k
Testing iq4_k
Testing iq5_k
Testing iq6_k

I have also done some initial KLD testing by quantizing a model with ik_llama.cpp, collecting logits for it, then loading that model in llama.cpp and comparing its KLD:

# Qwen3-4B-Instruct-2507-IQ6_K.gguf
====== Perplexity statistics ======
Mean PPL(Q)                   :  10.910113 ±   1.495592
Mean PPL(base)                :  10.803911 ±   1.477219
Cor(ln(PPL(Q)), ln(PPL(base))):  99.79%
Mean ln(PPL(Q)/PPL(base))     :   0.009782 ±   0.008832
Mean PPL(Q)/PPL(base)         :   1.009830 ±   0.008919
Mean PPL(Q)-PPL(base)         :   0.106202 ±   0.097559

====== KL divergence statistics ======
Mean    KLD:   0.011825 ±   0.000975
Maximum KLD:   0.270169
99.9%   KLD:   0.243841
99.0%   KLD:   0.099079
95.0%   KLD:   0.032769
90.0%   KLD:   0.026773
Median  KLD:   0.005709
10.0%   KLD:   0.000064
 5.0%   KLD:   0.000014
 1.0%   KLD:   0.000000
 0.1%   KLD:  -0.000004
Minimum KLD:  -0.000004

Don't expect the PPL and KLD to match up 100% due to backend differences and general noise. However, the results should be reasonable and in the right ballpark.
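As an aside, the KLD statistics above compare the per-token probability distributions of the quantized and base models. A minimal sketch of that computation on toy logits (illustrative only, not the actual llama-perplexity implementation; the noise model standing in for quantization error is made up):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kld(base_logits: np.ndarray, quant_logits: np.ndarray) -> np.ndarray:
    """Per-token KL(base || quant) over the vocabulary axis."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

# Toy example: 4 token positions, 8-entry vocabulary; quantization error is
# modeled here as small Gaussian noise on the logits (an assumption for
# illustration only).
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
quant = base + rng.normal(scale=0.05, size=(4, 8))

kld = token_kld(base, quant)
print("Mean KLD:", kld.mean())  # small positive value
print("Max  KLD:", kld.max())
```

KL divergence is nonnegative and is zero only when the two distributions match, so a mean KLD near zero (as in the table above) indicates the quantization barely perturbs the model's output distribution.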

Disclaimer: AI was used to translate much of the implementation from ik_llama.cpp, especially the quantizations in ggml/src/ggml-quants.c and ggml/src/ggml-cpu/quants.c

I will follow up with further KLD and PPL testing tomorrow to cover models quanted with each of the newly ported types.

@AesSedai AesSedai marked this pull request as draft February 19, 2026 06:42
@github-actions github-actions bot added the labels testing (Everything test related), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) on Feb 19, 2026
@EAddario
Contributor

Noice!

@CISC
Member

CISC commented Feb 19, 2026

While I would love nothing more, this is a prickly subject...

@JohannesGaessler
Contributor

I cannot review, let alone merge any code written by Iwan Kawrakow unless and until the conflict between him and Georgi Gerganov has been resolved.

@EAddario
Contributor

I do not know the full story that led to the project being forked, so please take the following just as an observation. I do not mean to start a crusade 😁

Both projects, by virtue of being licensed under the MIT license, have given consent to anyone (individuals or entities) to use the code in any way they desire and/or to create derivative works, provided that the original copyright and license notices are clearly included in all copies. That is already the case in both projects.

Owners will of course decide what they accept and need not give any reasons for their decisions, but from a licensing perspective @AesSedai is well within the scope of what's permissible, and many, many people, including yours truly, would ABSOLUTELY LOVE to have IK's quants in llama.cpp, as his work is nothing short of exceptional, just as GG's is.

In the end, one can only hope ☮️

@pwilkin
Contributor

pwilkin commented Feb 19, 2026

I cannot review, let alone merge any code written by Iwan Kawrakow unless and until the conflict between him and Georgi Gerganov has been resolved.

I think @AesSedai actually rewrote the code himself to avoid that exact problem :)

@ikawrakow
Contributor

It seems someone propagated the urban legend around the Internet that I was going to sue llama.cpp developers if they copied code that I wrote. Or at least so am I told as I don't hang around the channels where this is being claimed. Given @pwilkin's comment, I'm going to add my 2 cents before this gets even more out of hand.

First: in its current form, the PR is perfectly fine with me.

Second: no, I'm not going to sue llama.cpp contributors (or anyone else for that matter). I have better things to do. If anything, I should go and publicly shame those who have lifted code and ideas out of ik_llama.cpp, and added them here without acknowledging the origin. But I'm not doing even that, other than the occasional sarcastic comment in my repository about the fully independent llama.cpp discoveries, which, by some miracle, tend to occur hours or days or weeks after being published in ik_llama.cpp.

Third:
@pwilkin In what sense has the code been rewritten? Here is an example screenshot of vimdiff of the IQ2_KS quantization function for you.

[Screenshot: vimdiff of the IQ2_KS quantization function in both projects]

This is a copy, and not a rewrite. In the current state of the PR, where the origin of this code and the copyright is being acknowledged, this is perfectly fine and in the spirit of the MIT license under which the original code has been published:

[Screenshot: attribution notice in the PR's code]

If that were removed because some people believe it is inconvenient to maintain copyright notices (as has already happened), that would not be fine. Am I going to go and sue llama.cpp developers in that case? No, of course not. I should publicly shame them instead. But I'm not doing even that, see above.

If you want to see "rewritten" or "ported" code, please take a look at the Qwen3-Next implementation in ik_llama.cpp, which did indeed start from your Qwen3-Next implementation here, but ended up being a "port" (or "rewrite", depending on your preference), and not a copy.

@pwilkin
Contributor

pwilkin commented Feb 19, 2026

@ikawrakow okay, sorry, my bad, I was under the impression that it was a rewrite based on previous conversations, I shouldn't have jumped to conclusions before reading the code :)

@AesSedai
Contributor Author

I had previously discussed doing a clean-room implementation of the trellis quants based on the QTIP paper, and that's what @pwilkin was referring to, but for this PR I opted to copy/port IK's quantization code directly, as it would be nice to have interoperability for the model quants.

My Kimi-K2.5 vision PR for llama.cpp has been mostly lifted and shifted into ik_llama.cpp as well (ikawrakow/ik_llama.cpp#1280), and there's a pretty robust history of PRs sharing code between these two projects.

It looks like the CI runner isn't happy anyway, so there's some more work I'll need to do to clean that up later today.

I appreciate everyone who has weighed in on this issue and hope this effort can continue forward.

@ikawrakow
Contributor

@AesSedai

I had previously discussed doing a clean room implementation of the trellis quants based on the QTIP paper

The Trellis used in ik_llama.cpp is different from the Trellises in the QTIP paper. It uses integer math only, which allows for a vastly more efficient implementation on the CPU (IIRC, 3-4X compared to the QTIP paper Trellis as initially implemented in ik_llama.cpp). CUDA is faster too, but to a much lesser extent than the CPU. Hence, if you did a clean room implementation based on the QTIP paper, that would not be compatible with the IQ1_KT, IQ2_KT, IQ3_KT and IQ4_KT quants in ik_llama.cpp (and so, the @ubergarm models couldn't be used here). Just pointing out in case this is not already known.
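To make the integer-math idea above concrete, here is a purely illustrative toy: a pseudo-random "trellis" value generator driven entirely by integer state transitions. The constants and the byte-to-value mapping are made up for this sketch and do not match ik_llama.cpp's actual scheme:

```python
def toy_trellis_values(state: int, n: int) -> list[int]:
    """Generate n small signed values from an integer-only state sequence.

    Toy LCG constants; ik_llama.cpp's real trellis uses a different scheme.
    """
    MULT, ADD, MASK = 0x343FD, 0x269EC3, 0xFFFFFFFF
    out = []
    for _ in range(n):
        state = (state * MULT + ADD) & MASK   # 32-bit integer update, no floats
        out.append(((state >> 24) & 0xFF) - 128)  # map high byte to [-128, 127]
    return out

# The same seed always reproduces the same sequence, so a quantizer only
# needs to store the seeds to reconstruct the weight values.
print(toy_trellis_values(12345, 8))
```

The appeal of an integer-only update like this is that it vectorizes well with plain integer SIMD on CPUs, which is the efficiency point made above.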

@AesSedai
Contributor Author

Yep, I initially wanted to include just the IQ_KS quants here but stumbled into including the IQ_Ks too due to some llama_ftype and other llama-quantize dependencies. I probably should have backed out the IQ_KS quants and focused on just the IQ_K quants for this first round, but I already had the inference code for IQ_KS working. In the spirit of not going too overboard, I also skipped the AVX-based implementations, but I'd love to dig into those since there are some substantial perf gains to be had there.

The IQ_KT quants would be a future project, and yes any of @ubergarm's quants using KT wouldn't work here.

Thanks for the insight about the Trellis differences too, I hadn't reviewed the ik_llama implementation for those but forewarned is forearmed! As a dev with a server mostly consisting of RAM instead of VRAM, I appreciate the CPU-based performance considerations you've developed.

@AesSedai
Contributor Author

Removed the KS quants from this PR (will add them in a later PR) to simplify review, and fixed a const issue with IQ6_K.

These are the results from some quantization and llama-perplexity testing on Qwen3-4B-Instruct-2507

ik_llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity

| Quant Type | Size | BPW | PPL(Q)/PPL(base) | Mean KLD | Same Top P |
|---|---|---|---|---|---|
| IQ2_K | 1.28 GiB | 2.73 | 0.999823 ± 0.005020 | 0.012506 ± 0.000749 | 93.41% |
| IQ3_K | 1.67 GiB | 3.57 | 0.994594 ± 0.003895 | 0.005542 ± 0.000260 | 96.08% |
| IQ4_K | 2.21 GiB | 4.72 | 1.004556 ± 0.003485 | 0.004944 ± 0.000198 | 96.16% |
| IQ5_K | 2.62 GiB | 5.60 | 0.991185 ± 0.003721 | 0.004930 ± 0.000239 | 94.90% |
| IQ6_K | 3.1 GiB | 6.63 | 1.001598 ± 0.003613 | 0.005122 ± 0.000366 | 94.98% |

The IQ2_K KLD looks a little high in comparison to the others, and when I used llama.cpp to quantize the model and compared that against the ik_llama.cpp logits, it does tell me something is still fishy:

llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity

| Quant Type | Size | BPW | PPL(Q)/PPL(base) | Mean KLD | Same Top P |
|---|---|---|---|---|---|
| IQ2_K | 1.3 GiB | 2.78 | 1.485431 ± 0.045757 | 0.435849 ± 0.012438 | 62.90% |
| IQ3_K | 1.75 GiB | 3.74 | 0.965421 ± 0.010749 | 0.049772 ± 0.002152 | 87.92% |
| IQ4_K | 2.2 GiB | 4.70 | 1.011881 ± 0.006546 | 0.011057 ± 0.000506 | 94.75% |
| IQ5_K | 2.62 GiB | 5.60 | 0.999889 ± 0.003887 | 0.005916 ± 0.000342 | 94.82% |
| IQ6_K | 3.1 GiB | 6.62 | 1.005003 ± 0.004177 | 0.007996 ± 0.000293 | 93.57% |

@AesSedai AesSedai changed the title Port IQ*_K and IQ*_KS quants from ik_llama.cpp Port IQ*_K quants from ik_llama.cpp Feb 19, 2026
@AesSedai
Contributor Author

I forgot to handle the new quantization types in llama_tensor_get_type, which was causing the llama.cpp-quantized models to look weird.

llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity

| Quant Type | Size | BPW | PPL(Q)/PPL(base) | Mean KLD | Same Top P |
|---|---|---|---|---|---|
| IQ2_K | 1.28 GiB | 2.73 | 1.002670 ± 0.005058 | 0.012151 ± 0.000767 | 94.35% |
| IQ3_K | 1.67 GiB | 3.57 | 0.996797 ± 0.003826 | 0.005451 ± 0.000205 | 95.77% |
| IQ4_K | 2.16 GiB | 4.62 | 1.009513 ± 0.005691 | 0.013857 ± 0.000477 | 93.33% |
| IQ5_K | 2.62 GiB | 5.60 | 0.999889 ± 0.003887 | 0.005916 ± 0.000342 | 94.82% |
| IQ6_K | 3.1 GiB | 6.62 | 1.005003 ± 0.004177 | 0.007996 ± 0.000293 | 93.57% |

@CISC
Member

CISC commented Feb 23, 2026

@AesSedai @ikawrakow Looks like IQ4_K is slightly broken on big endian:
https://github.com/ggml-org/llama.cpp/actions/runs/22297154038/job/64496145701?pr=19726#step:8:84739

@AesSedai
Contributor Author

@CISC Yeah, I saw that failed CI run. I got the Windows compilation issues fixed, but I'll need to do some investigation into the s390x / big-endian quantization later today after work.

After doing some more digging last night, I think the iqk computation path would be needed to get the KLD closer, but that is a LOT of code to add and way beyond the scope I'm attempting in this PR. I did a bit of toying around with ik_llama.cpp to disable the internal conversion with iqk_convert_iq2_k_q8_k_r8, and that brought the IQ2_K down from ~0.0125 to ~0.00105, which is a little better. I don't think this affects model creation with llama-quantize, so adding iqk_mul_mat in the future would be an inference-time performance / accuracy upgrade.

@ggerganov what are your thoughts about this PR outside of that failing CI that I need to fix? Johannes has stated that he won't consider reviewing or merging this until you weigh in on it, and IK has said that he's fine with the PR in its current form. I understand if you don't want to merge this in, but I'd like to ask you to kindly consider it.

@JohannesGaessler
Contributor

To clarify my position, regardless of whatever Georgi's position is I would still be opposed to merging any of Iwan's code due to the following points:

@ggerganov
Member

@JohannesGaessler Thank you for bringing this latest discussion to my attention.

@AesSedai I apologize, but I will have to close this PR. Thank you for your effort.

@ggerganov ggerganov closed this Feb 23, 2026
@ggml-org ggml-org locked and limited conversation to collaborators Feb 23, 2026
