Port IQ*_K quants from ik_llama.cpp#19726
Conversation
|
Noice! |
|
While I would love nothing more, this is a prickly subject... |
|
I cannot review, let alone merge any code written by Iwan Kawrakow unless and until the conflict between him and Georgi Gerganov has been resolved. |
|
I do not know the full story that led to the project being forked, so please take the following just as an observation; I do not mean to start a crusade 😁 Both project owners, by virtue of licensing their projects under the MIT license, have given consent to anyone (individuals or entities) to use the code in any way they desire and/or to create derivative works, provided that the original copyright and license notices are clearly included in all copies. That is already the case on both projects. Owners will of course decide what they accept and need not give any reasons for their decisions, but from the licensing perspective, @AesSedai is well within the scope of what's permissible, and many, many people, including yours truly, would ABSOLUTELY LOVE to have IK's quants in llama.cpp, as his work is nothing short of exceptional, just as GG's is. In the end, one can only hope ☮️ |
I think @AesSedai actually rewrote the code himself to avoid that exact problem :) |
|
It seems someone propagated the urban legend around the Internet that I was going to sue. First: in its current form, the PR is perfectly fine with me. Second: no, I'm not going to sue. Third:
This is a copy, and not a rewrite. In the current state of the PR, where the origin of this code and the copyright is being acknowledged, this is perfectly fine and in the spirit of the MIT license under which the original code has been published:
If that were to be removed because some people believe it is inconvenient to maintain copyright notices (as has already happened), that would not be fine. Am I going to go and sue? If you want to see "rewritten" or "ported" code, please take a look at the Qwen3-Next implementation in |
|
@ikawrakow okay, sorry, my bad, I was under the impression that it was a rewrite based on previous conversations, I shouldn't have jumped to conclusions before reading the code :) |
|
I had previously discussed doing a clean-room implementation of the trellis quants based on the QTIP paper, and that's what @pwilkin was referring to, but for this PR I opted to copy / port IK's quantization code directly, as it would be nice to have interoperability for the model quants. My Kimi-K2.5 vision PR for llama.cpp has been mostly lifted and shifted into ik_llama (ikawrakow/ik_llama.cpp#1280) as well, and there's a pretty robust history of PRs sharing code between these two projects. It looks like the CI runner isn't happy anyway, so there's some more work I'll need to do to clean those up later today. I appreciate everyone who has weighed in on this issue and hope this effort can continue forward. |
The Trellis used in |
|
Yep, I initially wanted to just include the IQ_KS quants here but stumbled into including the IQ_K's too due to some llama_ftype and other llama-quantize dependencies. I probably should have backed out the IQ_KS quants and focused on just the IQ_K quants for this first round, but I already had the inference code for IQ_KS working. In the spirit of not going too overboard, I also skipped the AVX-based implementations, but I'd love to dig into those since there are some substantial perf gains to be had there. The IQ_KT quants would be a future project, and yes, any of @ubergarm's quants using KT wouldn't work here. Thanks for the insight about the Trellis differences too; I hadn't reviewed the ik_llama implementation for those, but forewarned is forearmed! As a dev with a server mostly consisting of RAM instead of VRAM, I appreciate the CPU-based performance considerations you've developed. |
|
Removed the KS quants from this PR (will add them in a later PR) to simplify review, and fixed a const issue with IQ6_K. These are the results from some quantization and llama-perplexity testing on Qwen3-4B-Instruct-2507: ik_llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity
The IQ2_K KLD looks a little high in comparison to the others, and when I used llama.cpp to quantize the model and compared that against the ik_llama.cpp logits, it does tell me something is still fishy: llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity
|
|
Forgot to check the quantizations in llama.cpp model, ik_llama.cpp logits, llama.cpp llama-perplexity
|
|
@AesSedai @ikawrakow Looks like IQ4_K is slightly broken on big endian: |
|
@CISC Yeah, I saw that failed CI run. I got the Windows compilation issues fixed, but I'll need to do some investigation into the s390x / big-endian quantization later today after work. After doing some more digging last night, I think the IKQ computation path would be needed to get the KLD closer, but that is a LOT of code to add and way beyond the scope I'm trying to attempt here in this PR. I did a bit of toying around with ik_llama.cpp to disable the internal conversion with
@ggerganov what are your thoughts about this PR, outside of that failing CI that I need to fix? Johannes has stated that he won't consider reviewing or merging this until you weigh in on it, and IK has said that he's fine with the PR in its current form. I understand if you don't want to merge this in, but I'd like to ask you to kindly consider it. |
|
To clarify my position: regardless of what Georgi's position is, I would still be opposed to merging any of Iwan's code, due to the following points:
|
|
@JohannesGaessler Thank you for bringing this latest discussion to my attention. @AesSedai I apologize, but I will have to close this PR. Thank you for your effort. |


This PR is an initial effort at porting @ikawrakow's IQ*_K quants from ik_llama.cpp to mainline llama.cpp. Attribution has been provided for the quantization code, and if additional attribution work is required please let me know.
This branch implements the CPU backend for the following quantization types:
Models quantized with these types from ik_llama.cpp should load correctly with this PR, and llama.cpp can now be used to produce those quantizations.
Since this implementation currently covers the CPU backend only, do not expect excellent performance. Further backends (CUDA, Vulkan, etc.) will be added in future PRs.
I made a few small quality-of-life tweaks to `test-quantize-fns` and it passes for these new quantization types. I have also done some initial KLD testing by quantizing a model with ik_llama.cpp, collecting logits for it, then loading that model in llama.cpp and comparing its KLD:
Don't expect the PPL and KLD to match up 100% due to backend differences and general noise; however, the results should be reasonable and within the ballpark.
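For anyone wanting to sanity-check these numbers, the KLD metric here is the mean per-token KL divergence between the reference model's logits and the quantized model's logits. Below is a minimal NumPy sketch of that metric; `mean_kld` is a hypothetical helper written for illustration, not the implementation llama-perplexity uses internally:

```python
import numpy as np

def mean_kld(ref_logits: np.ndarray, test_logits: np.ndarray) -> float:
    """Mean per-token D_KL(P || Q) from raw logits of shape (tokens, vocab).

    P comes from the reference (e.g. full-precision) model, Q from the
    quantized model under test. Illustrative sketch, not llama.cpp code.
    """
    def log_softmax(x: np.ndarray) -> np.ndarray:
        # Subtract the row max first for numerical stability.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    p = np.exp(log_p)
    # D_KL(P || Q) = sum_i p_i * (log p_i - log q_i), averaged over tokens.
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

The value is zero only when the two models produce identical token distributions, and it grows as the quantized model's predictions drift from the reference, which is why a noticeably higher KLD for one quant type (as seen with IQ2_K above) is a useful "something is fishy" signal even when PPL still looks plausible.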
Disclaimer: AI was used to translate much of the implementation from ik_llama.cpp, especially the quantizations in `ggml/src/ggml-quants.c` and `ggml/src/ggml-cpu/quants.c`. I will follow up with further KLD and PPL testing tomorrow to cover models quanted with each of the newly ported types.