Llama-quantize: Partial requant feature by Nexesenex · Pull Request #1313 · ikawrakow/ik_llama.cpp

Nexesenex · 2026-02-24T09:25:45Z

Inspired by the recently added --dry-run option for llama-quantize.

This PR makes it possible to partially requantize a split quantized .gguf by requantizing only the splits that are missing from the destination directory. (Useful for anyone who regularly changes the quantization of certain tensors to improve an overall quant strategy: just delete the splits you want to requantize from the destination directory.)

It works for GGUFs that are split tensor by tensor as well as for GGUFs split into groups of several tensors. (The latter has not been tested much beyond 2 tensors per split: I have been using directories of single-tensor GGUFs myself since @Thireus made his GGUF-Tool-Suite.)

It also adds automatic directory creation to both llama-quantize and gguf-split for the case where the destination directory of the quantization/split doesn't exist. (A long-missing feature ^^)
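As a rough sketch of what such directory creation can look like with `std::filesystem` (the helper name `ensure_parent_dir` is hypothetical and this is not necessarily how the PR implements it):

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: makes sure the directory that will contain out_path
// exists, creating intermediate directories as needed.
// Returns false if the directory could not be created.
static bool ensure_parent_dir(const std::string & out_path) {
    const std::filesystem::path dir = std::filesystem::path(out_path).parent_path();
    if (dir.empty()) {
        return true; // output goes to the current directory, nothing to create
    }
    std::error_code ec;
    // create_directories reports no error when the directory already exists
    std::filesystem::create_directories(dir, ec);
    return !ec && std::filesystem::is_directory(dir);
}
```

A caller would check the return value before writing any output and exit with an error message if it is false.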

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

ikawrakow · 2026-02-24T14:01:45Z

@Nexesenex

Precisely what do you mean by "copyported"? "Copyported" from where? Considering that I wrote most of the llama-quantize code in llama.cpp, do you seriously believe that I need to "copyport" somebody else's modifications?

examples/gguf-split/gguf-split.cpp

src/llama-quantize.cpp

ikawrakow

Please merge and resolve conflicts so I don't need to be reviewing the --dry-run changes along with the actual changes of the PR.

- Inspired by the recently portcopied --dry-run feature.
- Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
- Vibe coded.

Nexesenex · 2026-02-24T15:59:23Z

> Precisely what do you mean by "copyported"? "Copyported" from where? Considering that I wrote most of the llama-quantize code in llama.cpp, do you seriously believe that I need to "copyport" somebody else's modifications?

I apologize. I simply meant that I saw an equivalent PR on llama.cpp, and I had seen a recent conversation where the terms "copying" and "porting" were opposed. I merged both terms as I did in the initial thread, forgetting that this applied to what I would do, not to what you would do (rewriting it fully by yourself).

Now, you know very well that I don't believe that you NEED to "copyport" somebody else's modifications, considering that you indeed wrote most of the llama-quantize code of llama.cpp, and that I didn't imply that precise statement. But what goes without saying goes better by saying it. I apologize again for the misunderstanding born of my inadequate terminology; it bore no malice.

That said, the merge is done, and I'm currently correcting my PR according to your review.

* Partial Requant feature for llama-quantize
  - Inspired by the recently portcopied --dry-run feature.
  - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
  - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
  - Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
* cleanup

* Better estimate for max. number of compute nodes
* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)
* Fix tool call for Qwen3.5
  Loosely based on mainline changes from:
  * ggml-org/llama.cpp#19635
  * ggml-org/llama.cpp#19765
  Also need to change the grammar to allow the model to make multiple tool calls in a row. This was likely broken for Qwen3 Coder prior to this commit.
* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)
* WIP
* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)
* Make sure we pick the reduced tensor from the right GPU
* Minor

Minor delta-net tweak (ikawrakow#1308)
* Make sure we pick the reduced tensor from the right GPU
* Minor
* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their corresponding prompts. The index field was not being set on partial responses, final responses, or embedding responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)
* Partial Requant feature for llama-quantize
  - Inspired by the recently portcopied --dry-run feature.
  - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
  - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
  - Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
* cleanup

Display the size of the tensors overriden during the tensor loading (ikawrakow#1318)
* Display the size of the tensors overriden during the tensor loading
  Ex: `Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU` become `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`
  And pass in debug the later displayed size of the unnamed buffer overrides. Ex : `llm_load_tensors: CPU buffer size = XXX.XX MiB`
  That double display is cluttering the screen without being very informative.
* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
---------
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
  It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)
* Fix KT quantization yet again
* Add same 1e-16f check for all quants in iqk_uantize.cpp
* Fixes for k-quants
* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)
* server: enable checkpoint for recurrent models
  create checkpoint after cancel
  fix ban string and rm context during rewind
  add checkpoint interval
  only save recurrent cache
* save checkpoint during pp
---------
Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
  It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name
* Don't re-apply L2 norm - it has already been done
* This seems quite a bit better
* More tweaks
* Restore per context buffer size log
  Not everybody uses models split in 2000 parts, and those who do, actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench

Nexesenex changed the title ~~Llama-quantize: Partial requant~~ Llama-quantize: Partial requant feature Feb 24, 2026

Nexesenex force-pushed the partial_requant branch from 6e504b4 to f5b4bae Compare February 24, 2026 09:55

ikawrakow reviewed Feb 24, 2026

examples/gguf-split/gguf-split.cpp (resolved)

ikawrakow reviewed Feb 24, 2026

examples/gguf-split/gguf-split.cpp (outdated, resolved)

ikawrakow reviewed Feb 24, 2026

src/llama-quantize.cpp (resolved)

ikawrakow reviewed Feb 24, 2026

Nexesenex added 3 commits February 24, 2026 16:04

Create output directory if it doesn't exist in llama-quantize

6bb7bd2

Create output directory if it doesn't exist in gguf-split

3136e22

Nexesenex force-pushed the partial_requant branch from f5b4bae to 3136e22 Compare February 24, 2026 15:05

Add exit when directory fails to be created on Windows

f08b89c

Nexesenex added 2 commits February 24, 2026 19:23

Use std::filesystem

81d3f16

cleanup

5464c1a

ikawrakow approved these changes Feb 25, 2026


ikawrakow merged commit 170467e into ikawrakow:main Feb 25, 2026

Nexesenex deleted the partial_requant branch February 25, 2026 06:32

Nexesenex restored the partial_requant branch February 27, 2026 19:10


ikawrakow merged 6 commits into ikawrakow:main from Nexesenex:partial_requant
