
llama-quantize: --dry-run option#1309

Merged
ikawrakow merged 1 commit into main from ik/quantize_dry_run on Feb 24, 2026

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Feb 23, 2026

Prints the tensor types and resulting tensor sizes, but does not run the quantization, so it is very fast.

Useful for experimenting with --custom-q before running the actual quantization.

Enable with --dry-run, which, like all other optional llama-quantize arguments, needs to appear before the model name.
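(For illustration only: a --custom-q spec is a comma-separated list of regex=type pairs, as in the example further down. The Python sketch below shows how such a spec can be matched against tensor names; the helper names are hypothetical and first-match-wins is an assumption here, not a statement about the actual C++ implementation.)

```python
import re

def parse_custom_q(spec: str) -> list[tuple[re.Pattern, str]]:
    """Parse a comma-separated list of regex=type pairs (hypothetical helper)."""
    rules = []
    for entry in spec.split(","):
        pattern, _, qtype = entry.partition("=")
        rules.append((re.compile(pattern), qtype))
    return rules

def pick_type(tensor_name: str, rules: list[tuple[re.Pattern, str]], default: str) -> str:
    """Return the quant type of the first matching rule, else the default type."""
    for pattern, qtype in rules:
        if pattern.fullmatch(tensor_name):
            return qtype
    return default

rules = parse_custom_q(r"blk\..*\.ffn_down_exps\.weight=iq2_kt,output\.weight=iq6_k")
print(pick_type("blk.0.ffn_down_exps.weight", rules, "iq1_kt"))  # matches first rule
print(pick_type("blk.0.attn_q.weight", rules, "iq1_kt"))         # falls back to default
```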

@Nexesenex
Contributor

Nexesenex commented Feb 23, 2026

I was about to (try to) copyport it myself.
Thanks!

@ubergarm
Contributor

ubergarm commented Feb 23, 2026

Tested and confirmed this branch is working and does not create/overwrite any specified output file when using --dry-run.

Here is an example quantize command like I generally use; now with --dry-run I can confirm the final model size and tweak as desired to hit a target breakpoint with more confidence.

👈 Details
#!/usr/bin/env bash

custom="
# 60 Repeating Layers [0-59]

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
blk\..*\.ssm_ba\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

    #--exclude-weights ffn_gate_exps \
    #--exclude-weights ffn_up_exps \
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --dry-run \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/imatrix-Qwen3-Coder-Next-BF16.dat \
    /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-512x2.5B-BF16-00001-of-00004.gguf \
    /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-IQ1_KT.gguf \
    IQ1_KT \
    128
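The grep/sed pipeline in the script above flattens the multi-line spec into the single comma-separated string --custom-q expects. For readers less fluent in sed -Ez, here is an equivalent sketch in Python (same effect, assuming comment lines start with # and blank lines are dropped):

```python
def flatten_custom_q(text: str) -> str:
    """Mimic `grep -v '^#' | sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'`:
    drop comment lines, then join the remaining non-empty lines with commas."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return ",".join(lines)

custom = """
# Routed Experts Layers
blk\\..*\\.ffn_down_exps\\.weight=iq2_kt

blk\\..*\\.ffn_(gate|up)_exps\\.weight=iq1_kt
"""
print(flatten_custom_q(custom))
```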

It does not let you know if there will be errors due to imatrix-related stuff, but that is fine with me. For example, it seems the qwen3next models don't like very low quantization types, e.g. iq1_kt, iq2_kt, iq2_kl, iq1_s, at least for ffn_(gate|up)_exps (but iq2_ks is fine). This is not specific to this PR though; fwiw it seems to error out even when skipping imatrix data for those tensors, just noting the observation:

converting to q8_0 .. size =    16.00 MiB ->     8.50 MiB
[  11/ 843]           blk.0.ffn_down_exps.weight - [  512,  2048,   512,     1], type =   bf16, Using custom type iq2_kt for tensor blk.0.ffn_down_exps.weight
converting to iq2_kt .. size =  1024.00 MiB ->   140.00 MiB
[  12/ 843]           blk.0.ffn_gate_exps.weight - [ 2048,   512,   512,     1], type =   bf16, Using custom type iq1_kt for tensor blk.0.ffn_gate_exps.weight
converting to iq1_kt .. Oops: jbest = -1 for cluster 98 with 1160 points
/home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8334: GGML_ASSERT(false) failed
Oops: jbest = -1 for cluster 98 with 1160 points
/home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8334: GGML_ASSERT(false) failed
Oops: jbest = -1 for cluster 240 with 1024 points
.
.
.

Yes, thanks for porting this feature, and thanks to @ddh0 for ggml-org/llama.cpp#19526, a rough "vibe port" of which I was already using myself xD
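(Editorial aside, a back-of-the-envelope check on the size lines in the log above: the dry-run sizes are just element count times bits per weight. q8_0 stores each 32-weight block as one fp16 scale plus 32 int8 values, i.e. 34 bytes per 32 weights, or 8.5 bits per weight, which reproduces the 16.00 MiB -> 8.50 MiB line. The helper below is a sketch, not ik_llama.cpp code.)

```python
def quant_size_mib(n_elements: int, bits_per_weight: float) -> float:
    """Estimated tensor size in MiB at the given bits-per-weight."""
    return n_elements * bits_per_weight / 8 / (1024 * 1024)

# A 16.00 MiB bf16 tensor (16 bits per weight) holds 8 Mi weights
n = 16 * 1024 * 1024 // 2   # bytes / 2 bytes per bf16 weight
print(quant_size_mib(n, 16.0))  # 16.0 (bf16)
print(quant_size_mib(n, 8.5))   # 8.5  (q8_0: 34 bytes per 32-weight block)
```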

@ikawrakow
Owner Author

It does not let you know if there will be errors due to imatrix related stuff

It doesn't? It is supposed to. Commenting out the check in examples/quantize/quantize.cpp (so we can get past the imatrix police) and running

./bin/llama-quantize --dry-run $some_model junk iq2_xxs

I get

main: quantizing '../models/il32_3B/Llama-3.2-3B-Instruct-BF16.gguf' to 'mist.bin' as IQ2_XXS
llama_model_loader: loaded meta data with 31 key-value pairs and 255 tensors from ../models/il32_3B/Llama-3.2-3B-Instruct-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 32
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type bf16:  197 tensors
[   1/ 255]                    rope_freqs.weight - [   64,     1,     1,     1], type =    f32, size =    0.000 MB
[   2/ 255]                    token_embd.weight - [ 3072, 128256,     1,     1], type =   bf16, converting to iq4_k .. size =   751.50 MiB ->   211.36 MiB
[   3/ 255]               blk.0.attn_norm.weight - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[   4/ 255]                blk.0.ffn_down.weight - [ 8192,  3072,     1,     1], type =   bf16, converting to q2_K .. size =    48.00 MiB ->     7.88 MiB
[   5/ 255]                blk.0.ffn_gate.weight - [ 3072,  8192,     1,     1], type =   bf16, 

============================================================
Missing importance matrix for tensor blk.0.ffn_gate.weight in a very low-bit quantization
The result will be garbage, so bailing out
============================================================

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate.weight in a very low-bit quantization

so yes, it will abort if there are missing entries in the imatrix data.

@ikawrakow ikawrakow merged commit cfb6747 into main Feb 24, 2026
@ubergarm
Contributor

so yes, it will abort if there are missing entries in the imatrix data.

Ah yes, so it does. The error I'm getting is unrelated to imatrix then; I can open a separate issue for that. Thanks for confirming!

abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently portcopied --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor, or by groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during the tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

Also move to debug level the size of the unnamed buffer overrides that is displayed later.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display is cluttering the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
