IQ4_KSS improvements #642
Conversation
| So, I'll disappear tomorrow for 2 weeks. Do I merge this before I go? | 
| YOLO! (you only live once 🤣)
I have not tested yet, but at a quick glance the code changes don't
affect non-IQ4_KSS quants. As there aren't any of those quants released of
which I know — yeah, merge it and we can sort it out later lol!
Unrelated: I have not opened an issue, but I was hitting a segfault in
llama-quantize with an IQ3_KT trellis quant, so I have not released it. Recipe here:
https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq2_kt-todo
Also unrelated: this IQ2_KL quantizes fine, but it crashes with asserts towards the end of startup:
https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq2_kl-169597-gib-3034-bpw
Compiled CPU-only, on that big dual-socket EPYC.
Sorry, I'm not at home today, so no proper logs.
Finally finally, feel free to ignore all this and have a great couple of
weeks!!! 😋 Catch you later! | 
| When you get a chance, post the assert that the  | 
| 
 Noooooo! Not urgent, but did you have the chance to look into the issue where imatrix data for  | 
| 
 Ha, I looked into it, then searched for the thread where we were talking about it, didn't find it, and then forgot. I'm actually not sure what happens in the Kimi runs. imatrix works fine when I test with a smaller model with the same attention architecture (DeepSeek-Lite). I tested with a GGUF created specifically for  So, in short, just try running without  | 
| Hope you get some sleep before your travels! Besides we can just use Qwen3-Coder now to fix everything right? 🤣 I'll open proper issues for these if I can't figure it out. Zero rush or priority here as I've not released these two models giving me troubles. Just got a laptop with some WiFi and can give a quick log: 
 EDIT Here is the issue: #649
IQ2_KL assert run and log:
model=/mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf
numactl -N 1 -m 1 \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct \
    --ctx-size 196608 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ub 4096 -b 4096 \
    --parallel 3 \
    --threads 128 \
    --threads-batch 192 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap
INFO [                    main] build info | tid="127586578487488" timestamp=1753302334 build=3821 commit="1b052109"
INFO [                    main] system info | tid="127586578487488" timestamp=1753302334 n_threads=128 n_threads_batch=192 total_threads=768 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 41 key-value pairs and 747 tensors from /mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 Coder 480B A35B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 480B-A35B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv   8:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   9:                       qwen3moe.block_count u32              = 62
llama_model_loader: - kv  10:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  11:                  qwen3moe.embedding_length u32              = 6144
llama_model_loader: - kv  12:               qwen3moe.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:              qwen3moe.attention.head_count u32              = 96
llama_model_loader: - kv  14:           qwen3moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  16:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  18:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  19:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  20:                          general.file_type u32              = 155
llama_model_loader: - kv  21:                      qwen3moe.expert_count u32              = 160
llama_model_loader: - kv  22:        qwen3moe.expert_feed_forward_length u32              = 2560
llama_model_loader: - kv  23: qwen3moe.expert_shared_feed_forward_length u32              = 0
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% macro render_item_list(item_list, ...
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/Qwen3-Coder...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 497
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 840
llama_model_loader: - kv  38:                                   split.no u16              = 0
llama_model_loader: - kv  39:                                split.count u16              = 4
llama_model_loader: - kv  40:                        split.tensors.count i32              = 747
llama_model_loader: - type  f32:  311 tensors
llama_model_loader: - type q8_0:  124 tensors
llama_model_loader: - type iq3_k:   62 tensors
llama_model_loader: - type iq4_k:    1 tensors
llama_model_loader: - type iq6_k:  125 tensors
llama_model_loader: - type iq2_kl:  124 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 62
llm_load_print_meta: n_head           = 96
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 12
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 160
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 262144
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ2_KL - 2.6875 bpw
llm_load_print_meta: model params     = 480.155 B
llm_load_print_meta: model size       = 169.597 GiB (3.034 BPW) 
llm_load_print_meta: repeating layers = 168.388 GiB (3.024 BPW, 478.288 B parameters)
llm_load_print_meta: general.name     = Qwen3 Coder 480B A35B Instruct
llm_load_print_meta: BOS token        = 11 ','
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 2560
llm_load_tensors: ggml ctx size =    0.33 MiB
llm_load_tensors:        CPU buffer size = 173666.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 196608
llama_new_context_with_model: n_batch    = 4096
llama_new_context_with_model: n_ubatch   = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size = 25296.00 MiB
llama_new_context_with_model: KV self size  = 25296.00 MiB, K (q8_0): 12648.00 MiB, V (q8_0): 12648.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.32 MiB
llama_new_context_with_model:        CPU compute buffer size =  5184.05 MiB
llama_new_context_with_model: graph nodes  = 2424
llama_new_context_with_model: graph splits = 1
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 4140403 is a zombie - the process has already terminated
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
./myscripts/api-server-Qwen3-Coder-480B-A35B-Instruct.sh: line 34: 4140403 Aborted                 (core dumped) numactl -N 1 -m 1 ./build/bin/llama-server --model "$model" --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct --ctx-size 196608 -ctk q8_0 -ctv q8_0 -fa -fmoe -ub 4096 -b 4096 --parallel 3 --threads 128 --threads-batch 192 --numa numactl --host 127.0.0.1 --port 8080 --no-mmap
 EDIT here is that issue with debug logs: #650
Yeah, I'll give full logs in its own issue later; it could just be this hardware possibly, as it throws an error in 
segfault quantizing iq3_kt
$ sudo dmesg -T --follow
[Wed Jul 23 16:36:14 2025] llama-quantize[4140724]: segfault at 7dd4d780a9d0 ip 00007eb9b81c634f sp 00007fff3c7bfd40 error 4 in libggml.so[9c634f,7eb9b7815000+9be000] likely on CPU 195 (core 3, socket 1)
[Wed Jul 23 16:36:14 2025] Code: ca 0f 87 80 fe ff ff c5 e8 57 d2 c5 f8 28 c2 e9 7f fe ff ff 8b bd 20 ff ff ff 8b b5 24 ff ff ff 8d 14 fd 00 00 00 00 48 63 d2 <c5> fa 10 04 90 48 8d 14 95 04 00 00 00 c5 fa 11 03 c5 fa 10 04 10
#!/usr/bin/env bash
# Repeating Layers [0-61]
custom="
# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
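# For reference: the grep/sed pipeline above strips the comment lines and joins
# the remaining rules with commas, so --custom-q receives one string roughly
# like this (sketch of the expanded value, derived from the recipe above):
#   blk\..*\.attn_q.*=iq4_kt,blk\..*\.attn_k.*=iq4_kt,blk\..*\.attn_v.*=iq4_kt,blk\..*\.attn_output.*=iq4_kt,blk\..*\.ffn_down_exps\.weight=iq3_kt,blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt,token_embd\.weight=iq4_kt,output\.weight=iq6_k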
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
    IQ2_KT \
    192
main: build = 3823 (fd711836)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf' to '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf' as IQ2_KT using 192 threads
llama_model_loader: additional 20 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 37 key-value pairs and 747 tensors from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 Coder 480B A35B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 480B-A35B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv   8:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   9:                       qwen3moe.block_count u32              = 62
llama_model_loader: - kv  10:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  11:                  qwen3moe.embedding_length u32              = 6144
llama_model_loader: - kv  12:               qwen3moe.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:              qwen3moe.attention.head_count u32              = 96
llama_model_loader: - kv  14:           qwen3moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  16:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  18:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  19:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  20:                          general.file_type u32              = 32
llama_model_loader: - kv  21:                      qwen3moe.expert_count u32              = 160
llama_model_loader: - kv  22:        qwen3moe.expert_feed_forward_length u32              = 2560
llama_model_loader: - kv  23: qwen3moe.expert_shared_feed_forward_length u32              = 0
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% macro render_item_list(item_list, ...
llama_model_loader: - kv  34:                                   split.no u16              = 0
llama_model_loader: - kv  35:                                split.count u16              = 21
llama_model_loader: - kv  36:                        split.tensors.count i32              = 747
llama_model_loader: - type  f32:  311 tensors
llama_model_loader: - type bf16:  436 tensors
================================ Have weights data with 497 entries
[   1/ 747]                    token_embd.weight - [ 6144, 151936,     1,     1], type =   bf16, Using custom type iq4_kt for tensor token_embd.weight
====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to iq4_kt .. Adding custom rule blk\..*\.attn_q.* -> iq4_kt
Adding custom rule blk\..*\.attn_k.* -> iq4_kt
Adding custom rule blk\..*\.attn_v.* -> iq4_kt
Adding custom rule blk\..*\.attn_output.* -> iq4_kt
Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq3_kt
Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq2_kt
Adding custom rule token_embd\.weight -> iq4_kt
Adding custom rule output\.weight -> iq6_k
load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
load_imatrix: loaded 497 importance matrix entries from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat computed on 840 chunks
prepare_imatrix: have 497 importance matrix entries
size =  1780.50 MiB ->   445.70 MiB
[   2/ 747]             blk.0.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[   3/ 747]                  blk.0.attn_k.weight - [ 6144,  1024,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_k.weight
converting to iq4_kt .. cluster_points: Oops. Cluster 4 has no points:  0 1 0 0
cluster_points: 1 out of 625 clusters dir not have any points
size =    12.00 MiB ->     3.00 MiB
[   4/ 747]             blk.0.attn_output.weight - [12288,  6144,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_output.weight
converting to iq4_kt .. size =   144.00 MiB ->    36.02 MiB
[   5/ 747]             blk.0.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[   6/ 747]                  blk.0.attn_q.weight - [ 6144, 12288,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_q.weight
converting to iq4_kt .. size =   144.00 MiB ->    36.05 MiB
[   7/ 747]                  blk.0.attn_v.weight - [ 6144,  1024,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_v.weight
converting to iq4_kt .. size =    12.00 MiB ->     3.00 MiB
[   8/ 747]               blk.0.attn_norm.weight - [ 6144,     1,     1,     1], type =    f32, size =    0.023 MB
[   9/ 747]           blk.0.ffn_down_exps.weight - [ 2560,  6144,   160,     1], type =   bf16, Using custom type iq3_kt for tensor blk.0.ffn_down_exps.weight
converting to iq3_kt .. ./myscripts/quantize-Qwen3-Coder-480B-A35B-Instruct-v08.sh: line 33: 2323451 Segmentation fault      (core dumped) numactl -N 0 -m 0 ./build/bin/llama-quantize --custom-q "$custom" --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf IQ2_KT 192
I can open a 3rd issue for the MLA stuff and put all the notes in one place along with ik's above comments, and we can work together to figure out what is going on. Thanks! EDIT Here is that issue now: #651 | 
| 
 @ikawrakow: And now that it has CUDA MMQ, I will use it! Thanks for completing it! And have a great time off! | 
| 
 Thank you for the detailed explanation! Since I rely on @ubergarm's imatrix due to hardware limitations (no pressure as well), I won't be able to verify this on my end right now. You'll be back in two weeks anyway (have a great time!). 
 You seem like someone who would really appreciate Termux. Apologies for the poor internet, seems we're all on vacation/away 😅 termux.mp4
 That sounds really nice! Thanks | 
|   The IQ4_KSS is looking like a pretty good spot for ubergarm/Qwen3-235B-A22B-Thinking-2507 | 
| I used Qwen3-Coder-480B-A35B-Instruct-IQ5_K to vibe code up some new matplotlib software and actually fix up my Y-axis log scale to be more similar to how I've seen some of ik's plots. The IQ4_KSS recipes seem quite strong. They differ slightly from each other; exact recipes are in the links below.
 UPDATE And just finished up the bigger Qwen3-Coder-480B-A35B-Instruct-GGUF IQ4_KSS (*note that the IQ2_K here is using iq2_kl for ffn_down_exps instead of the larger iq3_k, so it is right in line with what an IQ2_KS would be for size and PPL). | 
| 
 From ikawrakow: 
 If you have the chance, could you please compare IQ4_KSS to IQ4_KT in PPL and in TG/PP speed? | 
| 
 Hrmm, good idea. I'm already comparing Q4_0, IQ4_KSS, and IQ3_KT with some llama-sweep-bench runs and getting interesting results. I'm cooking a basically "pure" IQ4_KT now to compare, which will be slightly smaller than the IQ4_KSS, which has a few juiced layers and slightly boosted attn tensors. Just a teaser: it looks like TG performance is faster across the board on the Vulkan backend with CUDA 12.9  | 
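For reference, a minimal llama-sweep-bench comparison along the lines mentioned above could look like this (a sketch only; the model filenames are placeholders, and the flags are assumed to carry over from the server invocation earlier in the thread rather than verified against the tool):

#!/usr/bin/env bash
# Hypothetical sketch: run llama-sweep-bench on two quants and compare PP/TG.
# Model filenames below are placeholders, not released files.
for m in model-IQ4_KSS.gguf model-IQ4_KT.gguf; do
    ./build/bin/llama-sweep-bench \
        --model "$m" \
        -c 8192 -ub 2048 -b 2048 \
        -fa -fmoe \
        --threads 128 | tee "sweep-${m%.gguf}.log"
done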
| Thanks! Well, that's weird. Shouldn't IQ4_KT be higher quality thanks to trellis quantization? Oh, nvm, I see that the rest of the tensors are different between the two models you tested. Would you have the time to compare an IQ4_KT with the same "juiced" layers as the IQ4_KSS, for fairness? Maybe this new mix could have a better PPL for its size compared to the current IQ4_KSS? Damn, that made me remember an old idea where we could treat quant mixes as an optimization problem and try to brute-force our way to the lowest PPL for a given size, iteratively. Wait, isn't this what Unsloth brands as dynamic quants 2.0? But you beat them most of the time with your mixes, lmao? Or is it because they simply use quants from mainline, I assume? Also, how much do you gain with  | 
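A rough sketch of that brute-force idea, for illustration only (hypothetical script: the binaries follow the commands used elsewhere in this thread, but paths, the candidate types, and the base recipe are placeholders, and a real search space would be far larger):

#!/usr/bin/env bash
# Hypothetical sketch: treat the quant mix as an optimization problem.
# Try a few candidate types for the routed-expert gate/up tensors, quantize,
# measure PPL, then keep the mix with the lowest PPL that fits the size budget.
for t in iq4_kss iq4_ks iq4_kt; do
    custom="blk\..*\.ffn_down_exps\.weight=iq4_ks,blk\..*\.ffn_(gate|up)_exps\.weight=${t}"
    out="candidate-${t}.gguf"
    ./build/bin/llama-quantize --custom-q "$custom" --imatrix imatrix.dat \
        model-BF16.gguf "$out" IQ4_KSS 192
    ./build/bin/llama-perplexity -m "$out" -f wiki.test.raw -fa | tee "ppl-${t}.log"
done
# Compare file sizes (du -h candidate-*.gguf) against the final PPL in each log
# to pick the best size/quality trade-off.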
| 
 That is indeed the next question. I almost did it twice but had to take a dinner break lol, but now it's on its way - EDIT see the next comment for the graph including this data:
👈 The ~4BPW Quant Data
  {
    "name": "IQ4_KSS",
    "ppl": "7.3861 +/- 0.05128",
    "size": 15.531,
    "bpw": 4.370,
    "legend": "ubergarm",
    "comment": "iq6_k k|v, iq5_k q|o, juiced attn layers 0, iq4_ks down, iq4_kss gate|up, juiced ffn layers 0|47, iq4_k/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
  },
  {
    "name": "juiced-IQ4_KT",
    "ppl": "7.4226 +/- 0.05154",
    "size": 15.244,
    "bpw": 4.289,
    "legend": "ubergarm",
    "comment": "iq6_k k|v, iq5_k q|o, juiced attn layers 0, iq4_kt down, iq4_kt gate|up, juiced ffn layers 0|47, iq4_kt/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
  },
  {
    "name": "IQ4_KT",
    "ppl": "7.5020 +/- 0.05230",
    "size": 14.438,
    "bpw": 4.062,
    "legend": "ubergarm",
    "comment": "mostly pure iq4_kt iq4_kt/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
  },
 Haha, yeah this is what they put on their modelcard: 
 Here is what I put on my modelcard: 
 Maybe both can be true if you don't consider me as providing "leading quants"! 😹 I'm not convinced the way Unsloth has decided to vary tensor quantization types across layers gives particularly better performance (speed or perplexity). I think it's a balance of trade-offs between: 
 Usually you can make a pretty good quant with a decent balance by a combination of: 
 
 It is a fun little multi-variable human gradient descent hobby! 😹 Regarding the second half of your question, Unsloth expressed some possible interest in releasing Unsloth ik_llama.cpp quants in another post on this repo in the past. And, yes, it probably would help them push the Pareto curve ever downward as well by using these new quants. 
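As a rough illustration of that kind of mix tuning (a hypothetical sketch, not a recipe used for any released quant here): the sorted ffn importances from the --layer-similarity output below could be turned into --custom-q rules that "juice" the least-similar layers, e.g.:

#!/usr/bin/env bash
# Hypothetical sketch: bump ("juice") the ffn tensors of the least-similar
# layers (0 and 47 per the sorted ffn importances below) to a larger type,
# leaving everything else at the base ~4 bpw recipe. Assumes the first
# matching --custom-q rule wins; the exact types here are illustrative only.
custom="
blk\.(0|47)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
"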
 I haven't fully explored it, e.g. by automating some kind of 
👈 --layer-similarity for Qwen3-30B-A3B-Thinking-2507
 ======================== sorted layer importances
  0: Layer  47, <cos_sim> = 0.297816
  1: Layer   0, <cos_sim> = 0.305244
  2: Layer   1, <cos_sim> = 0.709352
  3: Layer  28, <cos_sim> = 0.830869
  4: Layer   2, <cos_sim> = 0.844787
  5: Layer   7, <cos_sim> = 0.861447
  6: Layer  29, <cos_sim> = 0.864968
  7: Layer   3, <cos_sim> = 0.880728
  8: Layer   8, <cos_sim> = 0.892042
  9: Layer   6, <cos_sim> = 0.905458
 10: Layer   5, <cos_sim> = 0.90886
 11: Layer  42, <cos_sim> = 0.914703
 12: Layer   4, <cos_sim> = 0.915015
 13: Layer  17, <cos_sim> = 0.91581
 14: Layer  13, <cos_sim> = 0.921882
 15: Layer  46, <cos_sim> = 0.926183
 16: Layer  45, <cos_sim> = 0.932304
 17: Layer  19, <cos_sim> = 0.936483
 18: Layer  18, <cos_sim> = 0.937157
 19: Layer  31, <cos_sim> = 0.940826
 20: Layer  14, <cos_sim> = 0.942221
 21: Layer  40, <cos_sim> = 0.944539
 22: Layer   9, <cos_sim> = 0.94595
 23: Layer  10, <cos_sim> = 0.94767
 24: Layer  25, <cos_sim> = 0.948227
 25: Layer  11, <cos_sim> = 0.94864
 26: Layer  32, <cos_sim> = 0.948681
 27: Layer  37, <cos_sim> = 0.949749
 28: Layer  41, <cos_sim> = 0.951289
 29: Layer  39, <cos_sim> = 0.952341
 30: Layer  12, <cos_sim> = 0.953235
 31: Layer  44, <cos_sim> = 0.953276
 32: Layer  16, <cos_sim> = 0.95375
 33: Layer  20, <cos_sim> = 0.954073
 34: Layer  38, <cos_sim> = 0.954789
 35: Layer  22, <cos_sim> = 0.955904
 36: Layer  15, <cos_sim> = 0.956555
 37: Layer  21, <cos_sim> = 0.956733
 38: Layer  23, <cos_sim> = 0.957164
 39: Layer  43, <cos_sim> = 0.958506
 40: Layer  30, <cos_sim> = 0.958633
 41: Layer  27, <cos_sim> = 0.959653
 42: Layer  24, <cos_sim> = 0.960708
 43: Layer  36, <cos_sim> = 0.964712
 44: Layer  26, <cos_sim> = 0.964958
 45: Layer  35, <cos_sim> = 0.965977
 46: Layer  34, <cos_sim> = 0.968197
 47: Layer  33, <cos_sim> = 0.972509
======================== sorted attention importances
  0: Layer   0, <cos_sim> = 0.373726
  1: Layer  45, <cos_sim> = 0.621582
  2: Layer   1, <cos_sim> = 0.668392
  3: Layer  29, <cos_sim> = 0.675207
  4: Layer  17, <cos_sim> = 0.704994
  5: Layer  21, <cos_sim> = 0.708088
  6: Layer   3, <cos_sim> = 0.712065
  7: Layer  44, <cos_sim> = 0.719689
  8: Layer  22, <cos_sim> = 0.726337
  9: Layer  42, <cos_sim> = 0.728414
 10: Layer  23, <cos_sim> = 0.734638
 11: Layer  18, <cos_sim> = 0.734929
 12: Layer  24, <cos_sim> = 0.735911
 13: Layer   8, <cos_sim> = 0.73788
 14: Layer  33, <cos_sim> = 0.741519
 15: Layer  27, <cos_sim> = 0.742112
 16: Layer  46, <cos_sim> = 0.742959
 17: Layer  30, <cos_sim> = 0.745445
 18: Layer  34, <cos_sim> = 0.746015
 19: Layer  47, <cos_sim> = 0.746472
 20: Layer   9, <cos_sim> = 0.746761
 21: Layer   6, <cos_sim> = 0.748994
 22: Layer  20, <cos_sim> = 0.752889
 23: Layer   2, <cos_sim> = 0.753263
 24: Layer  41, <cos_sim> = 0.754112
 25: Layer  25, <cos_sim> = 0.755797
 26: Layer  26, <cos_sim> = 0.755917
 27: Layer  28, <cos_sim> = 0.75632
 28: Layer  43, <cos_sim> = 0.757009
 29: Layer  35, <cos_sim> = 0.758833
 30: Layer   4, <cos_sim> = 0.75965
 31: Layer  10, <cos_sim> = 0.766588
 32: Layer  36, <cos_sim> = 0.768189
 33: Layer  19, <cos_sim> = 0.768958
 34: Layer  32, <cos_sim> = 0.769336
 35: Layer  11, <cos_sim> = 0.771553
 36: Layer  31, <cos_sim> = 0.781223
 37: Layer  16, <cos_sim> = 0.785931
 38: Layer   7, <cos_sim> = 0.786268
 39: Layer  15, <cos_sim> = 0.787708
 40: Layer   5, <cos_sim> = 0.790609
 41: Layer  12, <cos_sim> = 0.791013
 42: Layer  37, <cos_sim> = 0.792411
 43: Layer  14, <cos_sim> = 0.794113
 44: Layer  39, <cos_sim> = 0.794925
 45: Layer  38, <cos_sim> = 0.795931
 46: Layer  40, <cos_sim> = 0.799352
 47: Layer  13, <cos_sim> = 0.802178
======================== sorted ffn importances
  0: Layer  47, <cos_sim> = 0.533469
  1: Layer  44, <cos_sim> = 0.622946
  2: Layer   0, <cos_sim> = 0.643964
  3: Layer  28, <cos_sim> = 0.67538
  4: Layer   7, <cos_sim> = 0.684103
  5: Layer  16, <cos_sim> = 0.69021
  6: Layer  21, <cos_sim> = 0.703409
  7: Layer  43, <cos_sim> = 0.703716
  8: Layer  20, <cos_sim> = 0.703982
  9: Layer   1, <cos_sim> = 0.709765
 10: Layer  45, <cos_sim> = 0.711489
 11: Layer  46, <cos_sim> = 0.715068
 12: Layer  33, <cos_sim> = 0.721819
 13: Layer  19, <cos_sim> = 0.725088
 14: Layer  22, <cos_sim> = 0.72533
 15: Layer  32, <cos_sim> = 0.730856
 16: Layer   3, <cos_sim> = 0.731085
 17: Layer   8, <cos_sim> = 0.731686
 18: Layer   9, <cos_sim> = 0.736359
 19: Layer  23, <cos_sim> = 0.736744
 20: Layer   2, <cos_sim> = 0.737244
 21: Layer  31, <cos_sim> = 0.739362
 22: Layer  24, <cos_sim> = 0.743266
 23: Layer  34, <cos_sim> = 0.743324
 24: Layer  41, <cos_sim> = 0.744927
 25: Layer  40, <cos_sim> = 0.749878
 26: Layer  10, <cos_sim> = 0.75342
 27: Layer  26, <cos_sim> = 0.753776
 28: Layer  27, <cos_sim> = 0.758283
 29: Layer  17, <cos_sim> = 0.759731
 30: Layer  35, <cos_sim> = 0.763794
 31: Layer  18, <cos_sim> = 0.765849
 32: Layer   6, <cos_sim> = 0.766675
 33: Layer  42, <cos_sim> = 0.767223
 34: Layer  36, <cos_sim> = 0.767253
 35: Layer  29, <cos_sim> = 0.767677
 36: Layer   4, <cos_sim> = 0.770757
 37: Layer  25, <cos_sim> = 0.771877
 38: Layer  30, <cos_sim> = 0.778096
 39: Layer  12, <cos_sim> = 0.784316
 40: Layer   5, <cos_sim> = 0.785474
 41: Layer  15, <cos_sim> = 0.787438
 42: Layer  11, <cos_sim> = 0.790912
 43: Layer  39, <cos_sim> = 0.79183
 44: Layer  14, <cos_sim> = 0.795523
 45: Layer  38, <cos_sim> = 0.79796
 46: Layer  13, <cos_sim> = 0.816884
 47: Layer  37, <cos_sim> = 0.819399 | 
| 
 Thanks for the insights, I will definitely use some of these points later.
 I was daily driving EXL2 and <= 120b models before all those MoEs came out; now it's impossible to go back 😂 (it's also way more fun to run something you aren't supposed to run on your hardware). Waiting for TP in EXL3 before trying again... (or maybe? #627) 
 Looks like a full-time job, tbh. Thanks again for these results | 





Not much is known about IQ4_KSS, and nobody seems to be using it. So, I decided to give it some attention.
Quick reminder (for more, see #89): IQ4_KSS uses exactly 4.0 bpw, just like IQ4_KT.
[The comparison table of IQ4_KSS vs IQ4_KT (before/after this PR) did not survive extraction; only the column labels remain.]
This PR uses Q8_K_R8 for fast CPU GEMM.
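To try the reworked IQ4_KSS, a minimal quantize invocation would look something like this (a sketch following the commands used earlier in the thread; the input model, output path, imatrix file, and thread count are placeholders):

#!/usr/bin/env bash
# Minimal sketch: produce an IQ4_KSS quant with an imatrix (paths are placeholders).
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    model-BF16.gguf \
    model-IQ4_KSS.gguf \
    IQ4_KSS 192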