GLM-5 support #1268

Merged
ikawrakow merged 1 commit into main from ik/glm5 on Feb 15, 2026

Conversation

@ikawrakow
Owner

As in the mainline PR: no DSA and no MTP.

It just reuses the DeepSeek2 arch.

Closes #1265

ikawrakow marked this pull request as ready for review on February 14, 2026 07:15
@Ph0rk0z

Ph0rk0z commented Feb 14, 2026

How well does it run? It's quite the fat boy.

@ubergarm
Contributor

ubergarm commented Feb 14, 2026

Loaded and running llama-perplexity on the full BF16 GGUF now; I will compare to mainline results. If it looks good, I'll cook an imatrix and try to release a heavily quantized smol-IQ1_KT, which might be the only thing some rigs can fit for testing.

👈 Details
model=/mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf
numactl --interleave=all \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa distribute \
    --threads 160 \
    --threads-batch 192 \
    --validate-quants \
    --no-mmap

SOCKET is set to: 0
main: build = 4193 (9d9c6261)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed  = 1337
CPU: using device CPU - 0 MiB free
llama_model_loader: additional 32 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 58 key-value pairs and 1809 tensors from /mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm-dsa
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   3:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   4:                               general.name str              = GLM 5
llama_model_loader: - kv   5:                            general.version str              = 5
llama_model_loader: - kv   6:                           general.basename str              = GLM
llama_model_loader: - kv   7:                         general.size_label str              = 256x22B
llama_model_loader: - kv   8:                            general.license str              = mit
llama_model_loader: - kv   9:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  10:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv  11:                        glm-dsa.block_count u32              = 79
llama_model_loader: - kv  12:                     glm-dsa.context_length u32              = 202752
llama_model_loader: - kv  13:                   glm-dsa.embedding_length u32              = 6144
llama_model_loader: - kv  14:                glm-dsa.feed_forward_length u32              = 12288
llama_model_loader: - kv  15:               glm-dsa.attention.head_count u32              = 64
llama_model_loader: - kv  16:            glm-dsa.attention.head_count_kv u32              = 1
llama_model_loader: - kv  17:                     glm-dsa.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  18:   glm-dsa.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  glm-dsa.expert_used_count u32              = 8
llama_model_loader: - kv  20:                 glm-dsa.expert_group_count u32              = 1
llama_model_loader: - kv  21:            glm-dsa.expert_group_used_count u32              = 1
llama_model_loader: - kv  22:                 glm-dsa.expert_gating_func u32              = 2
llama_model_loader: - kv  23:               glm-dsa.attention.key_length u32              = 576
llama_model_loader: - kv  24:             glm-dsa.attention.value_length u32              = 512
llama_model_loader: - kv  25:                          general.file_type u32              = 32
llama_model_loader: - kv  26:          glm-dsa.leading_dense_block_count u32              = 3
llama_model_loader: - kv  27:                         glm-dsa.vocab_size u32              = 154880
llama_model_loader: - kv  28:              glm-dsa.attention.q_lora_rank u32              = 2048
llama_model_loader: - kv  29:             glm-dsa.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  30:           glm-dsa.attention.key_length_mla u32              = 256
llama_model_loader: - kv  31:         glm-dsa.attention.value_length_mla u32              = 256
llama_model_loader: - kv  32:         glm-dsa.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  33:                       glm-dsa.expert_count u32              = 256
llama_model_loader: - kv  34:                glm-dsa.expert_shared_count u32              = 1
llama_model_loader: - kv  35:               glm-dsa.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  36:                glm-dsa.expert_weights_norm bool             = true
llama_model_loader: - kv  37:               glm-dsa.rope.dimension_count u32              = 64
llama_model_loader: - kv  38:               glm-dsa.nextn_predict_layers u32              = 1
llama_model_loader: - kv  39:       glm-dsa.attention.indexer.head_count u32              = 32
llama_model_loader: - kv  40:       glm-dsa.attention.indexer.key_length u32              = 128
llama_model_loader: - kv  41:            glm-dsa.attention.indexer.top_k u32              = 2048
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - kv  43:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  44:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  45:                      tokenizer.ggml.tokens arr[str,154880]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  46:                  tokenizer.ggml.token_type arr[i32,154880]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  47:                      tokenizer.ggml.merges arr[str,321649]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  48:                tokenizer.ggml.eos_token_id u32              = 154820
llama_model_loader: - kv  49:            tokenizer.ggml.padding_token_id u32              = 154820
llama_model_loader: - kv  50:                tokenizer.ggml.bos_token_id u32              = 154822
llama_model_loader: - kv  51:                tokenizer.ggml.eot_token_id u32              = 154827
llama_model_loader: - kv  52:            tokenizer.ggml.unknown_token_id u32              = 154820
llama_model_loader: - kv  53:                tokenizer.ggml.eom_token_id u32              = 154829
llama_model_loader: - kv  54:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  55:                                   split.no u16              = 0
llama_model_loader: - kv  56:                                split.count u16              = 33
llama_model_loader: - kv  57:                        split.tensors.count i32              = 1809
llama_model_loader: - type  f32:  630 tensors
llama_model_loader: - type bf16: 1179 tensors
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 154820 ('<|endoftext|>')
load:   - 154827 ('<|user|>')
load:   - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = glm-dsa
llm_load_print_meta: n_ctx_train      = 202752
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 79
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 16384
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 202752
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 744B.A40B
llm_load_print_meta: model ftype      = BF16
llm_load_print_meta: model params     = 753.864 B
llm_load_print_meta: model size       = 1404.406 GiB (16.003 BPW) 
llm_load_print_meta: repeating layers = 1400.861 GiB (16.003 BPW, 751.961 B parameters)
llm_load_print_meta: general.name     = GLM 5
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 2048
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.0000
print_info: vocab type       = BPE
print_info: n_vocab          = 154880
print_info: n_merges         = 321649
print_info: BOS token        = 154822 '[gMASK]'
print_info: EOS token        = 154820 '<|endoftext|>'
print_info: EOT token        = 154827 '<|user|>'
print_info: EOM token        = 154829 '<|observation|>'
print_info: UNK token        = 154820 '<|endoftext|>'
print_info: PAD token        = 154820 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 154838 '<|code_prefix|>'
print_info: FIM SUF token    = 154840 '<|code_suffix|>'
print_info: FIM MID token    = 154839 '<|code_middle|>'
print_info: EOG token        = 154820 '<|endoftext|>'
print_info: EOG token        = 154827 '<|user|>'
print_info: EOG token        = 154829 '<|observation|>'
print_info: max token length = 1024
llm_load_tensors: ggml ctx size =    0.72 MiB
model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.attn_q_b.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 7077888 bytes) -- ignoring
model has unused tensor blk.78.attn_output.weight (size = 201326592 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.proj.weight (size = 393216 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_k.weight (size = 1572864 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_q_b.weight (size = 16777216 bytes) -- ignoring
model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
model has unused tensor blk.78.exp_probs_b.bias (size = 1024 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.nextn.eh_proj.weight (size = 150994944 bytes) -- ignoring
model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/80 layers to GPU
llm_load_tensors:        CPU buffer size = 1419153.34 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 79 wkv_b tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
Computed blk.0.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
[... identical lines for blk.1 through blk.77 ...]
Computed blk.78.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_batch       = 4096
llama_init_from_model: n_ubatch      = 4096
llama_init_from_model: flash_attn    = 1
llama_init_from_model: mla_attn      = 3
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 0
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =   351.00 MiB
llama_init_from_model: KV self size  =  351.00 MiB, c^KV (f16):  351.00 MiB, kv^T: not used
llama_init_from_model:        CPU  output buffer size =     4.73 MiB
llama_init_from_model:        CPU compute buffer size =  2516.00 MiB
llama_init_from_model: graph nodes  = 4322
llama_init_from_model: graph splits = 1
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

system_info: n_threads = 160 (n_threads_batch = 192) / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 806.999 ms
perplexity: calculating perplexity over 565 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 30.39 seconds per pass - ETA 35.77 minutes
======================================= HAVE_FANCY_SIMD is defined
[1]1.2323,[2]1.9955,[3]1.7660,[4]1.5587,[5]1.4515,[6]1.3944,[7]1.3624,[8]1.3339,[9]1.3250,[10]1.3046,[11]1.2953,[12]1.3352,[13]1.3332,[14]1.3939,[15]1.4936,[16]1.5983,[17]1.7117,[18]1.8544,[19]1.8493,[20]1.8449,[21]1.9193,[22]1.9402,[23]1.9294,[24]1.9136,[25]1.9017,[26]1.9003,[27]1.9125,[28]1.9398,[29]1.9552,[30]2.0111,[31]2.0660,[32]2.1039,[33]2.1476,[34]2.1767,[35]2.2220,[36]2.2603,[37]2.2864,[38]2.3722,[39]2.4110,[40]2.4555,[41]2.5189,[42]2.5115,[43]2.5278,[44]2.5554,[45]2.6261,[46]2.6770,[47]2.6392,[48]2.6017,[49]2.5755,[50]2.5630,[51]2.5790,[52]2.6066,[53]2.6430,[54]2.6685,[55]2.6946,[56]2.7218,[57]2.7202,[58]2.7435,[59]2.7584,[60]2.7930,[61]2.8250,[62]2.8721,[63]2.9093,[64]2.9362,[65]2.9533,[66]2.9495,[67]2.9246,[68]2.9075,[69]2.9296,[70]2.9142,[71]2.8983,[72]2.8982,[73]2.9047,[74]2.9300,[75]2.9327,[76]2.8994,[77]2.8663,[78]2.8372,[79]2.8100,[80]2.7875,[81]2.7640,[82]2.7522,[83]2.7503,[84]2.7261,[85]2.7136,[86]2.7047,[87]2.6949,[88]2.6780,[89]2.6583,[90]2.6464,[91]2.6281,[92]2.6078,[93]2.5970,[94]2.5820,[95]2.5686,[96]2.5604,[97]2.5682,[98]2.5607,[99]2.5472,[100]2.5284,[101]2.5370,[102]2.5224,[103]2.5127,[104]2.5063,[105]2.5162,[106]2.5392,[107]2.5887,[108]2.5989,[109]2.6081,[110]2.6419,[111]2.6646,[112]2.6445,[113]2.6324,[114]2.6316,[115]2.6293,[116]2.6349,[117]2.6363,[118]2.6405,[119]2.6452,[120]2.6418,[121]2.6332,[122]2.6365,[123]2.6242,[124]2.6237,[125]2.6251,[126]2.6255,[127]2.6244,[128]2.6382,[129]2.6443,[130]2.6434,[131]2.6554,[132]2.6550,[133]2.6534,[134]2.6673,[135]2.6846,[136]2.6782,[137]2.6739,[138]2.6700,[139]2.6586,[140]2.6720,[141]2.6739,[142]2.6658,[143]2.6646,[144]2.6655,[145]2.6647,[146]2.6597,[147]2.6476,[148]2.6422,[149]2.6384,[150]2.6354,[151]2.6285,[152]2.6277,[153]2.6313,[154]2.6310,[155]2.6315,[156]2.6354,[157]2.6379,[158]2.6401,[159]2.6507,[160]2.6597,[161]2.6656,[162]2.6557,[163]2.6441,[164]2.6472,[165]2.6382,[166]2.6356,[167]2.6481,[168]2.6484,[169]2.6716,[170]2.6868,[171]2.6973,[172]2.7149,[173]2.7066,[174]2.6949,[175]2.6827,[176]2.6
718,[177]2.6590,[178]2.6456,[179]2.6350,[180]2.6234,[181]2.6184,[182]2.6319,[183]2.6488,[184]2.6725,[185]2.6898,[186]2.7002,[187]2.7187,[188]2.7413,[189]2.7625,[190]2.7781,[191]2.7931,[192]2.8025,[193]2.8092,[194]2.8138,[195]2.8119,[196]2.8142,[197]2.8272,[198]2.8416,[199]2.8416,[200]2.8485,[201]2.8510,[202]2.8543,[203]2.8531,[204]2.8615,[205]2.8697,[206]2.8765,[207]2.8833,[208]2.8839,[209]2.8869,[210]2.8829,[211]2.8873,[212]2.8886,[213]2.8929,[214]2.8979,[215]2.9013,[216]2.9057,[217]2.9101,[218]2.9178,[219]2.9130,[220]2.9120,[221]2.9100,[222]2.9126,[223]2.9124,[224]2.9189,[225]2.9212,[226]2.9275,[227]2.9253,[228]2.9249,[229]2.9162,[230]2.9085,[231]2.9046,[232]2.9049,[233]2.9034,[234]2.8968,[235]2.8866,[236]2.8795,[237]2.8716,[238]2.8745,[239]2.8886,[240]2.9030,[241]2.9150,[242]2.9255,[243]2.9374,[244]2.9500,[245]2.9637,[246]2.9746,[247]2.9879,[248]2.9986,[249]3.0000,[250]3.0002,[251]2.9900,[252]2.9813,[253]2.9741,[254]2.9710,[255]2.9732,[256]2.9725,[257]2.9681,[258]2.9668,[259]2.9578,[260]2.9518,[261]2.9452,[262]2.9399,[263]2.9340,[264]2.9298,[265]2.9257,[266]2.9231,[267]2.9162,[268]2.9101,[269]2.9062,[270]2.9048,[271]2.9026,[272]2.8979,[273]2.8954,[274]2.8875,[275]2.8806,[276]2.8701,[277]2.8626,[278]2.8535,[279]2.8548,[280]2.8580,[281]2.8616,[282]2.8667,[283]2.8718,[284]2.8733,[285]2.8746,[286]2.8822,[287]2.8928,[288]2.8942,[289]2.8954,[290]2.9000,[291]2.9027,[292]2.8988,[293]2.8906,[294]2.8852,[295]2.8832,[296]2.8771,[297]2.8728,[298]2.8685,[299]2.8644,[300]2.8638,[301]2.8628,[302]2.8594,[303]2.8567,[304]2.8528,[305]2.8469,[306]2.8424,[307]2.8447,[308]2.8507,[309]2.8619,[310]2.8531,[311]2.8476,[312]2.8407,[313]2.8372,[314]2.8331,[315]2.8319,[316]2.8295,[317]2.8284,[318]2.8279,[319]2.8250,[320]2.8231,[321]2.8253,[322]2.8260,[323]2.8204,[324]2.8168,[325]2.8152,[326]2.8128,[327]2.8148,[328]2.8134,[329]2.8135,[330]2.8127,[331]2.8089,[332]2.8107,[333]2.8133,[334]2.8163,[335]2.8164,[336]2.8174,[337]2.8187,[338]2.8192,[339]2.8192,[340]2.8218,[341]2.8244,[342]2.8261,[343
]2.8316,[344]2.8364,[345]2.8451,[346]2.8449,[347]2.8382,[348]2.8320,[349]2.8272,[350]2.8213,[351]2.8150,[352]2.8126,[353]2.8092,[354]2.8035,[355]2.7984,[356]2.7946,[357]2.7895,[358]2.7845,[359]2.7837,[360]2.7790,[361]2.7730,[362]2.7670,[363]2.7619,[364]2.7601,[365]2.7560,[366]2.7535,[367]2.7487,[368]2.7430,[369]2.7386,[370]2.7365,[371]2.7324,[372]2.7324,[373]2.7316,[374]2.7334,[375]2.7305,[376]2.7268,[377]2.7233,[378]2.7215,[379]2.7225,[380]2.7178,[381]2.7150,[382]2.7119,[383]2.7146,[384]2.7210,[385]2.7258,[386]2.7335,[387]2.7382,[388]2.7439,[389]2.7517,[390]2.7543,[391]2.7476,[392]2.7421,[393]2.7359,[394]2.7353,[395]2.7298,[396]2.7253,[397]2.7194,[398]2.7130,[399]2.7079,[400]2.7023,[401]2.6961,[402]2.6906,[403]2.6844,[404]2.6780,[405]2.6727,[406]2.6663,[407]2.6604,[408]2.6543,[409]2.6497,[410]2.6440,[411]2.6389,[412]2.6351,[413]2.6317,[414]2.6300,[415]2.6274,[416]2.6250,[417]2.6201,[418]2.6147,[419]2.6202,[420]2.6160,[421]2.6141,[422]2.6161,[423]2.6136,[424]2.6093,[425]2.6059,[426]2.6036,[427]2.6019,[428]2.5990,[429]2.5947,[430]2.5914,[431]2.5926,[432]2.5890,[433]2.5852,[434]2.5821,[435]2.5788,[436]2.5740,[437]2.5688,[438]2.5648,[439]2.5640,[440]2.5608,[441]2.5589,[442]2.5548,[443]2.5603,[444]2.5678,[445]2.5659,[446]2.5650,[447]2.5673,[448]2.5690,[449]2.5750,[450]2.5764,[451]2.5785,[452]2.5825,[453]2.5899,[454]2.5952,[455]2.5982,[456]2.6037,[457]2.6025,[458]2.6064,[459]2.6089,[460]2.6154,[461]2.6214,[462]2.6247,[463]2.6249,[464]2.6238,[465]2.6232,[466]2.6280,[467]2.6276,[468]2.6250,[469]2.6305,[470]2.6321,[471]2.6348,[472]2.6381,[473]2.6400,[474]2.6417,[475]2.6438,[476]2.6465,[477]2.6500,[478]2.6525,[479]2.6550,[480]2.6571,[481]2.6606,[482]2.6627,[483]2.6658,[484]2.6632,[485]2.6676,[486]2.6698,[487]2.6760,[488]2.6812,[489]2.6869,[490]2.6865,[491]2.6921,[492]2.6970,[493]2.7009,[494]2.7056,[495]2.7110,[496]2.7111,[497]2.7126,[498]2.7146,[499]2.7171,[500]2.7204,[501]2.7217,[502]2.7235,[503]2.7285,[504]2.7339,[505]2.7348,[506]2.7351,[507]2.7368,[508]2.7405,[509]2.7463,
[510]2.7493,[511]2.7540,[512]2.7488,[513]2.7442,[514]2.7394,[515]2.7362,[516]2.7334,[517]2.7309,[518]2.7274,[519]2.7232,[520]2.7214,[521]2.7181,[522]2.7142,[523]2.7111,[524]2.7137,[525]2.7111,[526]2.7077,[527]2.7077,[528]2.7056,[529]2.7020,[530]2.6989,[531]2.6965,[532]2.6953,[533]2.6930,[534]2.6921,[535]2.6900,[536]2.6881,[537]2.6835,[538]2.6798,[539]2.6760,[540]2.6756,[541]2.6753,[542]2.6732,[543]2.6716,[544]2.6713,[545]2.6695,[546]2.6693,[547]2.6663,[548]2.6641,[549]2.6614,[550]2.6572,[551]2.6526,[552]2.6489,[553]2.6450,[554]2.6413,[555]2.6371,[556]2.6337,[557]2.6295,[558]2.6293,[559]2.6263,[560]2.6254,[561]2.6261,[562]2.6265,[563]2.6294,[564]2.6314,[565]2.6298,
llama_print_timings:        load time =  568847.84 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 2025714.03 ms / 289280 tokens (    7.00 ms per token,   142.80 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 2036661.57 ms / 289281 tokens

Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6298 +/- 0.01396
  • ik: Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6298 +/- 0.01396
  • mainline: Final estimate: PPL = 2.6301 +/- 0.01396
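
The two final estimates are statistically indistinguishable; a quick check of how the gap compares to the reported standard error:

```python
# Compare ik_llama.cpp vs mainline llama.cpp perplexity on wiki.test.raw (565 chunks).
ik_ppl = 2.6298
mainline_ppl = 2.6301
stderr = 0.01396  # the +/- reported by both runs

diff = abs(mainline_ppl - ik_ppl)
# ~0.02 standard errors apart: well within noise.
print(f"diff = {diff:.4f} ({diff / stderr:.3f} sigma)")
```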

Okay, I'll start imatrix on the BF16 and try to get out at least one quant today for easier further testing.

compute_imatrix: 17.51 seconds per pass - ETA 3 hours 57.20 minutes

lol, gonna take a while, since I've disabled all the fused ops and am using -mla 1 to get importance data on all tensors
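
As a cross-check on the log, the 351.00 MiB KV buffer follows directly from the printed metadata: with MLA only the compressed c^KV (kv_lora_rank = 512 plus 64 RoPE dims) is cached per token per layer, and block 78, the unused NextN/MTP layer, is skipped. A sketch under those assumptions:

```python
# MLA KV cache size for the llama-perplexity run above (f16 cache, no kv^T stored).
n_ctx = 4096                 # perplexity batch context (-ub 4096)
n_layers = 79 - 1            # 79 blocks, minus the unused NextN layer (blk.78)
kv_lora_rank = 512           # glm-dsa.attention.kv_lora_rank
n_rot = 64                   # glm-dsa.rope.dimension_count (RoPE part of the key)
bytes_per_elem = 2           # f16

kv_bytes = n_ctx * n_layers * (kv_lora_rank + n_rot) * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**20:.2f} MiB")  # 351.00 MiB, matching the log
```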

@ubergarm
Contributor

Okay, looking good on cpu-only in my testing. I commented on the Issue that some GGUFs are coming in now: https://huggingface.co/ubergarm/GLM-5-GGUF

@gapeleon
Contributor

UD-Q2_K_XL is working for me with six RTX 3090s plus most of the model on the CPU.

@sayap
Contributor

sayap commented Feb 15, 2026

Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using -ncmoe 74 to keep most FFN tensors on the CPU, I notice that VRAM usage on the single 3090 starts lower and stays flat with -mla 1, compared to with -mla 3. PP and TG are about the same.

@magikRUKKOLA

smol-IQ1_KT, fully offloaded on an RTX 3090.

#1265 (comment)

Everything is working.

@ikawrakow
Owner Author

@sayap

Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using -ncmoe 74 to keep most FFN tensors on the CPU, I notice that VRAM usage on the single 3090 starts lower and stays flat with -mla 1, compared to with -mla 3. PP and TG are about the same.

You see the difference between mla=3 and mla=1 in PP at long context (TG is exactly the same). For fully offloaded models the difference is ~2X at a context of 100k tokens. In hybrid inference, where PP is dominated (or at least heavily influenced) by the time it takes to offload the MoE tensors to the GPU, the difference will of course be much smaller, and you may not even notice it unless you go well beyond a 10k-token context.

But yes, mla=3 uses more VRAM. You can mitigate this to some extent by adding -amb 512 to your command line.

@ikawrakow
Owner Author

Thank you everybody for testing, I'll merge it.

ikawrakow merged commit 528cadb into main on Feb 15, 2026
@ubergarm
Contributor

It seems like a smart model, capable of local opencode vibe coding, but with A40B it feels noticeably slower than, say, Kimi-K2.5's A32B.

sweep-bench-GLM-5-IQ2_KL

I could probably be running it better, but it seemed like all the kv-cache went onto a single GPU. It locked up when I tried to start it with --n-cpu-moe 55, or perhaps I just wasn't patient enough. The dual-GPU rig was run with -ctk f16, and it seemed about 0.5 tok/sec faster later when I tried -ctk q8_0, as shown for the CPU rig... anyway,
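
The "A40B feels slower than A32B" impression matches a naive memory-bandwidth-bound estimate: at comparable quantization, token generation streams ~25% more active weights per token. A toy sketch, where the bytes-per-param average is an assumption for a ~2.7 BPW quant like IQ2_KL, not a measurement:

```python
# Naive bandwidth-bound ceiling for token generation:
# t/s <= memory bandwidth / active bytes read per token.
bandwidth_gb_s = 267.74  # the EPYC rig's mlc figure quoted in the details
bytes_per_param = 0.34   # ASSUMPTION: ~2.7 bits/weight quant

def tg_ceiling(active_params_b: float) -> float:
    """Upper bound on tokens/s if every active weight streams from RAM once per token."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

glm5_tps = tg_ceiling(40.0)  # GLM-5: ~40B active params per token
kimi_tps = tg_ceiling(32.0)  # Kimi-K2.5: ~32B active params per token
print(f"GLM-5 ceiling ~{glm5_tps:.1f} t/s, Kimi-K2.5 ~{kimi_tps:.1f} t/s, "
      f"ratio {kimi_tps / glm5_tps:.2f}x")
```

The measured 15.61 t/s at zero context on the EPYC rig sits plausibly below this ceiling, and the ratio is just 40/32 = 1.25x regardless of the assumed constants.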

👈 Details

AMD Thread Ripper Pro (Zen 4) 7965WX 24-core, 8x32GiB DDR5@4800 (221.41 GB/s via mlc) + dual RTX A6000 (48GB VRAM each)
Driver: 580.105.08, CUDA: 13.0, P2P: OK, NCCL found!

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 131072 \
    -ctk q8_0 \
    -ger \
    --merge-qkv \
    -mla 3 -amb 2048 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    --n-predict 64
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4096 | 64 | 0 | 15.145 | 270.45 | 8.085 | 7.92 |
| 4096 | 64 | 4096 | 17.705 | 231.35 | 9.548 | 6.70 |
| 4096 | 64 | 8192 | 20.260 | 202.18 | 10.670 | 6.00 |
| 4096 | 64 | 12288 | 22.753 | 180.02 | 11.735 | 5.45 |
| 4096 | 64 | 16384 | 25.351 | 161.57 | 12.930 | 4.95 |
| 4096 | 64 | 20480 | 27.924 | 146.68 | 14.074 | 4.55 |
| 4096 | 64 | 24576 | 30.351 | 134.95 | 15.228 | 4.20 |
| 4096 | 64 | 28672 | 33.014 | 124.07 | 16.263 | 3.94 |
| 4096 | 64 | 32768 | 35.612 | 115.02 | 17.470 | 3.66 |
| 4096 | 64 | 36864 | 38.213 | 107.19 | 18.544 | 3.45 |
| 4096 | 64 | 40960 | 40.919 | 100.10 | 19.672 | 3.25 |
| 4096 | 64 | 45056 | 43.957 | 93.18 | 21.018 | 3.04 |
| 4096 | 64 | 49152 | 46.091 | 88.87 | 22.050 | 2.90 |
| 4096 | 64 | 53248 | 48.716 | 84.08 | 23.253 | 2.75 |
| 4096 | 64 | 57344 | 51.570 | 79.43 | 24.529 | 2.61 |
| 4096 | 64 | 61440 | 54.003 | 75.85 | 25.641 | 2.50 |
| 4096 | 64 | 65536 | 56.906 | 71.98 | 26.761 | 2.39 |
| 4096 | 64 | 69632 | 59.505 | 68.83 | 28.128 | 2.28 |
| 4096 | 64 | 73728 | 62.009 | 66.05 | 29.170 | 2.19 |
| 4096 | 64 | 77824 | 64.813 | 63.20 | 30.358 | 2.11 |
| 4096 | 64 | 81920 | 67.217 | 60.94 | 32.022 | 2.00 |
| 4096 | 64 | 86016 | 70.160 | 58.38 | 33.060 | 1.94 |
| 4096 | 64 | 90112 | 72.723 | 56.32 | 34.276 | 1.87 |
| 4096 | 64 | 94208 | 75.524 | 54.23 | 35.663 | 1.79 |
| 4096 | 64 | 98304 | 78.030 | 52.49 | 36.861 | 1.74 |
| 4096 | 64 | 102400 | 80.761 | 50.72 | 38.035 | 1.68 |
| 4096 | 64 | 106496 | 83.356 | 49.14 | 39.519 | 1.62 |
| 4096 | 64 | 110592 | 85.876 | 47.70 | 40.791 | 1.57 |

AMD EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s NPS1 Single Socket (267.74 GB/s via mlc)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 131072 \
    -ctk q8_0 \
    -ger \
    --merge-qkv \
    -mla 3 \
    --threads 92 \
    --threads-batch 128 \
    -ub 4096 -b 4096 \
    --no-mmap \
    --numa numactl \
    --warmup-batch \
    --n-predict 64
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |     64 |      0 |   22.133 |   185.06 |    4.099 |    15.61 |
|  4096 |     64 |   4096 |   25.907 |   158.10 |    4.679 |    13.68 |
|  4096 |     64 |   8192 |   31.938 |   128.25 |    5.082 |    12.59 |
|  4096 |     64 |  12288 |   36.653 |   111.75 |    6.243 |    10.25 |
|  4096 |     64 |  16384 |   47.737 |    85.80 |    6.502 |     9.84 |
|  4096 |     64 |  20480 |   53.877 |    76.03 |    6.829 |     9.37 |
|  4096 |     64 |  24576 |   58.889 |    69.55 |    6.879 |     9.30 |
|  4096 |     64 |  28672 |   63.065 |    64.95 |    7.222 |     8.86 |
|  4096 |     64 |  32768 |   73.853 |    55.46 |    7.427 |     8.62 |
|  4096 |     64 |  36864 |   80.356 |    50.97 |    7.667 |     8.35 |
|  4096 |     64 |  40960 |   88.664 |    46.20 |    8.059 |     7.94 |
|  4096 |     64 |  45056 |   98.211 |    41.71 |    8.133 |     7.87 |
|  4096 |     64 |  49152 |   97.049 |    42.21 |    8.461 |     7.56 |
|  4096 |     64 |  53248 |  103.175 |    39.70 |    8.738 |     7.32 |
|  4096 |     64 |  57344 |  109.571 |    37.38 |    8.872 |     7.21 |
|  4096 |     64 |  61440 |  110.894 |    36.94 |    9.309 |     6.88 |
|  4096 |     64 |  65536 |  120.752 |    33.92 |    9.396 |     6.81 |
|  4096 |     64 |  69632 |  127.047 |    32.24 |    9.666 |     6.62 |
|  4096 |     64 |  73728 |  132.885 |    30.82 |   10.106 |     6.33 |
|  4096 |     64 |  77824 |  134.909 |    30.36 |   10.103 |     6.33 |
|  4096 |     64 |  81920 |  140.021 |    29.25 |   10.503 |     6.09 |
|  4096 |     64 |  86016 |  147.248 |    27.82 |   10.741 |     5.96 |
|  4096 |     64 |  90112 |  151.927 |    26.96 |   10.951 |     5.84 |
|  4096 |     64 |  94208 |  160.152 |    25.58 |   11.260 |     5.68 |
|  4096 |     64 |  98304 |  161.398 |    25.38 |   11.430 |     5.60 |
|  4096 |     64 | 102400 |  171.083 |    23.94 |   11.781 |     5.43 |
|  4096 |     64 | 106496 |  173.032 |    23.67 |   12.085 |     5.30 |
|  4096 |     64 | 110592 |  179.535 |    22.81 |   12.229 |     5.23 |
|  4096 |     64 | 114688 |  188.600 |    21.72 |   12.684 |     5.05 |
|  4096 |     64 | 118784 |  192.993 |    21.22 |   12.799 |     5.00 |
|  4096 |     64 | 122880 |  200.297 |    20.45 |   13.056 |     4.90 |
|  4096 |     64 | 126976 |  208.264 |    19.67 |   13.319 |     4.81 |

Off topic: I had to tease Ling-2.5-1T about their A63B, haha...

@magikRUKKOLA

magikRUKKOLA commented Feb 16, 2026

@ubergarm

Off topic I had to tease Ling-2.5-1T about their A63B haha...

Yeah, the only way out of it is 24-channel DDR5 (2x12-channel EPYC) ...

[EDIT]: or, perhaps 16x3090 haha
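For reference, the peak numbers behind that remark follow from channels × 8 bytes/transfer × transfer rate; real-world bandwidth (the mlc figures quoted in this thread) comes in well below peak. All configurations below are just the ones mentioned in this thread.

```python
# Rough theoretical peak memory bandwidth for the configs discussed above.
# Measured bandwidth (e.g. via mlc) is typically well below these numbers.

def peak_bw_gbs(channels: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s: channels * 8 bytes per transfer * MT/s."""
    return channels * 8 * mt_per_s / 1e3

print(peak_bw_gbs(12, 6400))  # single-socket 12-channel EPYC:  614.4 GB/s
print(peak_bw_gbs(24, 6400))  # dual-socket 2x12 channels:     1228.8 GB/s
print(peak_bw_gbs(8, 4800))   # 8-channel DDR5-4800 TR Pro:     307.2 GB/s
```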

@magikRUKKOLA

magikRUKKOLA commented Feb 18, 2026

@ubergarm

Threadripper PRO 3975WX, DDR4 (8x32GiB) at 3000 MT/s (the "configured" frequency reported by dmidecode -t memory; possibly 3200 MT/s as set in the BIOS); 2x3090. IQ2_KL bench:

main: n_kv_max = 202752, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

 |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
 |-------|--------|--------|----------|----------|----------|----------|
 |  4096 |   1024 |      0 |   26.183 |   156.44 |  144.328 |     7.09 |
 |  4096 |   1024 |   4096 |   23.768 |   172.33 |  106.390 |     9.62 |
 |  4096 |   1024 |   8192 |   25.058 |   163.46 |  108.449 |     9.44 |
 |  4096 |   1024 |  12288 |   27.053 |   151.40 |  110.740 |     9.25 |
 |  4096 |   1024 |  16384 |   28.997 |   141.26 |  113.434 |     9.03 |
 |  4096 |   1024 |  20480 |   31.012 |   132.08 |  115.806 |     8.84 |
 |  4096 |   1024 |  24576 |   33.097 |   123.76 |  118.370 |     8.65 |
 |  4096 |   1024 |  28672 |   35.234 |   116.25 |  121.127 |     8.45 |
 |  4096 |   1024 |  32768 |   37.173 |   110.19 |  124.167 |     8.25 |
 |  4096 |   1024 |  36864 |   39.440 |   103.85 |  126.486 |     8.10 |
 |  4096 |   1024 |  40960 |   41.472 |    98.76 |  129.209 |     7.93 |
 |  4096 |   1024 |  45056 |   43.464 |    94.24 |  132.391 |     7.73 |
 |  4096 |   1024 |  49152 |   45.521 |    89.98 |  134.643 |     7.61 |
 |  4096 |   1024 |  53248 |   47.513 |    86.21 |  137.547 |     7.44 |
 |  4096 |   1024 |  57344 |   49.618 |    82.55 |  140.538 |     7.29 |
 |  4096 |   1024 |  61440 |   51.612 |    79.36 |  143.031 |     7.16 |
 |  4096 |   1024 |  65536 |   53.687 |    76.29 |  145.502 |     7.04 |
 |  4096 |   1024 |  69632 |   55.792 |    73.42 |  148.412 |     6.90 |

How come the decode is significantly better than yours?

[EDIT]:

Ah! I see! Someone recently asked me why I say that partial offload of MoE models with MLA results in a performance drop. Well, that's why.
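One plausible way partial expert offload can end up slower despite VRAM being much faster than RAM is fixed per-token cross-device overhead. A back-of-envelope sketch, with every number invented purely to show the shape of the effect:

```python
# Naively, decode is memory-bandwidth bound, so moving some expert weights
# into VRAM should help. But each token may also pay a fixed per-token
# synchronization/transfer overhead between CPU and GPUs, which can eat
# the gain. All numbers below are hypothetical.

def decode_tok_per_s(active_bytes, gpu_frac, ram_bw, vram_bw, sync_s=0.0):
    t = active_bytes * (1 - gpu_frac) / ram_bw  # experts streamed from RAM
    t += active_bytes * gpu_frac / vram_bw      # experts read from VRAM
    t += sync_s                                 # fixed per-token overhead
    return 1.0 / t

active, ram_bw, vram_bw = 12e9, 220e9, 900e9    # hypothetical sizes/speeds
print(decode_tok_per_s(active, 0.0, ram_bw, vram_bw))        # all experts on CPU
print(decode_tok_per_s(active, 0.3, ram_bw, vram_bw))        # ideal 30% offload: faster
print(decode_tok_per_s(active, 0.3, ram_bw, vram_bw, 0.02))  # + 20 ms overhead: slower
```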

I also ran the perplexity test with the q8_0 KV cache. It matches your results.

Final estimate: PPL over 565 chunks for n_ctx=512 = 3.0234 +/- 0.01654

(Yours is 3.0217.)

    -ot "blk\.(3|4|5|6|7|8|9|10|11|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \

So you have two beefy 48GB GPUs and thought you could boost decode by offloading some expert layers into the free VRAM of each GPU, right?

[EDIT2]: I see about zero difference in performance between the q8_0 and f16 KV caches.

[EDIT3]: -rtr seems to help a little:

main: n_kv_max = 202752, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   20.573 |   199.10 |  100.267 |    10.21 |
|  4096 |   1024 |   4096 |   22.431 |   182.60 |  103.879 |     9.86 |
|  4096 |   1024 |   8192 |   24.483 |   167.30 |  106.619 |     9.60 |
|  4096 |   1024 |  12288 |   26.539 |   154.34 |  109.323 |     9.37 |
|  4096 |   1024 |  16384 |   28.556 |   143.44 |  112.025 |     9.14 |
|  4096 |   1024 |  20480 |   30.678 |   133.52 |  114.591 |     8.94 |
|  4096 |   1024 |  24576 |   32.734 |   125.13 |  117.374 |     8.72 |
|  4096 |   1024 |  28672 |   34.924 |   117.28 |  120.091 |     8.53 |
|  4096 |   1024 |  32768 |   36.979 |   110.77 |  122.779 |     8.34 |
|  4096 |   1024 |  36864 |   39.076 |   104.82 |  125.548 |     8.16 |
|  4096 |   1024 |  40960 |   41.229 |    99.35 |  128.751 |     7.95 |
|  4096 |   1024 |  45056 |   43.322 |    94.55 |  131.759 |     7.77 |
|  4096 |   1024 |  49152 |   45.393 |    90.23 |  134.542 |     7.61 |
|  4096 |   1024 |  53248 |   47.450 |    86.32 |  138.740 |     7.38 |
|  4096 |   1024 |  57344 |   49.526 |    82.70 |  140.186 |     7.30 |
|  4096 |   1024 |  61440 |   51.546 |    79.46 |  143.028 |     7.16 |
|  4096 |   1024 |  65536 |   53.599 |    76.42 |  145.684 |     7.03 |
|  4096 |   1024 |  69632 |   55.642 |    73.61 |  147.907 |     6.92 |
|  4096 |   1024 |  73728 |   53.159 |    77.05 |  151.383 |     6.76 |
|  4096 |   1024 |  77824 |   54.653 |    74.95 |  153.551 |     6.67 |
|  4096 |   1024 |  81920 |   56.584 |    72.39 |  156.743 |     6.53 |
|  4096 |   1024 |  86016 |   58.340 |    70.21 |  159.711 |     6.41 |
|  4096 |   1024 |  90112 |   59.891 |    68.39 |  162.103 |     6.32 |
|  4096 |   1024 |  94208 |   61.802 |    66.28 |  165.163 |     6.20 |
|  4096 |   1024 |  98304 |   63.479 |    64.53 |  167.696 |     6.11 |
|  4096 |   1024 | 102400 |   65.416 |    62.61 |  169.621 |     6.04 |
|  4096 |   1024 | 106496 |   67.154 |    60.99 |  173.650 |     5.90 |
|  4096 |   1024 | 110592 |   68.601 |    59.71 |  175.442 |     5.84 |
|  4096 |   1024 | 114688 |   70.407 |    58.18 |  179.221 |     5.71 |
|  4096 |   1024 | 118784 |   72.442 |    56.54 |  181.863 |     5.63 |
|  4096 |   1024 | 122880 |   74.356 |    55.09 |  184.925 |     5.54 |
|  4096 |   1024 | 126976 |   75.837 |    54.01 |  187.058 |     5.47 |

@ubergarm ubergarm mentioned this pull request Feb 19, 2026
@InfernalDread

Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using -ncmoe 74 to keep most FFN tensors on the CPU, I notice that VRAM usage on the single 3090 starts lower and stays flat with -mla 1, compared to -mla 3. PP and TG are about the same.

Hello! Is it possible for you to share that quant on Hugging Face? I would really like to try it. Thank you!

@magikRUKKOLA

@ubergarm

Off topic: I had to tease Ling-2.5-1T about their A63B, haha...

Apparently it's a garbage LLM (see the test from xCreate on YouTube). Not worth investing time in.

@sayap
Contributor

sayap commented Feb 28, 2026

@InfernalDread I uploaded it to https://huggingface.co/sokann/GLM-5-GGUF-1.594bpw, no README yet 😅

@InfernalDread

@InfernalDread I uploaded it to https://huggingface.co/sokann/GLM-5-GGUF-1.594bpw, no README yet 😅

Thank you very much! Can't wait to try it out and see if this model can still pack a punch with this much quantization!

