GLM-5 support #1268

Merged
ikawrakow merged 1 commit into main from ik/glm5 on Feb 15, 2026

Conversation

@ikawrakow
Owner

As in the mainline PR: no DSA and no MTP.

It just reuses the DeepSeek2 arch.

Closes #1265

ikawrakow marked this pull request as ready for review on February 14, 2026 07:15
@Ph0rk0z

Ph0rk0z commented Feb 14, 2026

How well does it run? It's quite the fat boy.

@ubergarm
Contributor

ubergarm commented Feb 14, 2026

Loaded and running llama-perplexity on the full BF16 GGUF now; I will compare to mainline results. If it looks good, I'll cook an imatrix and try to release a heavily quantized smol-IQ1_KT, which might be the only thing some rigs can fit for testing.

👈 Details
model=/mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf
numactl --interleave=all \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa distribute \
    --threads 160 \
    --threads-batch 192 \
    --validate-quants \
    --no-mmap

SOCKET is set to: 0
main: build = 4193 (9d9c6261)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed  = 1337
CPU: using device CPU - 0 MiB free
llama_model_loader: additional 32 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 58 key-value pairs and 1809 tensors from /mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm-dsa
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   3:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   4:                               general.name str              = GLM 5
llama_model_loader: - kv   5:                            general.version str              = 5
llama_model_loader: - kv   6:                           general.basename str              = GLM
llama_model_loader: - kv   7:                         general.size_label str              = 256x22B
llama_model_loader: - kv   8:                            general.license str              = mit
llama_model_loader: - kv   9:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  10:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv  11:                        glm-dsa.block_count u32              = 79
llama_model_loader: - kv  12:                     glm-dsa.context_length u32              = 202752
llama_model_loader: - kv  13:                   glm-dsa.embedding_length u32              = 6144
llama_model_loader: - kv  14:                glm-dsa.feed_forward_length u32              = 12288
llama_model_loader: - kv  15:               glm-dsa.attention.head_count u32              = 64
llama_model_loader: - kv  16:            glm-dsa.attention.head_count_kv u32              = 1
llama_model_loader: - kv  17:                     glm-dsa.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  18:   glm-dsa.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  glm-dsa.expert_used_count u32              = 8
llama_model_loader: - kv  20:                 glm-dsa.expert_group_count u32              = 1
llama_model_loader: - kv  21:            glm-dsa.expert_group_used_count u32              = 1
llama_model_loader: - kv  22:                 glm-dsa.expert_gating_func u32              = 2
llama_model_loader: - kv  23:               glm-dsa.attention.key_length u32              = 576
llama_model_loader: - kv  24:             glm-dsa.attention.value_length u32              = 512
llama_model_loader: - kv  25:                          general.file_type u32              = 32
llama_model_loader: - kv  26:          glm-dsa.leading_dense_block_count u32              = 3
llama_model_loader: - kv  27:                         glm-dsa.vocab_size u32              = 154880
llama_model_loader: - kv  28:              glm-dsa.attention.q_lora_rank u32              = 2048
llama_model_loader: - kv  29:             glm-dsa.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  30:           glm-dsa.attention.key_length_mla u32              = 256
llama_model_loader: - kv  31:         glm-dsa.attention.value_length_mla u32              = 256
llama_model_loader: - kv  32:         glm-dsa.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  33:                       glm-dsa.expert_count u32              = 256
llama_model_loader: - kv  34:                glm-dsa.expert_shared_count u32              = 1
llama_model_loader: - kv  35:               glm-dsa.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  36:                glm-dsa.expert_weights_norm bool             = true
llama_model_loader: - kv  37:               glm-dsa.rope.dimension_count u32              = 64
llama_model_loader: - kv  38:               glm-dsa.nextn_predict_layers u32              = 1
llama_model_loader: - kv  39:       glm-dsa.attention.indexer.head_count u32              = 32
llama_model_loader: - kv  40:       glm-dsa.attention.indexer.key_length u32              = 128
llama_model_loader: - kv  41:            glm-dsa.attention.indexer.top_k u32              = 2048
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - kv  43:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  44:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  45:                      tokenizer.ggml.tokens arr[str,154880]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  46:                  tokenizer.ggml.token_type arr[i32,154880]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  47:                      tokenizer.ggml.merges arr[str,321649]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  48:                tokenizer.ggml.eos_token_id u32              = 154820
llama_model_loader: - kv  49:            tokenizer.ggml.padding_token_id u32              = 154820
llama_model_loader: - kv  50:                tokenizer.ggml.bos_token_id u32              = 154822
llama_model_loader: - kv  51:                tokenizer.ggml.eot_token_id u32              = 154827
llama_model_loader: - kv  52:            tokenizer.ggml.unknown_token_id u32              = 154820
llama_model_loader: - kv  53:                tokenizer.ggml.eom_token_id u32              = 154829
llama_model_loader: - kv  54:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  55:                                   split.no u16              = 0
llama_model_loader: - kv  56:                                split.count u16              = 33
llama_model_loader: - kv  57:                        split.tensors.count i32              = 1809
llama_model_loader: - type  f32:  630 tensors
llama_model_loader: - type bf16: 1179 tensors
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 154820 ('<|endoftext|>')
load:   - 154827 ('<|user|>')
load:   - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = glm-dsa
llm_load_print_meta: n_ctx_train      = 202752
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 79
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 16384
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 202752
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 744B.A40B
llm_load_print_meta: model ftype      = BF16
llm_load_print_meta: model params     = 753.864 B
llm_load_print_meta: model size       = 1404.406 GiB (16.003 BPW) 
llm_load_print_meta: repeating layers = 1400.861 GiB (16.003 BPW, 751.961 B parameters)
llm_load_print_meta: general.name     = GLM 5
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 2048
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.0000
print_info: vocab type       = BPE
print_info: n_vocab          = 154880
print_info: n_merges         = 321649
print_info: BOS token        = 154822 '[gMASK]'
print_info: EOS token        = 154820 '<|endoftext|>'
print_info: EOT token        = 154827 '<|user|>'
print_info: EOM token        = 154829 '<|observation|>'
print_info: UNK token        = 154820 '<|endoftext|>'
print_info: PAD token        = 154820 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 154838 '<|code_prefix|>'
print_info: FIM SUF token    = 154840 '<|code_suffix|>'
print_info: FIM MID token    = 154839 '<|code_middle|>'
print_info: EOG token        = 154820 '<|endoftext|>'
print_info: EOG token        = 154827 '<|user|>'
print_info: EOG token        = 154829 '<|observation|>'
print_info: max token length = 1024
llm_load_tensors: ggml ctx size =    0.72 MiB
model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.attn_q_b.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 7077888 bytes) -- ignoring
model has unused tensor blk.78.attn_output.weight (size = 201326592 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.proj.weight (size = 393216 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_k.weight (size = 1572864 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_q_b.weight (size = 16777216 bytes) -- ignoring
model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
model has unused tensor blk.78.exp_probs_b.bias (size = 1024 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.nextn.eh_proj.weight (size = 150994944 bytes) -- ignoring
model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/80 layers to GPU
llm_load_tensors:        CPU buffer size = 1419153.34 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 79 wkv_b tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
Computed blk.0.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
[... identical lines for blk.1 through blk.77 ...]
Computed blk.78.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_batch       = 4096
llama_init_from_model: n_ubatch      = 4096
llama_init_from_model: flash_attn    = 1
llama_init_from_model: mla_attn      = 3
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 0
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =   351.00 MiB
llama_init_from_model: KV self size  =  351.00 MiB, c^KV (f16):  351.00 MiB, kv^T: not used
llama_init_from_model:        CPU  output buffer size =     4.73 MiB
llama_init_from_model:        CPU compute buffer size =  2516.00 MiB
llama_init_from_model: graph nodes  = 4322
llama_init_from_model: graph splits = 1
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

system_info: n_threads = 160 (n_threads_batch = 192) / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 806.999 ms
perplexity: calculating perplexity over 565 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 30.39 seconds per pass - ETA 35.77 minutes
======================================= HAVE_FANCY_SIMD is defined
[1]1.2323,[2]1.9955,[3]1.7660,[4]1.5587,[5]1.4515,[6]1.3944,[7]1.3624,[8]1.3339,[9]1.3250,[10]1.3046,[11]1.2953,[12]1.3352,[13]1.3332,[14]1.3939,[15]1.4936,[16]1.5983,[17]1.7117,[18]1.8544,[19]1.8493,[20]1.8449,[21]1.9193,[22]1.9402,[23]1.9294,[24]1.9136,[25]1.9017,[26]1.9003,[27]1.9125,[28]1.9398,[29]1.9552,[30]2.0111,[31]2.0660,[32]2.1039,[33]2.1476,[34]2.1767,[35]2.2220,[36]2.2603,[37]2.2864,[38]2.3722,[39]2.4110,[40]2.4555,[41]2.5189,[42]2.5115,[43]2.5278,[44]2.5554,[45]2.6261,[46]2.6770,[47]2.6392,[48]2.6017,[49]2.5755,[50]2.5630,[51]2.5790,[52]2.6066,[53]2.6430,[54]2.6685,[55]2.6946,[56]2.7218,[57]2.7202,[58]2.7435,[59]2.7584,[60]2.7930,[61]2.8250,[62]2.8721,[63]2.9093,[64]2.9362,[65]2.9533,[66]2.9495,[67]2.9246,[68]2.9075,[69]2.9296,[70]2.9142,[71]2.8983,[72]2.8982,[73]2.9047,[74]2.9300,[75]2.9327,[76]2.8994,[77]2.8663,[78]2.8372,[79]2.8100,[80]2.7875,[81]2.7640,[82]2.7522,[83]2.7503,[84]2.7261,[85]2.7136,[86]2.7047,[87]2.6949,[88]2.6780,[89]2.6583,[90]2.6464,[91]2.6281,[92]2.6078,[93]2.5970,[94]2.5820,[95]2.5686,[96]2.5604,[97]2.5682,[98]2.5607,[99]2.5472,[100]2.5284,[101]2.5370,[102]2.5224,[103]2.5127,[104]2.5063,[105]2.5162,[106]2.5392,[107]2.5887,[108]2.5989,[109]2.6081,[110]2.6419,[111]2.6646,[112]2.6445,[113]2.6324,[114]2.6316,[115]2.6293,[116]2.6349,[117]2.6363,[118]2.6405,[119]2.6452,[120]2.6418,[121]2.6332,[122]2.6365,[123]2.6242,[124]2.6237,[125]2.6251,[126]2.6255,[127]2.6244,[128]2.6382,[129]2.6443,[130]2.6434,[131]2.6554,[132]2.6550,[133]2.6534,[134]2.6673,[135]2.6846,[136]2.6782,[137]2.6739,[138]2.6700,[139]2.6586,[140]2.6720,[141]2.6739,[142]2.6658,[143]2.6646,[144]2.6655,[145]2.6647,[146]2.6597,[147]2.6476,[148]2.6422,[149]2.6384,[150]2.6354,[151]2.6285,[152]2.6277,[153]2.6313,[154]2.6310,[155]2.6315,[156]2.6354,[157]2.6379,[158]2.6401,[159]2.6507,[160]2.6597,[161]2.6656,[162]2.6557,[163]2.6441,[164]2.6472,[165]2.6382,[166]2.6356,[167]2.6481,[168]2.6484,[169]2.6716,[170]2.6868,[171]2.6973,[172]2.7149,[173]2.7066,[174]2.6949,[175]2.6827,[176]2.6
718,[177]2.6590,[178]2.6456,[179]2.6350,[180]2.6234,[181]2.6184,[182]2.6319,[183]2.6488,[184]2.6725,[185]2.6898,[186]2.7002,[187]2.7187,[188]2.7413,[189]2.7625,[190]2.7781,[191]2.7931,[192]2.8025,[193]2.8092,[194]2.8138,[195]2.8119,[196]2.8142,[197]2.8272,[198]2.8416,[199]2.8416,[200]2.8485,[201]2.8510,[202]2.8543,[203]2.8531,[204]2.8615,[205]2.8697,[206]2.8765,[207]2.8833,[208]2.8839,[209]2.8869,[210]2.8829,[211]2.8873,[212]2.8886,[213]2.8929,[214]2.8979,[215]2.9013,[216]2.9057,[217]2.9101,[218]2.9178,[219]2.9130,[220]2.9120,[221]2.9100,[222]2.9126,[223]2.9124,[224]2.9189,[225]2.9212,[226]2.9275,[227]2.9253,[228]2.9249,[229]2.9162,[230]2.9085,[231]2.9046,[232]2.9049,[233]2.9034,[234]2.8968,[235]2.8866,[236]2.8795,[237]2.8716,[238]2.8745,[239]2.8886,[240]2.9030,[241]2.9150,[242]2.9255,[243]2.9374,[244]2.9500,[245]2.9637,[246]2.9746,[247]2.9879,[248]2.9986,[249]3.0000,[250]3.0002,[251]2.9900,[252]2.9813,[253]2.9741,[254]2.9710,[255]2.9732,[256]2.9725,[257]2.9681,[258]2.9668,[259]2.9578,[260]2.9518,[261]2.9452,[262]2.9399,[263]2.9340,[264]2.9298,[265]2.9257,[266]2.9231,[267]2.9162,[268]2.9101,[269]2.9062,[270]2.9048,[271]2.9026,[272]2.8979,[273]2.8954,[274]2.8875,[275]2.8806,[276]2.8701,[277]2.8626,[278]2.8535,[279]2.8548,[280]2.8580,[281]2.8616,[282]2.8667,[283]2.8718,[284]2.8733,[285]2.8746,[286]2.8822,[287]2.8928,[288]2.8942,[289]2.8954,[290]2.9000,[291]2.9027,[292]2.8988,[293]2.8906,[294]2.8852,[295]2.8832,[296]2.8771,[297]2.8728,[298]2.8685,[299]2.8644,[300]2.8638,[301]2.8628,[302]2.8594,[303]2.8567,[304]2.8528,[305]2.8469,[306]2.8424,[307]2.8447,[308]2.8507,[309]2.8619,[310]2.8531,[311]2.8476,[312]2.8407,[313]2.8372,[314]2.8331,[315]2.8319,[316]2.8295,[317]2.8284,[318]2.8279,[319]2.8250,[320]2.8231,[321]2.8253,[322]2.8260,[323]2.8204,[324]2.8168,[325]2.8152,[326]2.8128,[327]2.8148,[328]2.8134,[329]2.8135,[330]2.8127,[331]2.8089,[332]2.8107,[333]2.8133,[334]2.8163,[335]2.8164,[336]2.8174,[337]2.8187,[338]2.8192,[339]2.8192,[340]2.8218,[341]2.8244,[342]2.8261,[343
]2.8316,[344]2.8364,[345]2.8451,[346]2.8449,[347]2.8382,[348]2.8320,[349]2.8272,[350]2.8213,[351]2.8150,[352]2.8126,[353]2.8092,[354]2.8035,[355]2.7984,[356]2.7946,[357]2.7895,[358]2.7845,[359]2.7837,[360]2.7790,[361]2.7730,[362]2.7670,[363]2.7619,[364]2.7601,[365]2.7560,[366]2.7535,[367]2.7487,[368]2.7430,[369]2.7386,[370]2.7365,[371]2.7324,[372]2.7324,[373]2.7316,[374]2.7334,[375]2.7305,[376]2.7268,[377]2.7233,[378]2.7215,[379]2.7225,[380]2.7178,[381]2.7150,[382]2.7119,[383]2.7146,[384]2.7210,[385]2.7258,[386]2.7335,[387]2.7382,[388]2.7439,[389]2.7517,[390]2.7543,[391]2.7476,[392]2.7421,[393]2.7359,[394]2.7353,[395]2.7298,[396]2.7253,[397]2.7194,[398]2.7130,[399]2.7079,[400]2.7023,[401]2.6961,[402]2.6906,[403]2.6844,[404]2.6780,[405]2.6727,[406]2.6663,[407]2.6604,[408]2.6543,[409]2.6497,[410]2.6440,[411]2.6389,[412]2.6351,[413]2.6317,[414]2.6300,[415]2.6274,[416]2.6250,[417]2.6201,[418]2.6147,[419]2.6202,[420]2.6160,[421]2.6141,[422]2.6161,[423]2.6136,[424]2.6093,[425]2.6059,[426]2.6036,[427]2.6019,[428]2.5990,[429]2.5947,[430]2.5914,[431]2.5926,[432]2.5890,[433]2.5852,[434]2.5821,[435]2.5788,[436]2.5740,[437]2.5688,[438]2.5648,[439]2.5640,[440]2.5608,[441]2.5589,[442]2.5548,[443]2.5603,[444]2.5678,[445]2.5659,[446]2.5650,[447]2.5673,[448]2.5690,[449]2.5750,[450]2.5764,[451]2.5785,[452]2.5825,[453]2.5899,[454]2.5952,[455]2.5982,[456]2.6037,[457]2.6025,[458]2.6064,[459]2.6089,[460]2.6154,[461]2.6214,[462]2.6247,[463]2.6249,[464]2.6238,[465]2.6232,[466]2.6280,[467]2.6276,[468]2.6250,[469]2.6305,[470]2.6321,[471]2.6348,[472]2.6381,[473]2.6400,[474]2.6417,[475]2.6438,[476]2.6465,[477]2.6500,[478]2.6525,[479]2.6550,[480]2.6571,[481]2.6606,[482]2.6627,[483]2.6658,[484]2.6632,[485]2.6676,[486]2.6698,[487]2.6760,[488]2.6812,[489]2.6869,[490]2.6865,[491]2.6921,[492]2.6970,[493]2.7009,[494]2.7056,[495]2.7110,[496]2.7111,[497]2.7126,[498]2.7146,[499]2.7171,[500]2.7204,[501]2.7217,[502]2.7235,[503]2.7285,[504]2.7339,[505]2.7348,[506]2.7351,[507]2.7368,[508]2.7405,[509]2.7463,
[510]2.7493,[511]2.7540,[512]2.7488,[513]2.7442,[514]2.7394,[515]2.7362,[516]2.7334,[517]2.7309,[518]2.7274,[519]2.7232,[520]2.7214,[521]2.7181,[522]2.7142,[523]2.7111,[524]2.7137,[525]2.7111,[526]2.7077,[527]2.7077,[528]2.7056,[529]2.7020,[530]2.6989,[531]2.6965,[532]2.6953,[533]2.6930,[534]2.6921,[535]2.6900,[536]2.6881,[537]2.6835,[538]2.6798,[539]2.6760,[540]2.6756,[541]2.6753,[542]2.6732,[543]2.6716,[544]2.6713,[545]2.6695,[546]2.6693,[547]2.6663,[548]2.6641,[549]2.6614,[550]2.6572,[551]2.6526,[552]2.6489,[553]2.6450,[554]2.6413,[555]2.6371,[556]2.6337,[557]2.6295,[558]2.6293,[559]2.6263,[560]2.6254,[561]2.6261,[562]2.6265,[563]2.6294,[564]2.6314,[565]2.6298,
llama_print_timings:        load time =  568847.84 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 2025714.03 ms / 289280 tokens (    7.00 ms per token,   142.80 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 2036661.57 ms / 289281 tokens

Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6298 +/- 0.01396
  • ik: Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6298 +/- 0.01396
  • mainline: Final estimate: PPL = 2.6301 +/- 0.01396
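
The two final estimates are statistically indistinguishable; a quick check of how the gap compares to the reported standard error:

```python
# Compare ik_llama.cpp vs mainline llama.cpp perplexity on wiki.test.raw (565 chunks).
ik_ppl = 2.6298
mainline_ppl = 2.6301
stderr = 0.01396  # the +/- reported by both runs

diff = abs(mainline_ppl - ik_ppl)
# ~0.02 standard errors apart: well within noise.
print(f"diff = {diff:.4f} ({diff / stderr:.3f} sigma)")
```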

Okay, I'll start imatrix on the BF16 and try to get out at least one quant today for easier further testing.

compute_imatrix: 17.51 seconds per pass - ETA 3 hours 57.20 minutes

lol, gonna take a while, since I've disabled all the fused ops and am using -mla 1 to get importance data on all tensors
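
As a cross-check on the log, the 351.00 MiB KV buffer follows directly from the printed metadata: with MLA only the compressed c^KV (kv_lora_rank = 512 plus 64 RoPE dims) is cached per token per layer, and block 78, the unused NextN/MTP layer, is skipped. A sketch under those assumptions:

```python
# MLA KV cache size for the llama-perplexity run above (f16 cache, no kv^T stored).
n_ctx = 4096                 # perplexity batch context (-ub 4096)
n_layers = 79 - 1            # 79 blocks, minus the unused NextN layer (blk.78)
kv_lora_rank = 512           # glm-dsa.attention.kv_lora_rank
n_rot = 64                   # glm-dsa.rope.dimension_count (RoPE part of the key)
bytes_per_elem = 2           # f16

kv_bytes = n_ctx * n_layers * (kv_lora_rank + n_rot) * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**20:.2f} MiB")  # 351.00 MiB, matching the log
```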

@ubergarm
Contributor

Okay, looking good on cpu-only in my testing. I commented on the Issue that some GGUFs are coming in now: https://huggingface.co/ubergarm/GLM-5-GGUF

@gapeleon
Contributor

UD-Q2_K_XL is working for me with six RTX 3090s plus most of the model on the CPU.

@sayap
Contributor

sayap commented Feb 15, 2026

Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using -ncmoe 74 to keep most FFN tensors on the CPU, I notice that VRAM usage on the single 3090 starts lower and stays flat with -mla 1, compared to with -mla 3. PP and TG are about the same.

@magikRUKKOLA

smol-IQ1_KT, fully offloaded on an RTX 3090.

#1265 (comment)

Everything is working.

@ikawrakow
Owner Author

@sayap

Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using -ncmoe 74 to keep most FFN tensors on the CPU, I notice that VRAM usage on the single 3090 starts lower and stays flat with -mla 1, compared to with -mla 3. PP and TG are about the same.

You see the difference between mla=3 and mla=1 in PP at long context (TG is exactly the same). For fully offloaded models the difference is ~2X at a context of 100k tokens. In hybrid inference, where PP is dominated (or at least heavily influenced) by the time it takes to offload the MoE tensors to the GPU, the difference will of course be much smaller, and you may not even notice it unless you go well beyond a 10k-token context.

But yes, mla=3 uses more VRAM. You can mitigate this to some extent by adding -amb 512 to your command line.

@ikawrakow
Owner Author

Thank you everybody for testing, I'll merge it.

ikawrakow merged commit 528cadb into main on Feb 15, 2026
@ubergarm
Contributor

It seems like a smart model, capable of local opencode vibe coding, but with A40B it feels noticeably slower than, say, Kimi-K2.5's A32B.

sweep-bench-GLM-5-IQ2_KL

I could probably be running it better, but it seemed like all the kv-cache went onto a single GPU. It locked up when I tried to start it with --n-cpu-moe 55, or perhaps I just wasn't patient enough. The dual-GPU rig was run with -ctk f16, and it seemed about 0.5 tok/sec faster later when I tried -ctk q8_0, as shown for the CPU rig... anyway,
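
The "A40B feels slower than A32B" impression matches a naive memory-bandwidth-bound estimate: at comparable quantization, token generation streams ~25% more active weights per token. A toy sketch, where the bytes-per-param average is an assumption for a ~2.7 BPW quant like IQ2_KL, not a measurement:

```python
# Naive bandwidth-bound ceiling for token generation:
# t/s <= memory bandwidth / active bytes read per token.
bandwidth_gb_s = 267.74  # the EPYC rig's mlc figure quoted in the details
bytes_per_param = 0.34   # ASSUMPTION: ~2.7 bits/weight quant

def tg_ceiling(active_params_b: float) -> float:
    """Upper bound on tokens/s if every active weight streams from RAM once per token."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

glm5_tps = tg_ceiling(40.0)  # GLM-5: ~40B active params per token
kimi_tps = tg_ceiling(32.0)  # Kimi-K2.5: ~32B active params per token
print(f"GLM-5 ceiling ~{glm5_tps:.1f} t/s, Kimi-K2.5 ~{kimi_tps:.1f} t/s, "
      f"ratio {kimi_tps / glm5_tps:.2f}x")
```

The measured 15.61 t/s at zero context on the EPYC rig sits plausibly below this ceiling, and the ratio is just 40/32 = 1.25x regardless of the assumed constants.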

👈 Details

AMD Thread Ripper Pro (Zen 4) 7965WX 24-core, 8x32GiB DDR5@4800 (221.41 GB/s via mlc) + dual RTX A6000 (48GB VRAM each)
Driver: 580.105.08, CUDA: 13.0, P2P: OK, NCCL found!

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 131072 \
    -ctk q8_0 \
    -ger \
    --merge-qkv \
    -mla 3 -amb 2048 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    --n-predict 64
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4096 | 64 | 0 | 15.145 | 270.45 | 8.085 | 7.92 |
| 4096 | 64 | 4096 | 17.705 | 231.35 | 9.548 | 6.70 |
| 4096 | 64 | 8192 | 20.260 | 202.18 | 10.670 | 6.00 |
| 4096 | 64 | 12288 | 22.753 | 180.02 | 11.735 | 5.45 |
| 4096 | 64 | 16384 | 25.351 | 161.57 | 12.930 | 4.95 |
| 4096 | 64 | 20480 | 27.924 | 146.68 | 14.074 | 4.55 |
| 4096 | 64 | 24576 | 30.351 | 134.95 | 15.228 | 4.20 |
| 4096 | 64 | 28672 | 33.014 | 124.07 | 16.263 | 3.94 |
| 4096 | 64 | 32768 | 35.612 | 115.02 | 17.470 | 3.66 |
| 4096 | 64 | 36864 | 38.213 | 107.19 | 18.544 | 3.45 |
| 4096 | 64 | 40960 | 40.919 | 100.10 | 19.672 | 3.25 |
| 4096 | 64 | 45056 | 43.957 | 93.18 | 21.018 | 3.04 |
| 4096 | 64 | 49152 | 46.091 | 88.87 | 22.050 | 2.90 |
| 4096 | 64 | 53248 | 48.716 | 84.08 | 23.253 | 2.75 |
| 4096 | 64 | 57344 | 51.570 | 79.43 | 24.529 | 2.61 |
| 4096 | 64 | 61440 | 54.003 | 75.85 | 25.641 | 2.50 |
| 4096 | 64 | 65536 | 56.906 | 71.98 | 26.761 | 2.39 |
| 4096 | 64 | 69632 | 59.505 | 68.83 | 28.128 | 2.28 |
| 4096 | 64 | 73728 | 62.009 | 66.05 | 29.170 | 2.19 |
| 4096 | 64 | 77824 | 64.813 | 63.20 | 30.358 | 2.11 |
| 4096 | 64 | 81920 | 67.217 | 60.94 | 32.022 | 2.00 |
| 4096 | 64 | 86016 | 70.160 | 58.38 | 33.060 | 1.94 |
| 4096 | 64 | 90112 | 72.723 | 56.32 | 34.276 | 1.87 |
| 4096 | 64 | 94208 | 75.524 | 54.23 | 35.663 | 1.79 |
| 4096 | 64 | 98304 | 78.030 | 52.49 | 36.861 | 1.74 |
| 4096 | 64 | 102400 | 80.761 | 50.72 | 38.035 | 1.68 |
| 4096 | 64 | 106496 | 83.356 | 49.14 | 39.519 | 1.62 |
| 4096 | 64 | 110592 | 85.876 | 47.70 | 40.791 | 1.57 |

AMD EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s NPS1 Single Socket (267.74 GB/s via mlc)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 131072 \
    -ctk q8_0 \
    -ger \
    --merge-qkv \
    -mla 3 \
    --threads 92 \
    --threads-batch 128 \
    -ub 4096 -b 4096 \
    --no-mmap \
    --numa numactl \
    --warmup-batch \
    --n-predict 64
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |     64 |      0 |   22.133 |   185.06 |    4.099 |    15.61 |
|  4096 |     64 |   4096 |   25.907 |   158.10 |    4.679 |    13.68 |
|  4096 |     64 |   8192 |   31.938 |   128.25 |    5.082 |    12.59 |
|  4096 |     64 |  12288 |   36.653 |   111.75 |    6.243 |    10.25 |
|  4096 |     64 |  16384 |   47.737 |    85.80 |    6.502 |     9.84 |
|  4096 |     64 |  20480 |   53.877 |    76.03 |    6.829 |     9.37 |
|  4096 |     64 |  24576 |   58.889 |    69.55 |    6.879 |     9.30 |
|  4096 |     64 |  28672 |   63.065 |    64.95 |    7.222 |     8.86 |
|  4096 |     64 |  32768 |   73.853 |    55.46 |    7.427 |     8.62 |
|  4096 |     64 |  36864 |   80.356 |    50.97 |    7.667 |     8.35 |
|  4096 |     64 |  40960 |   88.664 |    46.20 |    8.059 |     7.94 |
|  4096 |     64 |  45056 |   98.211 |    41.71 |    8.133 |     7.87 |
|  4096 |     64 |  49152 |   97.049 |    42.21 |    8.461 |     7.56 |
|  4096 |     64 |  53248 |  103.175 |    39.70 |    8.738 |     7.32 |
|  4096 |     64 |  57344 |  109.571 |    37.38 |    8.872 |     7.21 |
|  4096 |     64 |  61440 |  110.894 |    36.94 |    9.309 |     6.88 |
|  4096 |     64 |  65536 |  120.752 |    33.92 |    9.396 |     6.81 |
|  4096 |     64 |  69632 |  127.047 |    32.24 |    9.666 |     6.62 |
|  4096 |     64 |  73728 |  132.885 |    30.82 |   10.106 |     6.33 |
|  4096 |     64 |  77824 |  134.909 |    30.36 |   10.103 |     6.33 |
|  4096 |     64 |  81920 |  140.021 |    29.25 |   10.503 |     6.09 |
|  4096 |     64 |  86016 |  147.248 |    27.82 |   10.741 |     5.96 |
|  4096 |     64 |  90112 |  151.927 |    26.96 |   10.951 |     5.84 |
|  4096 |     64 |  94208 |  160.152 |    25.58 |   11.260 |     5.68 |
|  4096 |     64 |  98304 |  161.398 |    25.38 |   11.430 |     5.60 |
|  4096 |     64 | 102400 |  171.083 |    23.94 |   11.781 |     5.43 |
|  4096 |     64 | 106496 |  173.032 |    23.67 |   12.085 |     5.30 |
|  4096 |     64 | 110592 |  179.535 |    22.81 |   12.229 |     5.23 |
|  4096 |     64 | 114688 |  188.600 |    21.72 |   12.684 |     5.05 |
|  4096 |     64 | 118784 |  192.993 |    21.22 |   12.799 |     5.00 |
|  4096 |     64 | 122880 |  200.297 |    20.45 |   13.056 |     4.90 |
|  4096 |     64 | 126976 |  208.264 |    19.67 |   13.319 |     4.81 |

Off topic: I had to tease Ling-2.5-1T about their A63B, haha...

@magikRUKKOLA

magikRUKKOLA commented Feb 16, 2026

@ubergarm

Off topic I had to tease Ling-2.5-1T about their A63B haha...

Yeah, the only way out of it is 24-channel DDR5 (2x12-channel EPYC) ...

[EDIT]: or, perhaps 16x3090 haha
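For reference, the peak numbers behind that remark follow from channels × 8 bytes/transfer × transfer rate; real-world bandwidth (the mlc figures quoted in this thread) comes in well below peak. All configurations below are just the ones mentioned in this thread.

```python
# Rough theoretical peak memory bandwidth for the configs discussed above.
# Measured bandwidth (e.g. via mlc) is typically well below these numbers.

def peak_bw_gbs(channels: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s: channels * 8 bytes per transfer * MT/s."""
    return channels * 8 * mt_per_s / 1e3

print(peak_bw_gbs(12, 6400))  # single-socket 12-channel EPYC:  614.4 GB/s
print(peak_bw_gbs(24, 6400))  # dual-socket 2x12 channels:     1228.8 GB/s
print(peak_bw_gbs(8, 4800))   # 8-channel DDR5-4800 TR Pro:     307.2 GB/s
```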

@magikRUKKOLA

magikRUKKOLA commented Feb 18, 2026

@ubergarm

Threadripper PRO 3975WX, DDR4 (8x32GiB) at 3000 MT/s (the "configured" frequency reported by dmidecode -t memory; possibly 3200 MT/s as set in the BIOS); 2x3090. IQ2_KL bench:

main: n_kv_max = 202752, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

 |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
 |-------|--------|--------|----------|----------|----------|----------|
 |  4096 |   1024 |      0 |   26.183 |   156.44 |  144.328 |     7.09 |
 |  4096 |   1024 |   4096 |   23.768 |   172.33 |  106.390 |     9.62 |
 |  4096 |   1024 |   8192 |   25.058 |   163.46 |  108.449 |     9.44 |
 |  4096 |   1024 |  12288 |   27.053 |   151.40 |  110.740 |     9.25 |
 |  4096 |   1024 |  16384 |   28.997 |   141.26 |  113.434 |     9.03 |
 |  4096 |   1024 |  20480 |   31.012 |   132.08 |  115.806 |     8.84 |
 |  4096 |   1024 |  24576 |   33.097 |   123.76 |  118.370 |     8.65 |
 |  4096 |   1024 |  28672 |   35.234 |   116.25 |  121.127 |     8.45 |
 |  4096 |   1024 |  32768 |   37.173 |   110.19 |  124.167 |     8.25 |
 |  4096 |   1024 |  36864 |   39.440 |   103.85 |  126.486 |     8.10 |
 |  4096 |   1024 |  40960 |   41.472 |    98.76 |  129.209 |     7.93 |
 |  4096 |   1024 |  45056 |   43.464 |    94.24 |  132.391 |     7.73 |
 |  4096 |   1024 |  49152 |   45.521 |    89.98 |  134.643 |     7.61 |
 |  4096 |   1024 |  53248 |   47.513 |    86.21 |  137.547 |     7.44 |
 |  4096 |   1024 |  57344 |   49.618 |    82.55 |  140.538 |     7.29 |
 |  4096 |   1024 |  61440 |   51.612 |    79.36 |  143.031 |     7.16 |
 |  4096 |   1024 |  65536 |   53.687 |    76.29 |  145.502 |     7.04 |
 |  4096 |   1024 |  69632 |   55.792 |    73.42 |  148.412 |     6.90 |

How come the decode is significantly better than yours?

[EDIT]:

Ah! I see! Someone recently asked me why I say that partial offload of MoE models with MLA results in a performance drop. Well, that's why.
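One plausible way partial expert offload can end up slower despite VRAM being much faster than RAM is fixed per-token cross-device overhead. A back-of-envelope sketch, with every number invented purely to show the shape of the effect:

```python
# Naively, decode is memory-bandwidth bound, so moving some expert weights
# into VRAM should help. But each token may also pay a fixed per-token
# synchronization/transfer overhead between CPU and GPUs, which can eat
# the gain. All numbers below are hypothetical.

def decode_tok_per_s(active_bytes, gpu_frac, ram_bw, vram_bw, sync_s=0.0):
    t = active_bytes * (1 - gpu_frac) / ram_bw  # experts streamed from RAM
    t += active_bytes * gpu_frac / vram_bw      # experts read from VRAM
    t += sync_s                                 # fixed per-token overhead
    return 1.0 / t

active, ram_bw, vram_bw = 12e9, 220e9, 900e9    # hypothetical sizes/speeds
print(decode_tok_per_s(active, 0.0, ram_bw, vram_bw))        # all experts on CPU
print(decode_tok_per_s(active, 0.3, ram_bw, vram_bw))        # ideal 30% offload: faster
print(decode_tok_per_s(active, 0.3, ram_bw, vram_bw, 0.02))  # + 20 ms overhead: slower
```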

I also ran the perplexity test with the q8_0 KV cache. It matches your results.

Final estimate: PPL over 565 chunks for n_ctx=512 = 3.0234 +/- 0.01654

(Yours is 3.0217.)

    -ot "blk\.(3|4|5|6|7|8|9|10|11|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \

So you have two beefy 48GB GPUs and thought you could boost decode by offloading some expert layers into the free VRAM of each GPU, right?

[EDIT2]: I see about zero difference in performance between the q8_0 and f16 KV caches.

[EDIT3]: -rtr seems to help a little:

main: n_kv_max = 202752, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   20.573 |   199.10 |  100.267 |    10.21 |
|  4096 |   1024 |   4096 |   22.431 |   182.60 |  103.879 |     9.86 |
|  4096 |   1024 |   8192 |   24.483 |   167.30 |  106.619 |     9.60 |
|  4096 |   1024 |  12288 |   26.539 |   154.34 |  109.323 |     9.37 |
|  4096 |   1024 |  16384 |   28.556 |   143.44 |  112.025 |     9.14 |
|  4096 |   1024 |  20480 |   30.678 |   133.52 |  114.591 |     8.94 |
|  4096 |   1024 |  24576 |   32.734 |   125.13 |  117.374 |     8.72 |
|  4096 |   1024 |  28672 |   34.924 |   117.28 |  120.091 |     8.53 |
|  4096 |   1024 |  32768 |   36.979 |   110.77 |  122.779 |     8.34 |
|  4096 |   1024 |  36864 |   39.076 |   104.82 |  125.548 |     8.16 |
|  4096 |   1024 |  40960 |   41.229 |    99.35 |  128.751 |     7.95 |
|  4096 |   1024 |  45056 |   43.322 |    94.55 |  131.759 |     7.77 |
|  4096 |   1024 |  49152 |   45.393 |    90.23 |  134.542 |     7.61 |
|  4096 |   1024 |  53248 |   47.450 |    86.32 |  138.740 |     7.38 |
|  4096 |   1024 |  57344 |   49.526 |    82.70 |  140.186 |     7.30 |
|  4096 |   1024 |  61440 |   51.546 |    79.46 |  143.028 |     7.16 |
|  4096 |   1024 |  65536 |   53.599 |    76.42 |  145.684 |     7.03 |
|  4096 |   1024 |  69632 |   55.642 |    73.61 |  147.907 |     6.92 |
|  4096 |   1024 |  73728 |   53.159 |    77.05 |  151.383 |     6.76 |
|  4096 |   1024 |  77824 |   54.653 |    74.95 |  153.551 |     6.67 |
|  4096 |   1024 |  81920 |   56.584 |    72.39 |  156.743 |     6.53 |
|  4096 |   1024 |  86016 |   58.340 |    70.21 |  159.711 |     6.41 |
|  4096 |   1024 |  90112 |   59.891 |    68.39 |  162.103 |     6.32 |
|  4096 |   1024 |  94208 |   61.802 |    66.28 |  165.163 |     6.20 |
|  4096 |   1024 |  98304 |   63.479 |    64.53 |  167.696 |     6.11 |
|  4096 |   1024 | 102400 |   65.416 |    62.61 |  169.621 |     6.04 |
|  4096 |   1024 | 106496 |   67.154 |    60.99 |  173.650 |     5.90 |
|  4096 |   1024 | 110592 |   68.601 |    59.71 |  175.442 |     5.84 |
|  4096 |   1024 | 114688 |   70.407 |    58.18 |  179.221 |     5.71 |
|  4096 |   1024 | 118784 |   72.442 |    56.54 |  181.863 |     5.63 |
|  4096 |   1024 | 122880 |   74.356 |    55.09 |  184.925 |     5.54 |
|  4096 |   1024 | 126976 |   75.837 |    54.01 |  187.058 |     5.47 |

@ubergarm ubergarm mentioned this pull request Feb 19, 2026
@InfernalDread

Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using -ncmoe 74 to keep most FFN tensors on the CPU, I notice that VRAM usage on the single 3090 starts lower and stays flat with -mla 1, compared to -mla 3. PP and TG are about the same.

Hello! Is it possible for you to share that quant on Hugging Face? I would really like to try it. Thank you!

@magikRUKKOLA

@ubergarm

Off topic: I had to tease Ling-2.5-1T about their A63B, haha...

Apparently it's a garbage LLM (see the test from xCreate on YouTube). Not worth investing time in.

@sayap
Contributor

sayap commented Feb 28, 2026

@InfernalDread I uploaded it to https://huggingface.co/sokann/GLM-5-GGUF-1.594bpw, no README yet 😅

@InfernalDread

@InfernalDread I uploaded it to https://huggingface.co/sokann/GLM-5-GGUF-1.594bpw, no README yet 😅

Thank you very much! Can't wait to try it out and see if this model can still pack a punch with this much quantization!

