Conversation
How well does it run? It's quite the fat boy.
Loaded and running:
model=/mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf
numactl --interleave=all \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa distribute \
--threads 160 \
--threads-batch 192 \
--validate-quants \
--no-mmap
SOCKET is set to: 0
main: build = 4193 (9d9c6261)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed = 1337
CPU: using device CPU - 0 MiB free
llama_model_loader: additional 32 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 58 key-value pairs and 1809 tensors from /mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm-dsa
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 3: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 4: general.name str = GLM 5
llama_model_loader: - kv 5: general.version str = 5
llama_model_loader: - kv 6: general.basename str = GLM
llama_model_loader: - kv 7: general.size_label str = 256x22B
llama_model_loader: - kv 8: general.license str = mit
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 10: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 11: glm-dsa.block_count u32 = 79
llama_model_loader: - kv 12: glm-dsa.context_length u32 = 202752
llama_model_loader: - kv 13: glm-dsa.embedding_length u32 = 6144
llama_model_loader: - kv 14: glm-dsa.feed_forward_length u32 = 12288
llama_model_loader: - kv 15: glm-dsa.attention.head_count u32 = 64
llama_model_loader: - kv 16: glm-dsa.attention.head_count_kv u32 = 1
llama_model_loader: - kv 17: glm-dsa.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: glm-dsa.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: glm-dsa.expert_used_count u32 = 8
llama_model_loader: - kv 20: glm-dsa.expert_group_count u32 = 1
llama_model_loader: - kv 21: glm-dsa.expert_group_used_count u32 = 1
llama_model_loader: - kv 22: glm-dsa.expert_gating_func u32 = 2
llama_model_loader: - kv 23: glm-dsa.attention.key_length u32 = 576
llama_model_loader: - kv 24: glm-dsa.attention.value_length u32 = 512
llama_model_loader: - kv 25: general.file_type u32 = 32
llama_model_loader: - kv 26: glm-dsa.leading_dense_block_count u32 = 3
llama_model_loader: - kv 27: glm-dsa.vocab_size u32 = 154880
llama_model_loader: - kv 28: glm-dsa.attention.q_lora_rank u32 = 2048
llama_model_loader: - kv 29: glm-dsa.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 30: glm-dsa.attention.key_length_mla u32 = 256
llama_model_loader: - kv 31: glm-dsa.attention.value_length_mla u32 = 256
llama_model_loader: - kv 32: glm-dsa.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 33: glm-dsa.expert_count u32 = 256
llama_model_loader: - kv 34: glm-dsa.expert_shared_count u32 = 1
llama_model_loader: - kv 35: glm-dsa.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 36: glm-dsa.expert_weights_norm bool = true
llama_model_loader: - kv 37: glm-dsa.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: glm-dsa.nextn_predict_layers u32 = 1
llama_model_loader: - kv 39: glm-dsa.attention.indexer.head_count u32 = 32
llama_model_loader: - kv 40: glm-dsa.attention.indexer.key_length u32 = 128
llama_model_loader: - kv 41: glm-dsa.attention.indexer.top_k u32 = 2048
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 44: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 45: tokenizer.ggml.tokens arr[str,154880] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 46: tokenizer.ggml.token_type arr[i32,154880] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 47: tokenizer.ggml.merges arr[str,321649] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 48: tokenizer.ggml.eos_token_id u32 = 154820
llama_model_loader: - kv 49: tokenizer.ggml.padding_token_id u32 = 154820
llama_model_loader: - kv 50: tokenizer.ggml.bos_token_id u32 = 154822
llama_model_loader: - kv 51: tokenizer.ggml.eot_token_id u32 = 154827
llama_model_loader: - kv 52: tokenizer.ggml.unknown_token_id u32 = 154820
llama_model_loader: - kv 53: tokenizer.ggml.eom_token_id u32 = 154829
llama_model_loader: - kv 54: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 55: split.no u16 = 0
llama_model_loader: - kv 56: split.count u16 = 33
llama_model_loader: - kv 57: split.tensors.count i32 = 1809
llama_model_loader: - type f32: 630 tensors
llama_model_loader: - type bf16: 1179 tensors
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 154820 ('<|endoftext|>')
load: - 154827 ('<|user|>')
load: - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = glm-dsa
llm_load_print_meta: n_ctx_train = 202752
llm_load_print_meta: n_embd = 6144
llm_load_print_meta: n_layer = 79
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 64
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 16384
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 202752
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 744B.A40B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 753.864 B
llm_load_print_meta: model size = 1404.406 GiB (16.003 BPW)
llm_load_print_meta: repeating layers = 1400.861 GiB (16.003 BPW, 751.961 B parameters)
llm_load_print_meta: general.name = GLM 5
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 2048
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.0000
print_info: vocab type = BPE
print_info: n_vocab = 154880
print_info: n_merges = 321649
print_info: BOS token = 154822 '[gMASK]'
print_info: EOS token = 154820 '<|endoftext|>'
print_info: EOT token = 154827 '<|user|>'
print_info: EOM token = 154829 '<|observation|>'
print_info: UNK token = 154820 '<|endoftext|>'
print_info: PAD token = 154820 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 154838 '<|code_prefix|>'
print_info: FIM SUF token = 154840 '<|code_suffix|>'
print_info: FIM MID token = 154839 '<|code_middle|>'
print_info: EOG token = 154820 '<|endoftext|>'
print_info: EOG token = 154827 '<|user|>'
print_info: EOG token = 154829 '<|observation|>'
print_info: max token length = 1024
llm_load_tensors: ggml ctx size = 0.72 MiB
model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.attn_q_b.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 7077888 bytes) -- ignoring
model has unused tensor blk.78.attn_output.weight (size = 201326592 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.proj.weight (size = 393216 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_k.weight (size = 1572864 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_q_b.weight (size = 16777216 bytes) -- ignoring
model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
model has unused tensor blk.78.exp_probs_b.bias (size = 1024 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_exps.weight (size = 6442450944 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_shexp.weight (size = 25165824 bytes) -- ignoring
model has unused tensor blk.78.nextn.eh_proj.weight (size = 150994944 bytes) -- ignoring
model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/80 layers to GPU
llm_load_tensors: CPU buffer size = 1419153.34 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 79 wkv_b tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
Computed blk.0.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.1.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.2.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.3.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.4.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.5.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.6.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.7.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.8.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.9.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.10.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.11.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.12.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.13.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.14.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.15.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.16.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.17.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.18.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.19.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.20.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.21.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.22.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.23.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.24.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.25.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.26.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.27.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.28.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.29.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.30.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.31.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.32.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.33.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.34.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.35.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.36.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.37.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.38.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.39.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.40.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.41.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.42.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.43.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.44.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.45.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.46.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.47.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.48.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.49.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.50.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.51.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.52.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.53.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.54.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.55.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.56.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.57.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.58.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.59.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.60.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.61.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.62.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.63.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.64.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.65.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.66.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.67.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.68.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.69.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.70.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.71.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.72.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.73.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.74.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.75.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.76.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.77.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
Computed blk.78.attn_kv_b.weight as 512 x 28672 and stored in buffer CPU
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 351.00 MiB
llama_init_from_model: KV self size = 351.00 MiB, c^KV (f16): 351.00 MiB, kv^T: not used
llama_init_from_model: CPU output buffer size = 4.73 MiB
llama_init_from_model: CPU compute buffer size = 2516.00 MiB
llama_init_from_model: graph nodes = 4322
llama_init_from_model: graph splits = 1
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
system_info: n_threads = 160 (n_threads_batch = 192) / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 806.999 ms
perplexity: calculating perplexity over 565 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 30.39 seconds per pass - ETA 35.77 minutes
===================================== llama_init_from_model: f16
======================================= HAVE_FANCY_SIMD is defined
[1]1.2323,[2]1.9955,[3]1.7660,[4]1.5587,[5]1.4515,[6]1.3944,[7]1.3624,[8]1.3339,[9]1.3250,[10]1.3046,[11]1.2953,[12]1.3352,[13]1.3332,[14]1.3939,[15]1.4936,[16]1.5983,[17]1.7117,[18]1.8544,[19]1.8493,[20]1.8449,[21]1.9193,[22]1.9402,[23]1.9294,[24]1.9136,[25]1.9017,[26]1.9003,[27]1.9125,[28]1.9398,[29]1.9552,[30]2.0111,[31]2.0660,[32]2.1039,[33]2.1476,[34]2.1767,[35]2.2220,[36]2.2603,[37]2.2864,[38]2.3722,[39]2.4110,[40]2.4555,[41]2.5189,[42]2.5115,[43]2.5278,[44]2.5554,[45]2.6261,[46]2.6770,[47]2.6392,[48]2.6017,[49]2.5755,[50]2.5630,[51]2.5790,[52]2.6066,[53]2.6430,[54]2.6685,[55]2.6946,[56]2.7218,[57]2.7202,[58]2.7435,[59]2.7584,[60]2.7930,[61]2.8250,[62]2.8721,[63]2.9093,[64]2.9362,[65]2.9533,[66]2.9495,[67]2.9246,[68]2.9075,[69]2.9296,[70]2.9142,[71]2.8983,[72]2.8982,[73]2.9047,[74]2.9300,[75]2.9327,[76]2.8994,[77]2.8663,[78]2.8372,[79]2.8100,[80]2.7875,[81]2.7640,[82]2.7522,[83]2.7503,[84]2.7261,[85]2.7136,[86]2.7047,[87]2.6949,[88]2.6780,[89]2.6583,[90]2.6464,[91]2.6281,[92]2.6078,[93]2.5970,[94]2.5820,[95]2.5686,[96]2.5604,[97]2.5682,[98]2.5607,[99]2.5472,[100]2.5284,[101]2.5370,[102]2.5224,[103]2.5127,[104]2.5063,[105]2.5162,[106]2.5392,[107]2.5887,[108]2.5989,[109]2.6081,[110]2.6419,[111]2.6646,[112]2.6445,[113]2.6324,[114]2.6316,[115]2.6293,[116]2.6349,[117]2.6363,[118]2.6405,[119]2.6452,[120]2.6418,[121]2.6332,[122]2.6365,[123]2.6242,[124]2.6237,[125]2.6251,[126]2.6255,[127]2.6244,[128]2.6382,[129]2.6443,[130]2.6434,[131]2.6554,[132]2.6550,[133]2.6534,[134]2.6673,[135]2.6846,[136]2.6782,[137]2.6739,[138]2.6700,[139]2.6586,[140]2.6720,[141]2.6739,[142]2.6658,[143]2.6646,[144]2.6655,[145]2.6647,[146]2.6597,[147]2.6476,[148]2.6422,[149]2.6384,[150]2.6354,[151]2.6285,[152]2.6277,[153]2.6313,[154]2.6310,[155]2.6315,[156]2.6354,[157]2.6379,[158]2.6401,[159]2.6507,[160]2.6597,[161]2.6656,[162]2.6557,[163]2.6441,[164]2.6472,[165]2.6382,[166]2.6356,[167]2.6481,[168]2.6484,[169]2.6716,[170]2.6868,[171]2.6973,[172]2.7149,[173]2.7066,[174]2.6949,[175]2.6827,
[176]2.6718,[177]2.6590,[178]2.6456,[179]2.6350,[180]2.6234,[181]2.6184,[182]2.6319,[183]2.6488,[184]2.6725,[185]2.6898,[186]2.7002,[187]2.7187,[188]2.7413,[189]2.7625,[190]2.7781,[191]2.7931,[192]2.8025,[193]2.8092,[194]2.8138,[195]2.8119,[196]2.8142,[197]2.8272,[198]2.8416,[199]2.8416,[200]2.8485,[201]2.8510,[202]2.8543,[203]2.8531,[204]2.8615,[205]2.8697,[206]2.8765,[207]2.8833,[208]2.8839,[209]2.8869,[210]2.8829,[211]2.8873,[212]2.8886,[213]2.8929,[214]2.8979,[215]2.9013,[216]2.9057,[217]2.9101,[218]2.9178,[219]2.9130,[220]2.9120,[221]2.9100,[222]2.9126,[223]2.9124,[224]2.9189,[225]2.9212,[226]2.9275,[227]2.9253,[228]2.9249,[229]2.9162,[230]2.9085,[231]2.9046,[232]2.9049,[233]2.9034,[234]2.8968,[235]2.8866,[236]2.8795,[237]2.8716,[238]2.8745,[239]2.8886,[240]2.9030,[241]2.9150,[242]2.9255,[243]2.9374,[244]2.9500,[245]2.9637,[246]2.9746,[247]2.9879,[248]2.9986,[249]3.0000,[250]3.0002,[251]2.9900,[252]2.9813,[253]2.9741,[254]2.9710,[255]2.9732,[256]2.9725,[257]2.9681,[258]2.9668,[259]2.9578,[260]2.9518,[261]2.9452,[262]2.9399,[263]2.9340,[264]2.9298,[265]2.9257,[266]2.9231,[267]2.9162,[268]2.9101,[269]2.9062,[270]2.9048,[271]2.9026,[272]2.8979,[273]2.8954,[274]2.8875,[275]2.8806,[276]2.8701,[277]2.8626,[278]2.8535,[279]2.8548,[280]2.8580,[281]2.8616,[282]2.8667,[283]2.8718,[284]2.8733,[285]2.8746,[286]2.8822,[287]2.8928,[288]2.8942,[289]2.8954,[290]2.9000,[291]2.9027,[292]2.8988,[293]2.8906,[294]2.8852,[295]2.8832,[296]2.8771,[297]2.8728,[298]2.8685,[299]2.8644,[300]2.8638,[301]2.8628,[302]2.8594,[303]2.8567,[304]2.8528,[305]2.8469,[306]2.8424,[307]2.8447,[308]2.8507,[309]2.8619,[310]2.8531,[311]2.8476,[312]2.8407,[313]2.8372,[314]2.8331,[315]2.8319,[316]2.8295,[317]2.8284,[318]2.8279,[319]2.8250,[320]2.8231,[321]2.8253,[322]2.8260,[323]2.8204,[324]2.8168,[325]2.8152,[326]2.8128,[327]2.8148,[328]2.8134,[329]2.8135,[330]2.8127,[331]2.8089,[332]2.8107,[333]2.8133,[334]2.8163,[335]2.8164,[336]2.8174,[337]2.8187,[338]2.8192,[339]2.8192,[340]2.8218,[341]2.8244,[342]2.8261,
[343]2.8316,[344]2.8364,[345]2.8451,[346]2.8449,[347]2.8382,[348]2.8320,[349]2.8272,[350]2.8213,[351]2.8150,[352]2.8126,[353]2.8092,[354]2.8035,[355]2.7984,[356]2.7946,[357]2.7895,[358]2.7845,[359]2.7837,[360]2.7790,[361]2.7730,[362]2.7670,[363]2.7619,[364]2.7601,[365]2.7560,[366]2.7535,[367]2.7487,[368]2.7430,[369]2.7386,[370]2.7365,[371]2.7324,[372]2.7324,[373]2.7316,[374]2.7334,[375]2.7305,[376]2.7268,[377]2.7233,[378]2.7215,[379]2.7225,[380]2.7178,[381]2.7150,[382]2.7119,[383]2.7146,[384]2.7210,[385]2.7258,[386]2.7335,[387]2.7382,[388]2.7439,[389]2.7517,[390]2.7543,[391]2.7476,[392]2.7421,[393]2.7359,[394]2.7353,[395]2.7298,[396]2.7253,[397]2.7194,[398]2.7130,[399]2.7079,[400]2.7023,[401]2.6961,[402]2.6906,[403]2.6844,[404]2.6780,[405]2.6727,[406]2.6663,[407]2.6604,[408]2.6543,[409]2.6497,[410]2.6440,[411]2.6389,[412]2.6351,[413]2.6317,[414]2.6300,[415]2.6274,[416]2.6250,[417]2.6201,[418]2.6147,[419]2.6202,[420]2.6160,[421]2.6141,[422]2.6161,[423]2.6136,[424]2.6093,[425]2.6059,[426]2.6036,[427]2.6019,[428]2.5990,[429]2.5947,[430]2.5914,[431]2.5926,[432]2.5890,[433]2.5852,[434]2.5821,[435]2.5788,[436]2.5740,[437]2.5688,[438]2.5648,[439]2.5640,[440]2.5608,[441]2.5589,[442]2.5548,[443]2.5603,[444]2.5678,[445]2.5659,[446]2.5650,[447]2.5673,[448]2.5690,[449]2.5750,[450]2.5764,[451]2.5785,[452]2.5825,[453]2.5899,[454]2.5952,[455]2.5982,[456]2.6037,[457]2.6025,[458]2.6064,[459]2.6089,[460]2.6154,[461]2.6214,[462]2.6247,[463]2.6249,[464]2.6238,[465]2.6232,[466]2.6280,[467]2.6276,[468]2.6250,[469]2.6305,[470]2.6321,[471]2.6348,[472]2.6381,[473]2.6400,[474]2.6417,[475]2.6438,[476]2.6465,[477]2.6500,[478]2.6525,[479]2.6550,[480]2.6571,[481]2.6606,[482]2.6627,[483]2.6658,[484]2.6632,[485]2.6676,[486]2.6698,[487]2.6760,[488]2.6812,[489]2.6869,[490]2.6865,[491]2.6921,[492]2.6970,[493]2.7009,[494]2.7056,[495]2.7110,[496]2.7111,[497]2.7126,[498]2.7146,[499]2.7171,[500]2.7204,[501]2.7217,[502]2.7235,[503]2.7285,[504]2.7339,[505]2.7348,[506]2.7351,[507]2.7368,[508]2.7405,[509]2.7463,
[510]2.7493,[511]2.7540,[512]2.7488,[513]2.7442,[514]2.7394,[515]2.7362,[516]2.7334,[517]2.7309,[518]2.7274,[519]2.7232,[520]2.7214,[521]2.7181,[522]2.7142,[523]2.7111,[524]2.7137,[525]2.7111,[526]2.7077,[527]2.7077,[528]2.7056,[529]2.7020,[530]2.6989,[531]2.6965,[532]2.6953,[533]2.6930,[534]2.6921,[535]2.6900,[536]2.6881,[537]2.6835,[538]2.6798,[539]2.6760,[540]2.6756,[541]2.6753,[542]2.6732,[543]2.6716,[544]2.6713,[545]2.6695,[546]2.6693,[547]2.6663,[548]2.6641,[549]2.6614,[550]2.6572,[551]2.6526,[552]2.6489,[553]2.6450,[554]2.6413,[555]2.6371,[556]2.6337,[557]2.6295,[558]2.6293,[559]2.6263,[560]2.6254,[561]2.6261,[562]2.6265,[563]2.6294,[564]2.6314,[565]2.6298,
llama_print_timings: load time = 568847.84 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 2025714.03 ms / 289280 tokens ( 7.00 ms per token, 142.80 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 2036661.57 ms / 289281 tokens
Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6298 +/- 0.01396
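Side note on reading that final line, for anyone following along: as far as I understand it, llama-perplexity reports exp(mean negative log-likelihood) over all tokens, and the +/- is the standard error of that mean pushed through exp() (delta method). A minimal sketch of that aggregation (hypothetical helper, synthetic NLL values, not the real run):

```python
import math

def aggregate_ppl(nlls):
    """Combine per-token negative log-likelihoods into (PPL, stderr).

    PPL = exp(mean NLL); the error term is the standard error of the
    mean NLL propagated through exp() via the delta method.
    """
    n = len(nlls)
    mean = sum(nlls) / n
    var = sum((x - mean) ** 2 for x in nlls) / (n - 1)  # sample variance
    sem = math.sqrt(var / n)                            # std. error of the mean
    ppl = math.exp(mean)
    return ppl, ppl * sem  # d/dx exp(x) = exp(x)

# Sanity check on the chunking above: 565 chunks of n_ctx=512 tokens
# is exactly the 289280 tokens in the timing output.
assert 565 * 512 == 289280
```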
Okay, I'll start imatrix on the BF16 and try to get out at least one quant today for easier further testing. lol, gonna take a while, as I've disabled all the fused ops and am using …
Okay, looking good on cpu-only in my testing. I commented on the Issue that some GGUFs are coming in now: https://huggingface.co/ubergarm/GLM-5-GGUF
UD-Q2_K_XL is working for me with six RTX 3090s + most of the model on the CPU.
Made an IQ1_S_R4 quant for 128G RAM & 24G VRAM. Using …
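For anyone sizing these low-bit quants against their hardware: weight bytes are roughly params × BPW / 8, so you can sanity-check whether a mix fits in RAM + VRAM before downloading. A rough back-of-envelope sketch (hypothetical helper; real GGUF files differ a bit due to per-tensor quant mixes and metadata):

```python
def quant_size_gib(params_b: float, bpw: float) -> float:
    """Approximate GGUF weight size in GiB from parameter count (billions)
    and average bits-per-weight."""
    return params_b * 1e9 * bpw / 8 / 2**30

# The BF16 load log above reports 753.864 B params at 16.003 BPW = 1404.406 GiB.
full = quant_size_gib(753.864, 16.003)   # ~1404 GiB

# A ~1.59 BPW quant of the same model lands near 140 GiB, which is why
# it can target a 128G RAM + 24G VRAM box.
tiny = quant_size_gib(753.864, 1.594)    # ~140 GiB
```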
smol-IQ1_KT, full offload on an RTX 3090. Everything is working.
You see the difference between … But yes, …
Thank you everybody for testing; I'll merge it.
It seems like a smart model, capable of …
I could probably be running it better, but it seemed like all the kv-cache went onto a single GPU. It locked up trying to start it using …

AMD Threadripper Pro (Zen 4) 7965WX 24-core, 8x32GiB DDR5@4800 (221.41 GB/s via mlc) + dual RTX A6000 (48GB VRAM each)
Driver: 580.105.08 CUDA: 13.0 P2P: OK NCCL found!

./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 131072 \
-ctk q8_0 \
-ger \
--merge-qkv \
-mla 3 -amb 2048 \
-ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
-ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \
--cpu-moe \
-ub 4096 -b 4096 \
--threads 24 \
--no-mmap \
--warmup-batch \
--n-predict 64
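The -ot overrides above work by matching tensor names against a regex and pinning matches to a device buffer. If you want to preview which tensors a pattern will catch before launching a long run, something like this reproduces the matching (my own sketch of the logic, not ik_llama.cpp's implementation; tensor names follow the blk.N.ffn_*_exps convention visible in the load logs):

```python
import re

# Same patterns as the -ot overrides above (CUDA0 / CUDA1 splits).
overrides = [
    (re.compile(r"blk\.(3|4|5|6|7|8|9|10|11)\.ffn_(gate|up|down)_exps.*"), "CUDA0"),
    (re.compile(r"blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*"), "CUDA1"),
]

def place(tensor_name: str, default: str = "CPU") -> str:
    """Return the device assigned by the first matching override, else the default."""
    for pat, dev in overrides:
        if pat.match(tensor_name):
            return dev
    return default

# place("blk.7.ffn_up_exps.weight")    -> "CUDA0"
# place("blk.70.ffn_down_exps.weight") -> "CUDA1"
# place("blk.40.ffn_up_exps.weight")   -> "CPU"  (left to --cpu-moe)
```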
AMD EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s NPS1 Single Socket (267.74 GB/s via mlc)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-sweep-bench \
--model "$model"\
--ctx-size 131072 \
-ctk q8_0 \
-ger \
--merge-qkv \
-mla 3 \
--threads 92 \
--threads-batch 128 \
-ub 4096 -b 4096 \
--no-mmap \
--numa numactl \
--warmup-batch \
--n-predict 64
Off topic: I had to tease Ling-2.5-1T about their A63B, haha...
Yeah, the only way out of it is 24-channel DDR5 (2x12 EPYC) ... [EDIT]: or, perhaps, 16x3090 haha
Threadripper PRO 3975wx, DDR4 (8x32) 3000 MT/s ("configured" frequency as reported by …).

How come the decode is significantly better than yours?

[EDIT]: Ah! I see! Someone recently asked me why I say that partial offload of an MLA MoE results in a drop in performance. Well, that's why.

I also ran the perplexity test for q8_0 kv with khad. It matches your results. (yours is 3.0217)

So you have two beefy GPUs with 48GB each, and you thought you could boost the decode by offloading some layers to the free VRAM of each GPU, right?

[EDIT2]: I see about zero difference in performance between …

[EDIT3]: …
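On the "why does partial offload drop decode" question: CPU-side decode is roughly memory-bandwidth-bound, so a crude ceiling is bandwidth divided by the bytes of active weights read per token. A back-of-envelope sketch using the mlc numbers quoted in this thread (very rough, and the 4.0 BPW figure is just my assumed Q4-ish mix; it ignores KV-cache reads, MoE expert locality, and CPU/GPU overlap):

```python
def decode_ceiling_tps(bw_gb_s: float, active_params_b: float, bpw: float) -> float:
    """Upper-bound tokens/s if every active weight byte crosses memory once per token."""
    bytes_per_token = active_params_b * 1e9 * bpw / 8
    return bw_gb_s * 1e9 / bytes_per_token

# GLM-5 is ~40 B active params per token (744B.A40B in the load log).
# Threadripper box: 221.41 GB/s; EPYC box: 267.74 GB/s (both via mlc).
tr = decode_ceiling_tps(221.41, 40, 4.0)   # ~11 t/s ceiling
ep = decode_ceiling_tps(267.74, 40, 4.0)   # ~13 t/s ceiling
```

Actual numbers land below these ceilings, but the ratio between the two machines tracks the bandwidth ratio pretty closely.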
Hello! Could you share that quant on Hugging Face? I would really like to try it. Thank you!
Apparently it's a garbage LLM (see the test from xCreate on YouTube). Not worth investing time in.
@InfernalDread I uploaded it to https://huggingface.co/sokann/GLM-5-GGUF-1.594bpw, no README yet 😅
Thank you very much! Can't wait to try it out and see if this model can still pack a punch with this much quantization! |

As in the mainline PR, there is no DSA and no MTP.
It just reuses the DeepSeek2 arch.
Closes #1265