Glm 4.5 #662
Conversation
Thireus commented Jul 29, 2025
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
Manually-specified variables were not used by the project:
GGML_BACKEND_DL
GGML_CPU
(╯°□°)╯︵ ┻━┻
…F16=1 -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
Bump to latest ik_llama.cpp
Bump ik_llama to 8ffad18
Check if ffn_up and ffn_gate are of the same type before using fmoe
This reverts commit ff0c368.
Did this work? Was the main issue the fact that this PR pulls in unrelated stuff?
@saood06, there are more changes to be made, and this was meant to be a pull request on my fork until tested and working.
Thanks for looking into this one, as I've heard some good early reports from folks at TheBeaverAIClub Discord. I saw one tip on the mainline llama.cpp PR, linked to the SmallThinker PR, where possibly converting the safetensors to …
@ubergarm @saood06 - I have implemented the llama.cpp PR here: https://github.com/Thireus/ik_llama.cpp/tree/glm-4.5, but it has a known issue: it talks nonsense after some time. See: https://github.com/Thireus/ik_llama.cpp/releases/tag/glm-4.5-b4021-83d2bb3
Thanks, I'm going to hold off on GLM4.5 for now given the mainline lcpp PR seems to still be having issues. In the meantime you can keep busy with these: https://www.deepcogito.com/research/cogito-v2-preview 😆 😭 so many models this week!!!
@ubergarm - yeah, I'm going to hold off for now; there are too many models and I still haven't finished calibrating the ones I already sharded. I got myself 2x RTX PRO 6000, so I already have new toys to play with. :D
@ubergarm, @saood06 - I think it's working fine, actually. The issue may have been my broken prompt. Could you guys check in the web UI once the download has finished? Make sure you use https://github.com/Thireus/ik_llama.cpp/tree/83d2bb3e2d8a0a630f77b225515a52c48d4fe16b (there is also a release build for Windows). Also tested on coding abilities and seeing no issues so far. More examples: ggml-org/llama.cpp#14939 (comment)
@ubergarm, would you be able to produce an imatrix for https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main please (not the Q8_0) 🙏🙏🙏?
Hi @Thireus, I notice you're using …

Hi @ddh0, that's a command-line parameter I often use when dealing with DeepSeek models and forgot to remove; nevertheless, you will see it gets disabled when you launch llama-server:
It'd be nice if I could get confirmation from someone else that I'm not hallucinating anything here and that the model indeed produces good answers.
Thank you very much. I am testing it now with the Q6_K from your example. No broken responses so far with a few prompts in the llama.cpp webui and in Cline, with create, edit, and command-line use.
I see some more chatter on the mainline lcpp PR linking a Hugging Face issue regarding … Does that mean your …? But you have already done that step and uploaded converted GGUF BF16s here: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main, which can be quantized and run using your fork? So you want me to download that BF16 GGUF and run imatrix on it (without converting to Q8_0 first), if I understand correctly? Just catching up slowly and drinking my coffee, so much action this week lol.
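For reference, a hedged sketch (not a command from this thread) of how the safetensors-to-BF16-GGUF step is typically done with mainline llama.cpp's convert_hf_to_gguf.py; the local paths are placeholders, and it assumes a converter that already includes GLM-4.5 support:

```bash
# Hedged sketch: convert the original GLM-4.5 safetensors into a BF16 GGUF using
# mainline llama.cpp's converter. Paths are placeholders, not from this thread.
python3 convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile /mnt/data/models/GLM-4.5-BF16.gguf \
    /mnt/data/models/zai-org/GLM-4.5
```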
@ubergarm, yes please use the BF16 I have uploaded for computing the imatrix. You will need to run ulimit -n 9999 so your OS lifts the max open files limit, in the same terminal in which you run the llama command that computes the imatrix. Thank you so much!
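A minimal sketch of that open-files step (not taken verbatim from the thread), assuming the split BF16 upload spans many shard .gguf files that all get opened at once:

```bash
# Raise the per-process open-files soft limit; this only affects the current
# shell session, so run the llama command from this same shell afterwards.
ulimit -n 9999
ulimit -n        # print the limit to confirm it took effect
```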
Okay, I'm downloading your BF16 and will try to get your fork going. Is there a different PR to add this architecture here, given this one seems closed?
I'll need to add a new PR with clean code only for the changes relevant to this model.
Okay, so I have a Q8_0 up and running now for testing with some folks, here is how I got there:

```bash
git clone [email protected]:Thireus/ik_llama.cpp.git
cd ik_llama.cpp
git checkout 83d2bb3e2d8a0a630f77b225515a52c48d4fe16b
# compile as usual, CPU-only in my case
# quantize a Q8_0 from this BF16 GGUF: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main
$ cat myscripts/quantize-GLM-4.5-v01.sh
#!/usr/bin/env bash
ulimit -n 9999
#numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--pure \
/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
/mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf \
Q8_0 \
192
# run llama-server
$ cat myscripts/api-server-GLM-4.5.sh
#!/usr/bin/env bash
ulimit -n 9999
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
numactl -N 0 -m 0 \
./build/bin/llama-server \
--model "$model" \
--alias Thireus/GLM-4.5-Thireus-Q8_0.gguf \
--ctx-size 196608 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
--parallel 3 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
```

Next I'll try to make an imatrix with this, from the BF16 GGUF directly:

```bash
#!/usr/bin/env bash
ulimit -n 9999
# echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf
#Only the best for Thireus, don't use Q8_0 haha
#model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
--verbosity 1 \
--layer-similarity \
--seed 1337 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
```

```
...
save_imatrix: entry ' blk.48.ffn_up_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: warning: storing only 1000 out of 1012 entries
save_imatrix: stored collected data after 10 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[10]37447.5358,[11]38232.5777,[12]39631.3289,[13]41582.0199,[14]44141.9806,[15]43651.3243,
```

EDIT: hrmm, those perplexities are looking super high for the imatrix... I'll restart it and try again using
Yes, I don't know what is up with llama-perplexity; I've seen the same. Could be fa, please let us know.
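A hedged A/B sketch (not from the thread, and assuming "fa" refers to the -fa flash-attention flag used in the server script above): run the same perplexity measurement on the Q8_0 with and without -fa to see whether the suspiciously high values track that code path.

```bash
# Hedged sketch: compare llama-perplexity with and without -fa on the same quant.
# The text file is reused from the imatrix run; any plain-text corpus would do.
ulimit -n 9999
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
./build/bin/llama-perplexity -m "$model" -f ubergarm-imatrix-calibration-corpus-v02.txt \
    --ctx-size 512 --threads 128          # baseline, flash attention off
./build/bin/llama-perplexity -m "$model" -f ubergarm-imatrix-calibration-corpus-v02.txt \
    --ctx-size 512 --threads 128 -fa      # same run with -fa enabled
```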
Yes. Did you remove …? I'll let this finish running and upload the resulting imatrix.dat, but I'm not releasing any quants until we have a PR that is looking good and some more testing. Thanks!
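For completeness, a hedged sketch (not a command from this thread) of how the uploaded imatrix.dat would typically be fed back into llama-quantize via --imatrix when a release quant is eventually made; the output name and IQ4_XS target type are illustrative only.

```bash
# Hedged sketch: quantize from the BF16 shards using the computed imatrix.
# Output filename and the IQ4_XS target type are illustrative, not from the thread.
ulimit -n 9999
./build/bin/llama-quantize \
    --imatrix /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
    /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
    /mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-IQ4_XS.gguf \
    IQ4_XS \
    192
```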
Yeah, I think the issue regarding the ik_llama.cpp fork version is that we want to remove the Vcur reshaping in llama.cpp -> build_glm4_moe() to fix the non…

```cpp
// reshape for multi-head
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens); // <--- delete this line
```

This has nothing to do with any issues possibly still going on with mainline, and I'm not really sure yet what is different between your implementation and mainline's either.
Thank you, I'll give it a go; you are far more knowledgeable than me in this domain, as I don't know what Vcur is. Edit: I can confirm this fix works.
Suggested by @ubergarm - ikawrakow#662 (comment)
I suggest we move this conversation over to #668 |