
Conversation

@Thireus
Contributor

@Thireus Thireus commented Jul 29, 2025

Thireus and others added 30 commits June 2, 2025 20:09
   Manually-specified variables were not used by the project:

    GGML_BACKEND_DL
    GGML_CPU
(╯°□°)╯︵ ┻━┻
-DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
Bump to latest ik_llama.cpp
Check if ffn_up and ffn_gate are of the same type before using fmoe
@Thireus Thireus closed this Jul 29, 2025
@saood06
Collaborator

saood06 commented Jul 30, 2025

@Thireus

Did this work? Was the main issue the fact that this PR pulls in unrelated stuff?

@Thireus
Contributor Author

Thireus commented Jul 30, 2025

@saood06, there are more changes to be made, and this was meant to be a pull request against my fork until it was tested and working.

@ubergarm
Contributor

@Thireus

Thanks for looking into this one, as I've heard some good early reports from folks at TheBeaverAIClub discord.

I saw one tip on the mainline llama.cpp PR (linked to the SmallThinker PR) that converting the safetensors to an fp16 GGUF instead of the usual bf16 might possibly fix something, this despite the original safetensors supposedly being bf16. But that's total speculation. I'm downloading GLM-4.5-Air now and hope to quantize it eventually.
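
For reference, that conversion would look something like this (a rough sketch assuming mainline's convert_hf_to_gguf.py and a local safetensors checkout; paths and filenames are illustrative):

# convert the original safetensors to an fp16 GGUF instead of bf16
python convert_hf_to_gguf.py /models/GLM-4.5-Air \
  --outtype f16 \
  --outfile /models/GLM-4.5-Air-F16.gguf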

@Thireus
Contributor Author

Thireus commented Jul 31, 2025

@ubergarm @saood06 - I have implemented the llama.cpp PR here: https://github.com/Thireus/ik_llama.cpp/tree/glm-4.5 but it has a known issue: it starts talking nonsense after some time. See: https://github.com/Thireus/ik_llama.cpp/releases/tag/glm-4.5-b4021-83d2bb3

@ubergarm
Contributor

@Thireus

Thanks, I'm going to hold off on GLM4.5 for now given the mainline lcpp PR seems to still be having issues.

In the meantime you can keep busy with these: https://www.deepcogito.com/research/cogito-v2-preview

😆 😭 so many models this week!!!

@Thireus
Contributor Author

Thireus commented Jul 31, 2025

@ubergarm - yeah, I'm going to hold off for now; there are too many models and I still haven't finished calibrating the ones I've already sharded. I got myself 2x RTX PRO 6000, so I already have new toys to play with. :D

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

@ubergarm, @saood06 - I think it's working fine actually. The issue may have been my broken prompt.

Could you guys check in the web UI?

git clone https://github.com/Thireus/GGUF-Tool-Suite
cd GGUF-Tool-Suite
# Make sure to copy the relevant download.conf for the model before running quant_assign.py
rm -f download.conf
# Use the download.conf of the chosen model
cp -f models/GLM-4.5/download.conf .
echo ".*=q6_K" > q6_K.recipe
mkdir -p kitchen && cd kitchen
../quant_downloader.sh ../q6_K.recipe
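
Optionally, you can check that all shards landed before launching the server (the expected count comes from the -of-01762 suffix in the shard filenames):

# still inside kitchen/: all 1762 shards should be present
ls GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-*-of-01762.gguf | wc -l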

Once the download has finished:

ulimit -n 9999
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 ~/ik_llama-glm-4.5-b4021-83d2bb3-bin-win-cuda-12.8-x64-avx512/llama-server -m GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf  -mla 3 -fa \
  -amb 1024 \
  -fmoe \
  -ctk f16 \
  -c 110592 \
  -ngl 99 \
  -ot "blk\.([0-9]|1[0-9]|2[0-3])\.ffn_.*=CUDA0" -ot "blk\.(2[4-9]|28|29|3[0-9]|4[0-5])\.ffn_.*=CUDA1" -ot "blk\.(4[6-9])\.ffn_.*=CUDA2" --override-tensor exps=CPU \
  -ot exps=CPU \
  -b 4096 -ub 4096 \
  --warmup-batch \
  --no-mmap \
  --threads 36 \
  --main-gpu 0

Make sure you use https://github.com/Thireus/ik_llama.cpp/tree/83d2bb3e2d8a0a630f77b225515a52c48d4fe16b (there is also a release build for Windows)
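
If it's easier than the web UI, a quick curl against the server's OpenAI-compatible endpoint should also work (rough sketch; the port and model name below are assumptions, since the command above just uses the defaults):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.5", "messages": [{"role": "user", "content": "What is the solution of x+5=-2?"}]}'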

Example of output:

What is the solution of x+5=-2?

Answer:

<think>I need to solve the equation \( x + 5 = -2 \) for \( x \).

The goal is to isolate \( x \) on one side of the equation. To do that, I should get rid of the +5 that's with the \( x \). The opposite of adding 5 is subtracting 5. So, if I subtract 5 from both sides of the equation, that should cancel out the +5 on the left.

Let me write that down:

Starting with:

\( x + 5 = -2 \)

Subtract 5 from both sides:

\( x + 5 - 5 = -2 - 5 \)

Now, simplify both sides:

On the left, +5 and -5 cancel each other out, so I just have \( x \).

On the right, -2 minus 5. Since both are negative, I can think of it as adding the absolute values and keeping the negative sign. So, -2 - 5 = -7.

Therefore, \( x = -7 \).

I should check if this is correct by plugging it back into the original equation.

If \( x = -7 \), then \( x + 5 = -7 + 5 \).

-7 + 5 is the same as 5 - 7, which is -2. And -2 equals the right side of the equation, which is -2. Perfect, it checks out.

I can also think about it in terms of a number line. If I start at x and add 5 to get to -2, that means x must be 5 units to the left of -2 on the number line. So, from -2, moving left 5 units: -2 to -3 is one, to -4 is two, to -5 is three, to -6 is four, to -7 is five. So, x = -7.

Another way is to use the property of equality. Whatever you do to one side, you must do to the other to keep it balanced. So, subtracting 5 from both sides maintains the equality.

I think I've confirmed it multiple ways. So, the solution is x = -7.</think>To solve the equation \(x + 5 = -2\) for \(x\), follow these steps:

1. **Isolate \(x\)** by subtracting 5 from both sides of the equation to eliminate the +5 on the left side:
   \[
   x + 5 - 5 = -2 - 5
   \]

2. **Simplify both sides**:
   - On the left, \(+5\) and \(-5\) cancel out, leaving \(x\).
   - On the right, \(-2 - 5 = -7\).
   \[
   x = -7
   \]

3. **Verify the solution** by substituting \(x = -7\) back into the original equation:
   \[
   (-7) + 5 = -2
   \]
   Simplifying the left side: \(-7 + 5 = -2\), which matches the right side of the equation (\(-2\)). This confirms the solution is correct.

**Solution:** \(x = -7\)

I've also tested its coding abilities and am seeing no issues so far. More examples: ggml-org/llama.cpp#14939 (comment)

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

@ubergarm, would you be able to produce an imatrix for https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main please (not the Q8_0) 🙏🙏🙏?

@ddh0

ddh0 commented Aug 1, 2025

Hi @Thireus, I notice you're using -mla 3, could you please explain why? As far as I'm aware this model doesn't use MLA, but maybe this is a clue to getting it working on mainline?

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

Hi @Thireus, I notice you're using -mla 3, could you please explain why? As far as I'm aware this model doesn't use MLA, but maybe this is a clue to getting it working on mainline?

Hi @ddh0, it's a command-line parameter that I often use when dealing with DeepSeek models and forgot to remove. Nevertheless, you will see it gets disabled when you launch llama-server:

...
=====================================================================
 MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
=====================================================================
...

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

It'd be nice if I could get confirmation from someone else that I'm not hallucinating anything here and that the model indeed produces good answers.

@kirnat

kirnat commented Aug 1, 2025

Thank you very much. I am testing it now with the q6_K from your example.

No broken responses so far after a few prompts in the llama.cpp web UI and in Cline with create, edit, and command-line use.

./ik_llama.cpp/build/bin/llama-server -m ~/models/GGUF-Tool-Suite/kitchen/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
  -fa \
  -amb 1024 \
  -fmoe \
  -ngl 99 \
  -ot "blk\.([0-9]|1[0-8])\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  -b 4096 -ub 4096 \
  -c 65536 \
  --temp 0.6 \
  --top-p 1.0 \
  --no-mmap \
  --alias GLM-4.5 \
  --threads 52 \
  --host 0.0.0.0 \
  --port 8080

@ubergarm
Contributor

ubergarm commented Aug 1, 2025

@Thireus

I see some more chatter on the mainline lcpp PR linking a Hugging Face discussion regarding the <think> special token issue: https://huggingface.co/zai-org/GLM-4.5/discussions/9

Does that mean the hf_convert_to_gguf.py in your repo here: https://github.com/Thireus/ik_llama.cpp/tree/83d2bb3e2d8a0a630f77b225515a52c48d4fe16b will work on https://huggingface.co/zai-org/GLM-4.5 to produce BF16 GGUFs with the latest tokenizer fixes from the mainline PR?

But you have already done that step and uploaded the converted BF16 GGUFs here: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main, which can be quantized and run using your fork?

So you want me to download that bf16 gguf and run imatrix on it (without converting to Q8_0 first) if I understand correctly?

Just catching up slowly and drinking my coffee, so much action this week lol.

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

@ubergarm, yes please use the BF16 I have uploaded for computing the imatrix. You will need to run ulimit -n 9999 so your OS lifts the max open-files limit, in the same terminal where you run the llama command that computes the imatrix. Thank you so much!
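
For example, in the same shell session (minimal sketch):

ulimit -n        # check the current open-files limit
ulimit -n 9999   # raise it before launching llama-imatrix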

@ubergarm
Contributor

ubergarm commented Aug 1, 2025

@Thireus

Okay, I'm downloading your BF16 and will try to get your fork going. Is there a different PR to add this architecture here, given this one seems closed?

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

I'll need to open a new PR with clean code containing only the changes relevant to this model.

@ubergarm
Contributor

ubergarm commented Aug 1, 2025

Okay, so I have a Q8_0 up and running now for testing with some folks, here is how I got there:

git clone git@github.com:Thireus/ik_llama.cpp.git
cd ik_llama.cpp
git checkout 83d2bb3e2d8a0a630f77b225515a52c48d4fe16b

# compile as usual, CPU-only in my case

# quantize a Q8_0 from this BF16 GGUF: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main
$ cat myscripts/quantize-GLM-4.5-v01.sh
#!/usr/bin/env bash

ulimit -n 9999

#numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --pure \
    /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
    /mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf \
    Q8_0 \
    192

# run llama-server
$ cat myscripts/api-server-GLM-4.5.sh
#!/usr/bin/env bash

ulimit -n 9999

model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf

numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model "$model"\
    --alias Thireus/GLM-4.5-Thireus-Q8_0.gguf \
    --ctx-size 196608 \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 3 \
    --threads 128 \
    --threads-batch 192 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap

Next I'll try to make the imatrix directly from the bf16 GGUF with this:

#!/usr/bin/env bash


ulimit -n 9999

# echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf

#Only the best for Thireus, don't use Q8_0 haha
#model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf

numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
    -m "$model" \
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
    --verbosity 1 \
    --layer-similarity \
    --seed 1337 \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 128 \
    --threads-batch 192 \
    --no-mmap

...

save_imatrix: entry '               blk.48.ffn_up_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: warning: storing only 1000 out of 1012 entries

save_imatrix: stored collected data after 10 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[10]37447.5358,[11]38232.5777,[12]39631.3289,[13]41582.0199,[14]44141.9806,[15]43651.3243,

EDIT: hrmm, those perplexities are looking super high for the imatrix... I'll restart it and try again using -fa this time...

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

Yes, I don't know what is up with llama-perplexity; I've seen the same. Could be -fa, please let us know.

@ubergarm
Contributor

ubergarm commented Aug 1, 2025

@Thireus

Yes, -fa is required, which possibly points to this: #565 (comment)

Did you remove the Vcur reshaping from the mainline lcpp implementation? I'll have to try diffing your fork to see, or I can look after you open a PR here.
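
For reference, one way to do that diff locally could be something like this (a rough sketch; the upstream remote, branch name, and layout are assumptions):

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
git remote add thireus https://github.com/Thireus/ik_llama.cpp
git fetch thireus
# 83d2bb3e... is the commit Thireus linked above
git diff main...83d2bb3e2d8a0a630f77b225515a52c48d4fe16b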

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 901.12 ms
compute_imatrix: computing over 814 chunks with batch_size 512
compute_imatrix: 10.08 seconds per pass - ETA 2 hours 16.72 minutes
[1]16.8092,[2]6.7403,[3]4.3630,[4]3.1866,[5]2.5865,[6]2.2122,[7]1.9898,[8]1.8430,[9]1.8314,
save_imatrix: entry '             blk.92.ffn_gate_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
...
save_imatrix: entry '               blk.48.ffn_up_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**

save_imatrix: stored collected data after 10 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[10]1.7538,[11]1.8628,[12]1.9497,[13]2.0128,

I'll let this finish running and upload the resulting imatrix.dat, but I'm not releasing any quants until we have a PR that is looking good and some more testing. Thanks!

@ubergarm
Contributor

ubergarm commented Aug 1, 2025

@Thireus

Yeah, I think the issue with the ik_llama.cpp fork version is that we want to remove the Vcur reshaping in llama.cpp -> build_glm4_moe() to fix the path without -fa, e.g.:

                // reshape for multi-head
                Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
                Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
                Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens); // <--- delete this line

This has nothing to do with any issues possibly still going on in mainline, and I'm not really sure yet what is different between your implementation and mainline's either.
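
For reference, with that line deleted the reshape block would read as below (a sketch; the surrounding build_glm4_moe() code is assumed unchanged, leaving Vcur un-reshaped for the non -fa KV path):

                // reshape for multi-head
                Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
                Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
                // Vcur is intentionally left as-is here; the non -fa KV path handles it without the extra reshape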

@Thireus
Contributor Author

Thireus commented Aug 1, 2025

Thank you, I'll give it a go. You are far more knowledgeable than me in this domain, as I don't know what Vcur is.

Edit: I can confirm this fix works.

Thireus added a commit to Thireus/ik_llama.cpp that referenced this pull request Aug 1, 2025
@Thireus Thireus mentioned this pull request Aug 1, 2025
@Thireus
Contributor Author

Thireus commented Aug 1, 2025

I suggest we move this conversation over to #668
