
support Glm4MoeLite #18936

Merged
ngxson merged 9 commits into ggml-org:master from ddh0:glm4moelite
Jan 19, 2026

Conversation

@ddh0
Contributor

@ddh0 ddh0 commented Jan 19, 2026

Support for Glm4MoeLiteForCausalLM, which seems to be just a renamed version of DeepseekV3 with some code moved around. Thanks to @ngxson for the help.
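
As a quick post-conversion sanity check (purely illustrative, not part of this PR), the gguf-py reader can print the architecture string the converter wrote into a GGUF. The file name below is a placeholder, and the exact string-decoding helpers vary a bit between gguf-py versions.

# Sketch: print the architecture string stored in a converted GGUF.
# Placeholder file name; string decoding follows the pattern used by gguf_dump.
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("GLM-4.7-Flash-BF16-00001-of-00002.gguf")
arch = reader.fields["general.architecture"]
print(bytes(arch.parts[-1]).decode("utf-8"))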

ref:

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Would close #18931

@github-actions github-actions bot added the python (python script changes) label Jan 19, 2026
@fizzAI

fizzAI commented Jan 19, 2026

It's actually a renamed version of GLM4Moe with DeepseekV3Attention (which uses MLA) and an added dense expert at the start, but I'm not sure if any of that is relevant here.

@ngxson
Collaborator

ngxson commented Jan 19, 2026

with DeepseekV3Attention (uses MLA) and an added dense expert at the start

DeepseekV3 also has some dense layers at the start, so unless I missed something, it seems like just DeepSeek renamed (not GLM4Moe renamed).

@fizzAI

fizzAI commented Jan 19, 2026

Oh, I didn't know, my bad. I guess just the logic for picking dense layers is different. In that case, yeah, from the GGUF perspective it's just renamed DeepSeek.
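
For anyone who wants to check the architecture claims above themselves, one option is to diff the MLA/MoE-related hyperparameters in the two HF configs. The sketch below is illustrative only: it assumes the Glm4MoeLite config reuses DeepSeek-V3's field names (kv_lora_rank, first_k_dense_replace, and so on), and both paths are placeholders.

# Illustrative sketch: compare MLA/MoE-related fields between a local
# Glm4MoeLite (GLM-4.7-Flash) config.json and a DeepSeek-V3 config.json.
# Field names follow DeepSeek-V3's config and are assumed to carry over.
import json
from pathlib import Path

FIELDS = [
    "architectures", "num_hidden_layers", "first_k_dense_replace",
    "n_routed_experts", "num_experts_per_tok", "moe_intermediate_size",
    "q_lora_rank", "kv_lora_rank", "qk_rope_head_dim", "v_head_dim",
]

glm = json.loads(Path("GLM-4.7-Flash/config.json").read_text())    # placeholder path
dsv3 = json.loads(Path("DeepSeek-V3/config.json").read_text())     # placeholder path

for key in FIELDS:
    print(f"{key:24} glm={glm.get(key)!r}  dsv3={dsv3.get(key)!r}")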

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Freshly converted, imatrix'd from Q8_0, and quantized to IQ4_XS. Output is coherent for multi-turn. PPL for imatrix is a bit high at ~15 - 17 but I think this is correct now. Should be ready for review.

@ddh0 ddh0 marked this pull request as ready for review January 19, 2026 18:04
@ddh0 ddh0 requested a review from CISC as a code owner January 19, 2026 18:04
@ubergarm
Contributor

Output is coherent for multi-turn. PPL for imatrix is a bit high at ~15 - 17 but I think this is correct now. Should be ready for review.

Thanks for sharing this quickly! I can confirm that I converted the safetensors to bf16 GGUF and it runs on 2x RTX A6000's with coherent multi-turn chat.

imatrix with my usual corpus on the bf16 ended up at Final estimate: PPL = 7.1406 +/- 0.04441, and when running your https://huggingface.co/ddh0/imatrices/raw/main/ddh0_imat_calibration_data_v2.txt on the bf16 I get results comparable to your Q8_0: Final estimate: PPL = 15.5197 +/- 0.33298

👈 convert command
$ python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/ \
    /mnt/raid/models/zai-org/GLM-4.7-Flash/
👈 imatrix command
model=/mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-64x2.6B-BF16-00001-of-00002.gguf

    #-f ddh0_imat_calibration_data_v2.txt \

./build/bin/llama-imatrix \
    --model "$model"\
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/imatrix-GLM-4.7-Flash-BF16.dat \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    -fit off \
    -ngl 99 \
    -ts 40,48 \
    --threads 1 \
    --no-mmap \
    --output-format dat

The only odd thing I saw was this warning disabling flash attention even with -fit off, hrmm:

llama_context: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

I assume flash attention being disabled is due to MLA? Not sure

Collaborator

@ngxson ngxson left a comment

Nice 🚀 (can be merged once @CISC approves)

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

I don't have permissions to merge it myself, can @ngxson or @CISC hit the button for me please?

Thanks both of you for your help! It's easy to add new models when they're just renamed existing models :)

@CISC
Collaborator

CISC commented Jan 19, 2026

I don't have permissions to merge it myself, can @ngxson or @CISC hit the button for me please?

Will do once CI finishes.

BTW, we should consider including the MTP layer in DeepSeek models soon.

@CISC CISC linked an issue Jan 19, 2026 that may be closed by this pull request
@EvilFreelancer
Contributor

I suspect a glm4 tokenizer mismatch in GLM-4.7-Flash, because its vocab is 154,856 vs GLM-4.5-Air's 151,365, i.e. +3,491 tokens, and some special token IDs don't match.

The GLM-4.7-Flash model converted under the glm4 identifier goes into infinite generation at inference time (likely the EOS/stop IDs don't match, so stop conditions never trigger) and also doesn't emit/recognize the tag, which I guess points to incorrect special-token mapping.

Probably needs some additional testing.
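
A quick way to check this claim is to load both tokenizers with transformers and compare vocab sizes and special tokens side by side. A minimal sketch, assuming both repos load via AutoTokenizer and using the public HF model IDs:

# Sketch: compare vocab size and special tokens of the two tokenizers.
# Adjust model IDs/paths as needed.
from transformers import AutoTokenizer

for model_id in ("zai-org/GLM-4.7-Flash", "zai-org/GLM-4.5-Air"):
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    print(model_id)
    print("  vocab size:", len(tok))
    print("  eos token :", tok.eos_token, tok.eos_token_id)
    print("  additional special tokens:", tok.additional_special_tokens[:8], "...")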

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

The glm4 identifier is only the pre-tokenizer; the actual tokenizer will work just like for any other model, as far as I understand. LMK if you find a concrete issue.

@Aaryan-Kapoor
Contributor

This might need additional testing.

Following up with @EvilFreelancer's concern - this might be a real bug. See: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/discussions/2

GLM-4.7-Flash's <|user|> token is declared as EOS in the original tokenizer config (to stop generation at end of assistant turn), but the GGUF only sets <|endoftext|> as EOS. The model doesn't stop at <|user|>, so it generates past its turn and loops into self-conversation:

Wait - looking at `User` (the first line of my internal thought block)...
Ah, I see. In this specific simulation/turn-based interface:
1. User says "hello"
2. Model responds with weirdness
3. **Current Turn**: The user is sending the exact same command to a new model instance or just repeating it because they didn't get an answer?
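
If this diagnosis is right, a client-side workaround (independent of any GGUF metadata fix) is to pass <|user|> as an explicit stop string. A hedged sketch against llama-server's OpenAI-compatible endpoint, with placeholder port and model name:

# Sketch of a client-side workaround: ask the server to stop at <|user|> even
# if the GGUF's EOS metadata doesn't include it. Port/model are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": "hello"}],
        "stop": ["<|user|>"],  # extra stop string in addition to EOS
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])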

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

This might need additional testing.

Following up with @EvilFreelancer's concern - this might be a real bug. See: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/discussions/2

GLM-4.7-Flash's <|user|> token is declared as EOS in the original tokenizer config (to stop generation at end of assistant turn), but the GGUF only sets <|endoftext|> as EOS. The model doesn't stop at <|user|>, so it generates past its turn and loops into self-conversation:

Wait - looking at `User` (the first line of my internal thought block)...
Ah, I see. In this specific simulation/turn-based interface:
1. User says "hello"
2. Model responds with weirdness
3. **Current Turn**: The user is sending the exact same command to a new model instance or just repeating it because they didn't get an answer?

Does the issue occur with GGUFs generated with this PR? I did not upload those other ones, and cannot vouch for them.

@EvilFreelancer
Contributor

@ddh0 you may try https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

@ddh0 you may try https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF

You misunderstand. I'm saying that you should not test with those GGUFs since they were not made with this PR. The issue might be related, or it might not. I would like to narrow it down, so please try with GGUFs generated with this PR.

@arch-btw
Contributor

@ngxson I don't see it in the Jinja.

@ngxson
Collaborator

ngxson commented Jan 19, 2026

Then it should be irrelevant if the chat template doesn't use it.

@ngxson
Collaborator

ngxson commented Jan 19, 2026

The new GGUF should work now:

(screenshot)

@noctrex
Contributor

noctrex commented Jan 19, 2026

Confirmed, it works now.

@ddh0 ddh0 marked this pull request as ready for review January 19, 2026 21:56
@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Confirmed working with the latest changes (though, it was also working fine for me before). Marked as ready for review again.

@EvilFreelancer
Contributor

Confirmed, let's merge :)

@ngxson ngxson merged commit 1706a6d into ggml-org:master Jan 19, 2026
6 checks passed
@CISC
Collaborator

CISC commented Jan 19, 2026

💯 Nice work everyone! 🎉

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Odd, because this doesn't happen when running big Deepseek with MLA. Perhaps there aren't CUDA flash MLA kernels configured for these tensor dimensions or something?

For the record @DocShotgun, flash attention is working fine for me. Not sure what was going on earlier, but it seems OK now. Feel free to open a new issue and link it here if there are still problems.

Never mind, it does seem that some operations are not supported on CUDA... this problem is above my pay grade at this point; it will need to be addressed in another PR :(

@ggerganov
Member

Quick question about interleaved/preserved thinking:
If I pass the kwarg "clear_thinking": false to the chat template, would this work to enable "preserved" thinking? Do we need changes on the client side as well? If yes, should we support this in the llama-server WebUI?

cc @ngxson @aldehir @pwilkin

@ngxson
Collaborator

ngxson commented Jan 20, 2026

@ggerganov there was a discussion around this: #18368

A webui change is needed because the reasoning_content must be put back into the message object as a dedicated field.

But as mentioned in the discussion, it can be a bit tricky, as not all models support this. We can probably extend /props to reflect whether the model supports putting the reasoning back into the message, and show a "Preserve reasoning" option in the webui.
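
For context, a sketch of what such a client-side change could look like. This assumes llama-server forwards a per-request chat_template_kwargs object to the template and that the template actually reads reasoning_content; whether a given model/template does is exactly what the proposed /props flag would advertise.

# Sketch: a client that preserves reasoning by (a) passing clear_thinking=false
# to the template and (b) putting reasoning_content back on prior assistant
# messages. Field support varies by model/template; treat this as an assumption.
import requests

payload = {
    "model": "glm-4.7-flash",
    "chat_template_kwargs": {"clear_thinking": False},
    "messages": [
        {"role": "user", "content": 'Say "hi"'},
        {
            "role": "assistant",
            "content": "Hello! How can I help you today?",
            "reasoning_content": "The user wants a greeting.",
        },
        {"role": "user", "content": 'Say "bye"'},
    ],
}
resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"])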

@ggerganov
Member

@ngxson Btw, regarding #18936 (comment), I think there is still an issue. Consider this short conversation consisting of 2 messages:

(screenshot of the two-message conversation)

When I dump the last prompt from the second message, I see this:

curl "http://127.0.0.1:8013/slots?model=glm-4.7-flash-hf" | jq '.[2]'

{
  ...
  "prompt": "[gMASK]<sop><|user|>Say \"hi\"<|assistant|></think>Hello! How can I help you today?<|user|>Say \"bye\"<|assistant|><think>",
  ...                                                     ^ problem here?
}

Notice that there isn't a corresponding <think> token for the assistant response of the first message. This looks strange - not sure if it is necessarily wrong, but to me it looks bogus.

@ngxson
Copy link
Collaborator

ngxson commented Jan 20, 2026

It's actually a feature of the chat template (I know that looks strange, but I was surprised too):

{%- if add_generation_prompt -%}
    <|assistant|>{{- '</think>' if (enable_thinking is defined and not enable_thinking) else '<think>' -}}
{%- endif -%}

For ref, chat template of GLM 4.6 doesn't do that:

{%- if add_generation_prompt -%}
    <|assistant|>{{- '\n<think></think>' if (enable_thinking is defined and not enable_thinking) else '' -}}
{%- endif -%}

@ngxson
Collaborator

ngxson commented Jan 20, 2026

Note that for messages that are not the last one, they also skip the opening tag <think>:

{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
{{ '<think>' + reasoning_content.strip() +  '</think>'}}
{%- else -%}
{{ '</think>' }}
{%- endif -%}

@ggerganov
Member

Note that for messages that are not the last one, they also skip the opening tag <think>:

Hm yes, looks like it is intentional. Weird.

Anyway, I encountered a situation where, during a longer conversation, the model didn't close its reasoning with </think> before generating the final response, so I started looking into why that could be. Thought this might be related, but maybe there is something else going on. Thanks.

@noctrex
Contributor

noctrex commented Jan 20, 2026

Seems to be quantization-sensitive. I created this quant and it seems to perform great:
https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF
I even use it in Kilo Code and it works. Used the prompt:
write python code to import a csv, get the data type for each column using low memory = false, for fields that contain text summarize the number of records per value, for numbers give the average min and max, don't do anything with dates or other fields
And it created the script, a test file, a readme, and tested it, and it actually works.

(screenshots)

@ngxson
Collaborator

ngxson commented Jan 20, 2026

@ggerganov Thanks for looking into this. Indeed, I couldn't compare logprobs with vLLM because the model isn't properly supported by vLLM yet (would appreciate it if someone could do that).

I compared the HF implementation of this line by line against deepseek_v3, but so far I couldn't find any differences that could affect the result.

@ggerganov
Member

Yeah, we should definitely check against a reference implementation to make sure. Overall it performs quite well so far - I've been running it today for a while. But there was a one-off surprise where the reasoning was not properly closed. I would guess it's some tokenizer/template issue, but I will do more experiments to see if I can narrow it down.

@noctrex I am using the full precision so it's not a quantization issue.

@noctrex
Contributor

noctrex commented Jan 20, 2026

@ggerganov Weird, as I've seen reports of users who try the Q4 and below quants running into strange issues, such as starting to think in Chinese or thinking endlessly, and just returning gibberish answers.
Anyway, thanks for your great work!

@pwilkin
Collaborator

pwilkin commented Jan 20, 2026

@ggerganov you can use the autoparser branch, it already has proper support for reasoning and tool calling for this one :>

@ngxson
Collaborator

ngxson commented Jan 20, 2026

OK, so thanks to @bartowski1182 I got the logprobs from vLLM vs F16 on llama.cpp. There are indeed some differences, and I just realized that since the prompt is repeated in compare_logprobs.py, it makes the long-context tokens pretty much predictable (the script will need to be fixed).

Looking deeper into the code now...

idx token (logits_llama.log) logprob_1 token (logits_other.log) logprob_2 diff (abs)
1 'S' -2.6846 '\' -2.5100 0.1746
2 ' here' -0.8800 ' here' -1.6765 0.7964
3 ' AI' -0.6684 ' AI' -0.9069 0.2385
4 ' AI' -0.5882 ' AI' -0.9261 0.3379
5 ' assistant' -0.4866 ' assistant' -0.6661 0.1795
6 ' designed' -1.1536 ' designed' -1.3800 0.2264
7 ' of' -0.0046 ' of' -0.0041 0.0004
8 'S' -2.6461 '\' -2.7089 0.0628
9 ' tools' -2.0654 ' the' -1.9421 0.1233
10 ' to' -1.0288 '.' -1.0014 0.0274
1011 ' you' -0.0000 ' you' -0.0000 0.0000
1012 ' need' -0.0000 ' need' -0.0000 0.0000
1013 ' to' -0.0001 ' to' -0.0000 0.0001
1014 '\n' -2.1092 ' use' -0.0000 2.1092
1015 ' a' -0.0000 ' a' -0.0000 0.0000
1016 ' tool' -0.0000 ' tool' -0.0000 0.0000
1017 ' output' -0.0000 ' output' -0.0000 0.0000
1018 ' the' -0.0000 ' the' -0.0000 0.0000
1019 ' call' -0.0000 ' call' -0.0000 0.0000
1020 ' in' -0.0000 ' in' -0.0000 0.0000
5021 ' requires' -0.0000 ' requires' -0.0000 0.0000
5022 ' external' -0.0000 ' external' -0.0000 0.0000
5023 ' data' -0.0000 ' data' -0.0000 0.0000
5024 ' computation' -0.0000 ' computation' -0.0000 0.0000
5025 ' or' -0.0000 ' or' -0.0000 0.0000
5026 ' actions' -0.0000 ' actions' -0.0000 0.0000
5027 ' beyond' -0.0000 ' beyond' -0.0000 0.0000
5028 ' your' -0.0000 ' your' -0.0000 0.0000
5029 ' internal' -0.0000 ' internal' -0.0000 0.0000
5030 ' knowledge' -0.0000 ' knowledge' -0.0000 0.0000
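
For reference, the core of such a comparison is simple to sketch. The tab-separated "token<TAB>logprob" dump format below is hypothetical and may not match what compare_logprobs.py actually writes.

# Sketch: align two per-token logprob dumps and print the positions with the
# largest absolute differences. Line format is assumed, not taken from the script.
import sys

def load(path: str) -> list[tuple[str, float]]:
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, logprob = line.rstrip("\n").split("\t")
            rows.append((token, float(logprob)))
    return rows

a = load(sys.argv[1])  # e.g. dump from llama.cpp
b = load(sys.argv[2])  # e.g. dump from vLLM

ranked = sorted(
    ((abs(la - lb), idx, ta, la, tb, lb)
     for idx, ((ta, la), (tb, lb)) in enumerate(zip(a, b), start=1)),
    reverse=True,
)
for diff, idx, ta, la, tb, lb in ranked[:10]:
    print(f"{idx:>5} {ta!r:>16} {la:+.4f} {tb!r:>16} {lb:+.4f} diff={diff:.4f}")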

@aldehir
Collaborator

aldehir commented Jan 21, 2026

but as mentioned in the discussion, it can be a bit tricky as not all models support this. we can probably extend the /props to reflect if the model support putting back the reasoning into message, and show the option "Preserve reasoning" in webui

An option based off /props is certainly a cleaner approach, but I also don't think there's any harm in adding the reasoning content back. If a template doesn't use it, it'll simply be ignored.

@ggerganov Interleaved thinking isn't leveraged in basic chat. This check will prune any reasoning prior to the last user message, and in a chat session that's guaranteed to always prune:

loop.index0 > ns.last_user_index

It's only leveraged during tool calling. Something to consider when implementing MCP support.


Labels

python (python script changes)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Support Glm4MoeLiteForCausalLM