
support Glm4MoeLite #18936

Merged
ngxson merged 9 commits into ggml-org:master from ddh0:glm4moelite
Jan 19, 2026

Conversation

@ddh0
Contributor

@ddh0 ddh0 commented Jan 19, 2026

Support for Glm4MoeLiteForCausalLM, which seems to be just a renamed version of DeepseekV3 with some code moved around. Thanks to @ngxson for the help.
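
As a quick post-conversion sanity check (purely illustrative, not part of this PR), the gguf-py reader can print the architecture string the converter wrote into a GGUF. The file name below is a placeholder, and the exact string-decoding helpers vary a bit between gguf-py versions.

# Sketch: print the architecture string stored in a converted GGUF.
# Placeholder file name; string decoding follows the pattern used by gguf_dump.
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("GLM-4.7-Flash-BF16-00001-of-00002.gguf")
arch = reader.fields["general.architecture"]
print(bytes(arch.parts[-1]).decode("utf-8"))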

ref:

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Would close #18931

@github-actions github-actions bot added the python (python script changes) label Jan 19, 2026
@fizzAI

fizzAI commented Jan 19, 2026

It's actually a renamed version of GLM4Moe with DeepseekV3Attention (which uses MLA) and an added dense expert at the start, but I'm not sure if any of that is relevant here.

@ngxson
Collaborator

ngxson commented Jan 19, 2026

with DeepseekV3Attention (uses MLA) and an added dense expert at the start

DeepseekV3 also has some dense layers at the start, so unless I missed something, it seems like just DeepSeek renamed (not GLM4Moe renamed).

@fizzAI

fizzAI commented Jan 19, 2026

Oh, I didn't know, my bad. I guess just the logic for picking dense layers is different. In that case, yeah, from the GGUF perspective it's just renamed DeepSeek.
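
For anyone who wants to check the architecture claims above themselves, one option is to diff the MLA/MoE-related hyperparameters in the two HF configs. The sketch below is illustrative only: it assumes the Glm4MoeLite config reuses DeepSeek-V3's field names (kv_lora_rank, first_k_dense_replace, and so on), and both paths are placeholders.

# Illustrative sketch: compare MLA/MoE-related fields between a local
# Glm4MoeLite (GLM-4.7-Flash) config.json and a DeepSeek-V3 config.json.
# Field names follow DeepSeek-V3's config and are assumed to carry over.
import json
from pathlib import Path

FIELDS = [
    "architectures", "num_hidden_layers", "first_k_dense_replace",
    "n_routed_experts", "num_experts_per_tok", "moe_intermediate_size",
    "q_lora_rank", "kv_lora_rank", "qk_rope_head_dim", "v_head_dim",
]

glm = json.loads(Path("GLM-4.7-Flash/config.json").read_text())    # placeholder path
dsv3 = json.loads(Path("DeepSeek-V3/config.json").read_text())     # placeholder path

for key in FIELDS:
    print(f"{key:24} glm={glm.get(key)!r}  dsv3={dsv3.get(key)!r}")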

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Freshly converted, imatrix'd from Q8_0, and quantized to IQ4_XS. Output is coherent for multi-turn. PPL for imatrix is a bit high at ~15 - 17 but I think this is correct now. Should be ready for review.

@ddh0 ddh0 marked this pull request as ready for review January 19, 2026 18:04
@ddh0 ddh0 requested a review from CISC as a code owner January 19, 2026 18:04
@ubergarm
Contributor

Output is coherent for multi-turn. PPL for imatrix is a bit high at ~15 - 17 but I think this is correct now. Should be ready for review.

Thanks for sharing this quickly! I can confirm that I converted the safetensors to bf16 GGUF and it runs on 2x RTX A6000's with coherent multi-turn chat.

imatrix with my usual corpus on the bf16 ended up at Final estimate: PPL = 7.1406 +/- 0.04441, and when running your https://huggingface.co/ddh0/imatrices/raw/main/ddh0_imat_calibration_data_v2.txt on the bf16 I get results comparable to your Q8_0: Final estimate: PPL = 15.5197 +/- 0.33298

👈 convert command
$ python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/ \
    /mnt/raid/models/zai-org/GLM-4.7-Flash/
👈 imatrix command
model=/mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-64x2.6B-BF16-00001-of-00002.gguf

    #-f ddh0_imat_calibration_data_v2.txt \

./build/bin/llama-imatrix \
    --model "$model"\
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/imatrix-GLM-4.7-Flash-BF16.dat \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    -fit off \
    -ngl 99 \
    -ts 40,48 \
    --threads 1 \
    --no-mmap \
    --output-format dat

The only odd thing I saw was this warning disabling flash attention even with -fit off, hrmm:

llama_context: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

I assume flash attention being disabled is due to MLA? Not sure

Collaborator

@ngxson ngxson left a comment

Nice 🚀 (can be merged once @CISC approves)

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

I don't have permissions to merge it myself, can @ngxson or @CISC hit the button for me please?

Thanks both of you for your help! It's easy to add new models when they're just renamed existing models :)

@CISC
Collaborator

CISC commented Jan 19, 2026

I don't have permissions to merge it myself, can @ngxson or @CISC hit the button for me please?

Will do once CI finishes.

BTW, we should consider including the MTP layer in DeepSeek models soon.

@CISC CISC linked an issue Jan 19, 2026 that may be closed by this pull request
@EvilFreelancer
Contributor

I suspect a glm4 tokenizer mismatch in GLM-4.7-Flash, because its vocab is 154,856 vs GLM-4.5-Air's 151,365, i.e. +3,491 tokens, and some special token IDs don't match.

The GLM-4.7-Flash model converted under the glm4 identifier goes into infinite generation at inference time (likely the EOS/stop IDs don't match, so stop conditions never trigger) and also doesn't emit/recognize the tag, which I guess points to incorrect special-token mapping.

Probably needs some additional testing.
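
A quick way to check this claim is to load both tokenizers with transformers and compare vocab sizes and special tokens side by side. A minimal sketch, assuming both repos load via AutoTokenizer and using the public HF model IDs:

# Sketch: compare vocab size and special tokens of the two tokenizers.
# Adjust model IDs/paths as needed.
from transformers import AutoTokenizer

for model_id in ("zai-org/GLM-4.7-Flash", "zai-org/GLM-4.5-Air"):
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    print(model_id)
    print("  vocab size:", len(tok))
    print("  eos token :", tok.eos_token, tok.eos_token_id)
    print("  additional special tokens:", tok.additional_special_tokens[:8], "...")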

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

The glm4 identifier is only the pre-tokenizer; the actual tokenizer will work just like for any other model, as far as I understand. LMK if you find a concrete issue.

@Aaryan-Kapoor
Contributor

This might need additional testing.

Following up with @EvilFreelancer's concern - this might be a real bug. See: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/discussions/2

GLM-4.7-Flash's <|user|> token is declared as EOS in the original tokenizer config (to stop generation at end of assistant turn), but the GGUF only sets <|endoftext|> as EOS. The model doesn't stop at <|user|>, so it generates past its turn and loops into self-conversation:

Wait - looking at `User` (the first line of my internal thought block)...
Ah, I see. In this specific simulation/turn-based interface:
1. User says "hello"
2. Model responds with weirdness
3. **Current Turn**: The user is sending the exact same command to a new model instance or just repeating it because they didn't get an answer?
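
If this diagnosis is right, a client-side workaround (independent of any GGUF metadata fix) is to pass <|user|> as an explicit stop string. A hedged sketch against llama-server's OpenAI-compatible endpoint, with placeholder port and model name:

# Sketch of a client-side workaround: ask the server to stop at <|user|> even
# if the GGUF's EOS metadata doesn't include it. Port/model are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": "hello"}],
        "stop": ["<|user|>"],  # extra stop string in addition to EOS
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])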

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

This might need additional testing.

Following up with @EvilFreelancer's concern - this might be a real bug. See: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/discussions/2

GLM-4.7-Flash's <|user|> token is declared as EOS in the original tokenizer config (to stop generation at end of assistant turn), but the GGUF only sets <|endoftext|> as EOS. The model doesn't stop at <|user|>, so it generates past its turn and loops into self-conversation:

Wait - looking at `User` (the first line of my internal thought block)...
Ah, I see. In this specific simulation/turn-based interface:
1. User says "hello"
2. Model responds with weirdness
3. **Current Turn**: The user is sending the exact same command to a new model instance or just repeating it because they didn't get an answer?

Does the issue occur with GGUFs generated with this PR? I did not upload those other ones, and cannot vouch for them.

@EvilFreelancer
Contributor

@ddh0 you may try https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

@ddh0 you may try https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF

You misunderstand. I'm saying that you should not test with those GGUFs since they were not made with this PR. The issue might be related, or it might not. I would like to narrow it down, so please try with GGUFs generated with this PR.

@arch-btw
Contributor

@ngxson I don't see it in the Jinja.

@ngxson
Collaborator

ngxson commented Jan 19, 2026

Then it should be irrelevant if the chat template doesn't use it.

@ngxson
Collaborator

ngxson commented Jan 19, 2026

The new GGUF should work now:

(screenshot)

@noctrex
Contributor

noctrex commented Jan 19, 2026

Confirmed, it works now.

@ddh0 ddh0 marked this pull request as ready for review January 19, 2026 21:56
@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Confirmed working with the latest changes (though, it was also working fine for me before). Marked as ready for review again.

@EvilFreelancer
Contributor

Confirmed, let's merge :)

@ngxson ngxson merged commit 1706a6d into ggml-org:master Jan 19, 2026
6 checks passed
@CISC
Collaborator

CISC commented Jan 19, 2026

💯 Nice work everyone! 🎉

@ddh0
Contributor Author

ddh0 commented Jan 19, 2026

Odd, because this doesn't happen when running big Deepseek with MLA. Perhaps there aren't CUDA flash MLA kernels configured for these tensor dimensions or something?

For the record @DocShotgun, flash attention is working fine for me. Not sure what was going on earlier, but it seems OK now. Feel free to open a new issue and link it here if there are still problems.

Never mind, it does seem that some operations are not supported on CUDA... this problem is above my pay grade at this point; it will need to be addressed in another PR :(

@ggerganov
Member

Quick question about interleaved/preserved thinking:
If I pass the kwarg "clear_thinking": false to the chat template, would this work to enable "preserved" thinking? Do we need changes on the client side as well? If yes, should we support this in the llama-server WebUI?

cc @ngxson @aldehir @pwilkin

@ngxson
Collaborator

ngxson commented Jan 20, 2026

@ggerganov there was a discussion around this: #18368

A webui change is needed because the reasoning_content must be put back into the message object as a dedicated field.

But as mentioned in the discussion, it can be a bit tricky, as not all models support this. We can probably extend /props to reflect whether the model supports putting the reasoning back into the message, and show a "Preserve reasoning" option in the webui.
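
For context, a sketch of what such a client-side change could look like. This assumes llama-server forwards a per-request chat_template_kwargs object to the template and that the template actually reads reasoning_content; whether a given model/template does is exactly what the proposed /props flag would advertise.

# Sketch: a client that preserves reasoning by (a) passing clear_thinking=false
# to the template and (b) putting reasoning_content back on prior assistant
# messages. Field support varies by model/template; treat this as an assumption.
import requests

payload = {
    "model": "glm-4.7-flash",
    "chat_template_kwargs": {"clear_thinking": False},
    "messages": [
        {"role": "user", "content": 'Say "hi"'},
        {
            "role": "assistant",
            "content": "Hello! How can I help you today?",
            "reasoning_content": "The user wants a greeting.",
        },
        {"role": "user", "content": 'Say "bye"'},
    ],
}
resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"])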

@ggerganov
Member

@ngxson Btw, regarding #18936 (comment), I think there is still an issue. Consider this short conversation consisting of 2 messages:

(screenshot of the two-message conversation)

When I dump the last prompt from the second message, I see this:

curl "http://127.0.0.1:8013/slots?model=glm-4.7-flash-hf" | jq '.[2]'

{
  ...
  "prompt": "[gMASK]<sop><|user|>Say \"hi\"<|assistant|></think>Hello! How can I help you today?<|user|>Say \"bye\"<|assistant|><think>",
  ...                                                     ^ problem here?
}

Notice that there isn't a corresponding <think> token for the assistant response of the first message. This looks strange - not sure if it is necessarily wrong, but to me it looks bogus.

@ngxson
Copy link
Collaborator

ngxson commented Jan 20, 2026

It's actually a feature of the chat template (I know that looks strange, but I was surprised too):

{%- if add_generation_prompt -%}
    <|assistant|>{{- '</think>' if (enable_thinking is defined and not enable_thinking) else '<think>' -}}
{%- endif -%}

For ref, chat template of GLM 4.6 doesn't do that:

{%- if add_generation_prompt -%}
    <|assistant|>{{- '\n<think></think>' if (enable_thinking is defined and not enable_thinking) else '' -}}
{%- endif -%}

@ngxson
Collaborator

ngxson commented Jan 20, 2026

Note that for messages that are not the last one, they also skip the opening tag <think>:

{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
{{ '<think>' + reasoning_content.strip() +  '</think>'}}
{%- else -%}
{{ '</think>' }}
{%- endif -%}

@ggerganov
Member

Note that for messages that are not the last one, they also skip the opening tag <think>:

Hm yes, looks like it is intentional. Weird.

Anyway, I encountered a situation where, during a longer conversation, the model didn't close its reasoning with </think> before generating the final response, so I started looking into why that could be. Thought this might be related, but maybe there is something else going on. Thanks.

@noctrex
Contributor

noctrex commented Jan 20, 2026

Seems to be quantization-sensitive. I created this quant and it seems to perform great:
https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF
I even use it in Kilo Code and it works. Used the prompt:
write python code to import a csv, get the data type for each column using low memory = false, for fields that contain text summarize the number of records per value, for numbers give the average min and max, don't do anything with dates or other fields
And it created the script, a test file, a readme, and tested it, and it actually works.

(screenshots)

@ngxson
Collaborator

ngxson commented Jan 20, 2026

@ggerganov Thanks for looking into this. Indeed, I couldn't compare logprobs with vLLM because the model isn't properly supported by vLLM yet (would appreciate it if someone could do that).

I compared the HF implementation of this line by line against deepseek_v3, but so far I couldn't find any differences that could affect the result.

@ggerganov
Member

Yeah, we should definitely check against a reference implementation to make sure. Overall it performs quite well so far - I've been running it today for a while. But there was a one-off surprise where the reasoning was not properly closed. I would guess it's some tokenizer/template issue, but I will do more experiments to see if I can narrow it down.

@noctrex I am using the full precision so it's not a quantization issue.

@noctrex
Contributor

noctrex commented Jan 20, 2026

@ggerganov Weird, as I've seen reports of users who try the Q4 and below quants running into strange issues, such as starting to think in Chinese or thinking endlessly, and just returning gibberish answers.
Anyway, thanks for your great work!

@pwilkin
Collaborator

pwilkin commented Jan 20, 2026

@ggerganov you can use the autoparser branch, it already has proper support for reasoning and tool calling for this one :>

@ngxson
Collaborator

ngxson commented Jan 20, 2026

OK, so thanks to @bartowski1182 I got the logprobs from vLLM vs F16 on llama.cpp. There are indeed some differences, and I just realized that since the prompt is repeated in compare_logprobs.py, it makes the long-context tokens pretty much predictable (the script will need to be fixed).

Looking deeper into the code now...

idx token (logits_llama.log) logprob_1 token (logits_other.log) logprob_2 diff (abs)
1 'S' -2.6846 '\' -2.5100 0.1746
2 ' here' -0.8800 ' here' -1.6765 0.7964
3 ' AI' -0.6684 ' AI' -0.9069 0.2385
4 ' AI' -0.5882 ' AI' -0.9261 0.3379
5 ' assistant' -0.4866 ' assistant' -0.6661 0.1795
6 ' designed' -1.1536 ' designed' -1.3800 0.2264
7 ' of' -0.0046 ' of' -0.0041 0.0004
8 'S' -2.6461 '\' -2.7089 0.0628
9 ' tools' -2.0654 ' the' -1.9421 0.1233
10 ' to' -1.0288 '.' -1.0014 0.0274
1011 ' you' -0.0000 ' you' -0.0000 0.0000
1012 ' need' -0.0000 ' need' -0.0000 0.0000
1013 ' to' -0.0001 ' to' -0.0000 0.0001
1014 '\n' -2.1092 ' use' -0.0000 2.1092
1015 ' a' -0.0000 ' a' -0.0000 0.0000
1016 ' tool' -0.0000 ' tool' -0.0000 0.0000
1017 ' output' -0.0000 ' output' -0.0000 0.0000
1018 ' the' -0.0000 ' the' -0.0000 0.0000
1019 ' call' -0.0000 ' call' -0.0000 0.0000
1020 ' in' -0.0000 ' in' -0.0000 0.0000
5021 ' requires' -0.0000 ' requires' -0.0000 0.0000
5022 ' external' -0.0000 ' external' -0.0000 0.0000
5023 ' data' -0.0000 ' data' -0.0000 0.0000
5024 ' computation' -0.0000 ' computation' -0.0000 0.0000
5025 ' or' -0.0000 ' or' -0.0000 0.0000
5026 ' actions' -0.0000 ' actions' -0.0000 0.0000
5027 ' beyond' -0.0000 ' beyond' -0.0000 0.0000
5028 ' your' -0.0000 ' your' -0.0000 0.0000
5029 ' internal' -0.0000 ' internal' -0.0000 0.0000
5030 ' knowledge' -0.0000 ' knowledge' -0.0000 0.0000
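
For reference, the core of such a comparison is simple to sketch. The tab-separated "token<TAB>logprob" dump format below is hypothetical and may not match what compare_logprobs.py actually writes.

# Sketch: align two per-token logprob dumps and print the positions with the
# largest absolute differences. Line format is assumed, not taken from the script.
import sys

def load(path: str) -> list[tuple[str, float]]:
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, logprob = line.rstrip("\n").split("\t")
            rows.append((token, float(logprob)))
    return rows

a = load(sys.argv[1])  # e.g. dump from llama.cpp
b = load(sys.argv[2])  # e.g. dump from vLLM

ranked = sorted(
    ((abs(la - lb), idx, ta, la, tb, lb)
     for idx, ((ta, la), (tb, lb)) in enumerate(zip(a, b), start=1)),
    reverse=True,
)
for diff, idx, ta, la, tb, lb in ranked[:10]:
    print(f"{idx:>5} {ta!r:>16} {la:+.4f} {tb!r:>16} {lb:+.4f} diff={diff:.4f}")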

@aldehir
Collaborator

aldehir commented Jan 21, 2026

but as mentioned in the discussion, it can be a bit tricky as not all models support this. we can probably extend the /props to reflect if the model support putting back the reasoning into message, and show the option "Preserve reasoning" in webui

An option based off /props is certainly a cleaner approach, but I also don't think there's any harm in adding the reasoning content back. If a template doesn't use it, it'll simply be ignored.

@ggerganov Interleaved thinking isn't leveraged in basic chat. This check will prune any reasoning prior to the last user message, and in a chat session that's guaranteed to always prune:

loop.index0 > ns.last_user_index

It's only leveraged during tool calling. Something to consider when implementing MCP support.


Labels

python (python script changes)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Support Glm4MoeLiteForCausalLM