Conversation
Would close #18931

It's actually a renamed version of GLM4Moe with DeepseekV3Attention (uses MLA) and an added dense expert at the start, but I'm not sure if any of that is relevant here.

DeepseekV3 also has some dense layers at the start, so unless I missed something, it seems like just DeepSeek renamed (not GLM4Moe renamed).

Oh, I didn't know, my bad. I guess just the logic for picking dense layers is different. In that case, yeah, from the GGUF perspective it's just renamed DeepSeek.

Freshly converted, imatrix'd from Q8_0, and quantized to IQ4_XS. Output is coherent for multi-turn. PPL for imatrix is a bit high at ~15-17, but I think this is correct now. Should be ready for review.
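For anyone reproducing that quantization step, it is the standard llama-quantize call with an imatrix; a rough sketch (file names here are placeholders, not the exact paths used in this thread):

```sh
# Quantize a converted GGUF down to IQ4_XS using the previously computed imatrix
# (placeholder file names; adjust to your local paths).
./build/bin/llama-quantize \
    --imatrix imatrix-GLM-4.7-Flash.dat \
    GLM-4.7-Flash-BF16.gguf \
    GLM-4.7-Flash-IQ4_XS.gguf \
    IQ4_XS

# Optional sanity check: perplexity of the resulting quant on a held-out text file.
./build/bin/llama-perplexity -m GLM-4.7-Flash-IQ4_XS.gguf -f wiki.test.raw -ngl 99
```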
Thanks for sharing this quickly! I can confirm that I converted the safetensors to bf16 GGUF and it runs on 2x RTX A6000's with coherent multi-turn chat. imatrix with my usual corpus on the bf16 ended up at

convert command:

```sh
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/ \
    /mnt/raid/models/zai-org/GLM-4.7-Flash/
```

imatrix command:

```sh
model=/mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-64x2.6B-BF16-00001-of-00002.gguf
#-f ddh0_imat_calibration_data_v2.txt \
./build/bin/llama-imatrix \
    --model "$model" \
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/raid/models/ubergarm/GLM-4.7-Flash-GGUF/imatrix-GLM-4.7-Flash-BF16.dat \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    -fit off \
    -ngl 99 \
    -ts 40,48 \
    --threads 1 \
    --no-mmap \
    --output-format dat
```

The only odd thing I saw was this warning disabling flash attention even with

I assume flash attention being disabled is due to MLA? Not sure.

I suspect a glm4 tokenizer mismatch in GLM-4.7-Flash, because its vocab is 154,856 vs GLM-4.5-Air's 151,365, i.e. +3,491 tokens, and some special token IDs don't match. A GLM-4.7-Flash model converted under the glm4 identifier goes into infinite generation at inference time (likely the EOS/stop IDs don't match, so stop conditions never trigger) and also doesn't emit/recognize the tag, which I guess points to an incorrect special-token mapping. This probably needs some additional testing.
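One way to cross-check the special-token mapping is to dump the tokenizer metadata that the converter wrote into the GGUF and compare it against the original tokenizer_config.json. A rough sketch, assuming gguf-py is installed (gguf-dump ships with it), with a placeholder GGUF name and the HF path reused from the convert command above:

```sh
# Dump tokenizer-related KV metadata from the converted GGUF (placeholder file name).
gguf-dump GLM-4.7-Flash-BF16-00001-of-00002.gguf | grep -i "tokenizer.ggml.*token_id"

# Compare against the EOS/special-token declarations in the original HF checkout.
grep -i "eos" /mnt/raid/models/zai-org/GLM-4.7-Flash/tokenizer_config.json
```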
This might need additional testing. Following up on @EvilFreelancer's concern - this might be a real bug. See: https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/discussions/2

GLM-4.7-Flash's <|user|> token is declared as EOS in the original tokenizer config (to stop generation at the end of the assistant turn), but the GGUF only sets <|endoftext|> as EOS. The model doesn't stop at <|user|>, so it generates past its turn and loops into self-conversation:
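If the missing end-of-generation flag on <|user|> is confirmed, a temporary client-side workaround is to pass it as an explicit stop string. A sketch against llama-server's OpenAI-compatible endpoint, reusing the port and model name that appear later in this thread:

```sh
# Stopgap only: ask the server to stop on the <|user|> marker explicitly.
# The real fix is flagging the token as end-of-generation in the GGUF metadata.
curl http://127.0.0.1:8013/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash-hf",
    "messages": [{"role": "user", "content": "Say \"hi\""}],
    "stop": ["<|user|>"]
  }'
```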
Does the issue occur with GGUFs generated with this PR? I did not upload those other ones, and cannot vouch for them.

You misunderstand. I'm saying that you should not test with those GGUFs since they were not made with this PR. The issue might be related, or it might not. I would like to narrow it down, so please try with GGUFs generated with this PR.
@ngxson I don't see it in the Jinja.

Then it should be irrelevant if the chat template doesn't use it.

Confirmed, it works now.

Confirmed working with the latest changes (though it was also working fine for me before). Marked as ready for review again.

Confirmed, let's merge :)
💯 Nice work everyone! 🎉

Never mind, it does seem that some operations are not supported on CUDA... this problem is above my pay grade at this point; it will need to be addressed in another PR :(

Quick question about interleaved/preserved thinking:

@ggerganov there was a discussion around this: #18368. The webui change is needed because the but as mentioned in the discussion, it can be a bit tricky as not all models support this. We can probably extend the
@ngxson Btw, regarding #18936 (comment), I think there is still an issue. Consider this short conversation consisting of 2 messages:

When I dump the last prompt from the second message, I see this:

```sh
curl http://127.0.0.1:8013/slots?model="glm-4.7-flash-hf" | jq .[2]
```

```json
{
  ...
  "prompt": "[gMASK]<sop><|user|>Say \"hi\"<|assistant|></think>Hello! How can I help you today?<|user|>Say \"bye\"<|assistant|><think>",
  ...                                                 ^ problem here?
}
```

Notice that there isn't a corresponding
It's actually a feature of the chat template (I know that looks strange, but I was surprised too):

For reference, the chat template of GLM 4.6 doesn't do that:

Note that for messages that are not the last one, the template also skips the opening tag
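For anyone who wants to check this locally, the template the server actually loaded can be pulled from the running instance; a sketch using the /props endpoint (which reports the embedded chat template), with the port reused from above:

```sh
# Print the chat template the server loaded from the GGUF.
curl -s http://127.0.0.1:8013/props | jq -r .chat_template
```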
Hm yes, looks like it is intentional. Weird. Anyway, I encountered a situation where, during a longer conversation, the model didn't close its reasoning with
Seems to be quantization sensitive. I created this quant and it seems to perform great:
@ggerganov Thanks for looking into this. Indeed, I couldn't compare logprobs with vLLM because the model isn't properly supported by vLLM yet (would appreciate it if someone could do that). I compared the HF implementation of this model against deepseek_v3 line by line, but so far I couldn't find any differences that could affect the result.
Yeah, we should definitely check against a reference implementation to make sure. Overall it performs quite well so far - I've been running it today for a while. But there was a one-off surprise where the reasoning was not properly closed. I would guess it's some tokenizer/template issue, but I will do more experiments to see if I can narrow it down. @noctrex I am using full precision, so it's not a quantization issue.
@ggerganov Weird, as I've seen reports from users trying the Q4 and below quants of strange issues, such as starting to think in Chinese, thinking endlessly, or just returning gibberish answers.
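To put numbers on those quant reports, one option is llama-perplexity's KL-divergence mode, which compares a quant's token distributions against full-precision logits. A rough sketch with placeholder file names (the exact invocation is documented in the perplexity tool's README):

```sh
# 1) Save full-precision logits over a calibration text (placeholder names).
./build/bin/llama-perplexity -m GLM-4.7-Flash-BF16.gguf -f calibration.txt \
    --kl-divergence-base glm47-flash-base.kld -ngl 99

# 2) Score a quant against those logits: reports mean KLD, top-token agreement, etc.
./build/bin/llama-perplexity -m GLM-4.7-Flash-IQ4_XS.gguf \
    --kl-divergence-base glm47-flash-base.kld --kl-divergence -ngl 99
```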
@ggerganov you can use the autoparser branch, it already has proper support for reasoning and tool calling for this one :>
OK, so thanks to @bartowski1182 I got the logprobs from vLLM vs F16 on llama.cpp. There are indeed some differences, and I just realized that since the prompt is repeated in

Looking deeper into the code now...
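For reference, one way to pull per-token logprobs out of llama.cpp for this kind of comparison is the n_probs field on llama-server's /completion endpoint. A sketch, reusing the port from above with a placeholder prompt:

```sh
# Request top-5 per-token log-probabilities for a short completion.
curl -s http://127.0.0.1:8013/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 16, "n_probs": 5}' | jq .completion_probabilities
```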
An option based off

@ggerganov Interleaved thinking isn't leveraged in basic chat. This check will prune any reasoning prior to the last user message, and in a chat session that's guaranteed to always prune:

It's only leveraged during tool calling. Something to consider when implementing MCP support.




Support for Glm4MoeLiteForCausalLM, which seems to be just a renamed version of DeepseekV3 with some code moved around. Thanks to @ngxson for the help.

ref: