Conversation

@ngxson (Collaborator) commented Jul 24, 2025

Fix a mistake in the Kimi K2 chat template: the add_ass check was placed inside the loop, which caused the formatted text to contain the assistant prompt in the wrong places.

Also fix incorrect code ordering in llama-arch.cpp.
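
For context, here is a minimal sketch of what the fix amounts to, assuming a simplified message loop; it is not the actual llama.cpp code, and the per-message token layout other than `<|im_assistant|>assistant<|im_middle|>` (quoted later in this thread) is an assumption about the Kimi K2 template:

```cpp
// Minimal sketch of the bug and the fix, not the actual llama.cpp code.
// The per-message token layout is an assumption about the Kimi K2 template;
// only "<|im_assistant|>assistant<|im_middle|>" is quoted in this thread.
#include <sstream>
#include <string>
#include <vector>

struct chat_msg {
    std::string role;    // "user", "assistant", "system"
    std::string content;
};

std::string format_kimi_k2(const std::vector<chat_msg> & msgs, bool add_ass) {
    std::ostringstream ss;
    for (const auto & m : msgs) {
        ss << "<|im_" << m.role << "|>" << m.role << "<|im_middle|>"
           << m.content << "<|im_end|>";
        // BUG (before this PR): the add_ass check lived here, inside the loop,
        // so the assistant prompt was appended after every message.
    }
    // FIX: append the assistant generation prompt once, after all messages,
    // and only when the caller requests it.
    if (add_ass) {
        ss << "<|im_assistant|>assistant<|im_middle|>";
    }
    return ss.str();
}
```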

@jukofyork (Collaborator)

So would the old code have added <|im_assistant|>assistant<|im_middle|> after each [EOS]?

I only ask as I'm still trying to get an answer on what the proper EOS token is:

https://huggingface.co/moonshotai/Kimi-K2-Instruct/discussions/31

and this might explain why this guy got:

Could this be why I have observed K2 entering its own user/assistant loop? I've never seen this phenomenon before.

Here's an example I saved:

(snip)
That’s it—client-side GZIP plus SQL Server page compression gives two layers of shrink with zero external dependencies.<|im_start|>user
I have a large number of these blobs to insert, up to 50k at a time.
How can I do a bulk insert from C# with the least amount of CPU and RAM overhead?
<|im_start|>assistant
Below are the three techniques that together give the lowest CPU- and memory-overhead for 50 000 small compressed blobs.
(snip)

For clarity, all of the above tokens, including the im_start tokens and the user and assistant words, are from K2.

yet I never saw this (likely because I was using the --jinja option)?

@CISC (Collaborator) commented Jul 24, 2025

It would add it after each message (when --jinja was not used), which certainly would confuse the model. :)
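
Concretely, with add_ass inside the loop the rendered prompt would look roughly like this (illustrative only, using the token layout assumed in the sketch above, not verbatim llama.cpp output):

```
<|im_user|>user<|im_middle|>Hi<|im_end|><|im_assistant|>assistant<|im_middle|>
<|im_assistant|>assistant<|im_middle|>Hello!<|im_end|><|im_assistant|>assistant<|im_middle|>
```

With the fix, `<|im_assistant|>assistant<|im_middle|>` appears only once, at the very end of the prompt.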

CISC merged commit 820de57 into ggml-org:master on Jul 24, 2025
49 of 51 checks passed
taronaeo pushed a commit to taronaeo/llama.cpp-s390x that referenced this pull request Jul 25, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 25, 2025
* origin/master:
docs : update HOWTO-add-model.md for ModelBase and new model classes (ggml-org#14874)
ggml : remove invalid portPos specifiers from dot files (ggml-org#14838)
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (ggml-org#14870)
mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (ggml-org#14503)
rpc : check for null buffers in get/set/copy tensor endpoints (ggml-org#14868)
sched : fix multiple evaluations of the same graph with pipeline parallelism (ggml-org#14855)
musa: upgrade musa sdk to rc4.2.0 (ggml-org#14498)
sync : ggml
cmake : fix usage issues (ggml/1257)
ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
context : perform output reorder lazily upon access after sync (ggml-org#14853)
chat : fix kimi-k2 chat template (ggml-org#14852)
sycl: fixed semantics of block offset calculation (ggml-org#14814)
llama : fix MiniCPM inference after Granite Four changes (ggml-org#14850)
docs: add libcurl-dev install hint for Linux distros (ggml-org#14801)
metal : fix fusion across different encoders (ggml-org#14849)
sycl: fix undefined variable in work group size check (ggml-org#14843)
convert : text-only support for GLM-4.1V-9B-Thinking (ggml-org#14823)
CUDA: fix overflow in FA, tune performance (ggml-org#14840)
CUDA: fix compilation with GGML_CUDA_F16 (ggml-org#14837)