common: Generalized XML-style tool-call parsing with streaming support #958
Conversation
|
I tested this patch with https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/tree/main/Q4_X and got an error. Qwen3 Coder works great. |
|
@calvin2021y Use the template provided in this patch. |
|
Oh! I just realized I forgot to include the templates in this PR. I’ll add them shortly. |
|
I tried running it, but it does not work in Zed either. The diff I built from:
common/src/jinja/{gm.jinja2 => Kimi-K2.jinja2} | 28 ++++++++++++++++++-------
1 file changed, 20 insertions(+), 8 deletions(-)
rename common/src/jinja/{gm.jinja2 => Kimi-K2.jinja2} (81%)
diff --git a/common/src/jinja/gm.jinja2 b/common/src
I will rebuild with your new commit and test again. |
|
@calvin2021y The new commit doesn't change any source code. Could you provide more logs or a screenshot? That would help me figure out what's going on. |
|
I think it would be best to ask @ikawrakow to help confirm whether this issue is caused by something in my PR or by a misconfiguration elsewhere. I’m not fully certain which side the problem originates from, so a second opinion would be very helpful. |
|
The model responding with the first line of the Iliad in ancient Greek to "hi" does not seem right. It is probably best to first establish that the model is working (no tool calling and such) on the current main branch before trying to diagnose if there are bugs in this PR, or perhaps in PR #954 that appears to also have been merged for this test. |
|
I removed PR #954 and tested without it, then added it back. I guess the template feeds bad input into the model; Kimi K2 Thinking may need something different. I will test mainline as well. |
|
|
|
Is there a way to display the prompt just rendered by Minja? |
|
@calvin2021y Wait, should the |
I tried it; Zed shows: @ikawrakow |
Thanks! Fixed now. |
|
@calvin2021y Would you be able to share a sample response from the model so I can better understand the issue? |
|
@calvin2021y Try |
|
I've compiled the latest repo with this PR; it doesn't quite work with the Kimi K2 template included in the PR. Here is the command string and a log: The generated output is the following: If I run it without the However, it applies to the RooCode plugin in architect or code modes. I haven't tested the web UI, though. UPD: |
|
Hi @moooV252, you need to use |
@calvin2021y The log looks as expected. What is Zed editor complaining about? |
|
@calvin2021y Does sending a request with curl work for you? curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"model","messages":[{"role":"user","content":"check what time it is"}],"tools":[{"type":"function","function":{"name":"foobar","description":"gets the current time","parameters":{"type":"object","properties":{},"additionalProperties":false},"strict":true}}]}' |
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"model","messages":[{"role":"user","content":"check what time it is"}],"tools":[{"type":"function","function":{"name":"foobar","description":"gets the current time","parameters":{"type":"object","properties":{},"additionalProperties":false},"strict":true}}]}' |jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1150 0 883 100 267 108 32 0:00:08 0:00:08 --:--:-- 226
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "The user wants to know the current time. I have a function called \"foobar\" that is described as \"gets the current time\". I should call this function to get the current time and provide it to the user.",
"content": "<|im_end|>",
"tool_calls": [
{
"type": "function",
"function": {
"name": "foobar",
"arguments": "{}"
},
"id": "579vb9nfc5QHdRfBt2hQ8UQCh1aYu1Ke"
}
]
}
}
],
"created": 1763128240,
"model": "model",
"object": "chat.completion",
"usage": {
"completion_tokens": 58,
"prompt_tokens": 80,
"total_tokens": 138
},
"id": "chatcmpl-DYNwF6mohBuG6qNaOS1k6bJH7xsAqrTs",
"timings": {
"prompt_n": 80,
"prompt_ms": 1986.269,
"prompt_per_token_ms": 24.8283625,
"prompt_per_second": 40.2765184373315,
"predicted_n": 58,
"predicted_ms": 5878.277,
"predicted_per_token_ms": 101.34960344827586,
"predicted_per_second": 9.866836829907811
}
}
|
This result is still as expected. Could you try another model, such as Qwen3-Coder-30B, to rule out the Zed editor as the cause? |
|
@hksdpc255 The problem starts when it tries to invoke a tool inside the thinking block, which means the block wasn't properly closed. I don't know whether it's a parser issue or the LLM itself doesn't generate the closing token. Would it be possible for the parsing engine to add it when a tool call is detected and the thinking context is not closed yet? |
Qwen3-Coder-30B-UD8 works very well for a lot of tasks for me. I will retry ik_llama.cpp with Kimi K2 Thinking. I am not sure how to recreate the case @moooV252 described here (a tool call from inside the think block); if I can test this with curl it will be much easier to confirm. |
|
For the current implementation, when the model generates a tool-call scope start followed by a tool-call function start, the grammar forces it to produce a complete and valid tool-call message. If this happens during reasoning, the parser will simply ignore it. In llama.cpp, the grammar system and the parser live in separate modules. This separation complicates the implementation and makes it difficult to keep their behaviors aligned. |
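To illustrate the behavior described above, here is a minimal C++ sketch of a parser that treats tool-call syntax encountered inside an open reasoning block as plain reasoning text; all type names and marker strings are hypothetical and do not correspond to the actual implementation in this PR.
#include <string>

// Minimal sketch: tool-call markers seen while a reasoning block is open are
// kept as reasoning text instead of starting a tool call. Hypothetical names.
struct reasoning_aware_parser_sketch {
    bool in_reasoning = false;
    std::string reasoning, content;

    void feed(const std::string & piece) {
        // The real parser matches markers incrementally across pieces;
        // this sketch only handles whole markers for brevity.
        if (piece == "<think>")  { in_reasoning = true;  return; }
        if (piece == "</think>") { in_reasoning = false; return; }
        if (in_reasoning) {
            reasoning += piece;  // "<tool_call>" here stays inside the reasoning text
        } else {
            content += piece;    // outside reasoning, tool-call markers would be parsed
        }
    }
};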
|
Hi @hksdpc255, sometimes Kimi responds with "content":null; maybe this is the issue? {
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "The user is asking about disk space left on their system. This is a system information query, not related to the codebase. I should use the terminal tool to check disk space. I'll use the `df` command which is standard for checking disk space on Unix-like systems (which macOS is). I'll make it human-readable with the `-h` flag.",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "terminal",
"arguments": "{\"command\":\"df -h\",\"cd\":\"/\"}"
},
"id": "WVZYS6czSGcckk3o6YFELdIW8RRY4pWX"
}
]
}
}
],
"created": 1763196056,
"model": "a",
"object": "chat.completion",
"usage": {
"completion_tokens": 97,
"prompt_tokens": 3257,
"total_tokens": 3354
},
"id": "chatcmpl-kgVZ8PxGli2rCPBd5nzWCdy5Cf7xXRDt",
"timings": {
"prompt_n": 3257,
"prompt_ms": 26644.305,
"prompt_per_token_ms": 8.18062787841572,
"prompt_per_second": 122.24000588493487,
"predicted_n": 97,
"predicted_ms": 9068.085,
"predicted_per_token_ms": 93.48541237113402,
"predicted_per_second": 10.696856061671236
}
} |
Nope. This is because the content is an empty string. |
|
I mean, maybe Zed is not able to handle "content":null. Can we change this into "content":""? |
|
I can confirm that the Zed editor handles null content correctly. Changing null to "" would cause massive unit-test failures. I used to change null to "", but the maintainers rejected it. |
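If a client really cannot cope with a null content field, one option is to normalize it on the client side rather than in the server; a small sketch using nlohmann::json (an assumption about the client's JSON library, not a change proposed here):
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch of a client-side workaround: rewrite "content": null to "" before the
// response reaches code that cannot handle null content.
void normalize_null_content(json & response) {
    for (auto & choice : response["choices"]) {
        auto & msg = choice["message"];
        if (msg.contains("content") && msg["content"].is_null()) {
            msg["content"] = "";
        }
    }
}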
|
I've run a test session. I still get the same result (a tool call inside the thinking block), but now with a log trace - maybe it will help narrow the problem down.
And here's the relevant portion of the logs: And the incorrect one: I've noticed that when a tool use is wrongly invoked inside a thinking block, the block is actually empty - no thoughts are given out by the LLM - but when it outputs any text at all, the block is closed correctly. So it narrows down to tracking whether the opening of the thinking block is immediately followed by a tool call; if so, insert the missing closing token. |
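The workaround suggested here could look roughly like the following sketch; the marker strings and the injection point are assumptions for illustration, not code from this PR.
#include <string>

// Sketch of the suggested heuristic: if a tool-call opener arrives while the
// reasoning block is still open, inject the missing close marker first.
std::string autoclose_reasoning(bool & reasoning_open, const std::string & marker) {
    if (marker == "<tool_call>" && reasoning_open) {
        reasoning_open = false;
        return "</think><tool_call>";   // close the dangling think block, then continue
    }
    if (marker == "<think>")  reasoning_open = true;
    if (marker == "</think>") reasoning_open = false;
    return marker;
}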
|
@moooV252 Thanks for the detailed log, that makes the issue clear. Kimi-K2 does, in fact, emit two different tool-call formats:
It’s unclear why Kimi-K2 uses two incompatible formats, but this behavior is model-side rather than parser-side. The current implementation only supports a single tool-call syntax, so the second form is parsed as plain text. Supporting both formats simultaneously would require additional special-case handling, since the two syntaxes differ structurally. This issue documents the root cause. Whether or how to support the second format is a separate design question, as it would involve adding non-standard hacks to accommodate Kimi-K2’s inconsistent behavior. Now we should ask the maintainer @ikawrakow whether partial implementation for a model is acceptable. |
|
@moooV252 See: ggml-org/llama.cpp#16932 (comment). One of the maintainers confirmed that this problem is caused by Roo Code. |
|
The upstream PR is now ready to merge, and all relevant changes have already been synced into this PR. |
|
Kimi K2 Thinking with the new template: Zed still shows "Tool call not found".
I tested with GLM-4.5 Air and MiniMax M2. LGTM
|
@calvin2021y Can you try with mainline's PR and see if it behaves the same? If so, it can be fixed later. |
Rebuilt with the mainline patch and the new template; Zed still shows:
In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
|
Kimi-K2 is too large for my hardware to run, even with the most aggressive quantization. Unfortunately, I'm unable to test it myself; I can only run models smaller than 120 GB.
The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in #723. Then, after #958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off. This commit reverts the logic from ggml-org/llama.cpp#2849, such that the logprobs of the stop token will always be included in the response, when logprobs is enabled. From testing, this matches with the behavior of Fireworks inference server, for both chat completions and text completions endpoints. Also fix logprobs param handling for the text completion endpoint.





This patch is ported from upstream PR #16932 and additionally incorporates the most recent changes from minja to ensure compatibility.
Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.
Introduces a generalized implementation for almost all XML-style tool-call formats.
Supported models
Grammar-constrained tool-call outputs
Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.
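As an illustration of what a grammar for an XML-style tool call can look like, here is a hand-written GBNF sketch embedded as a C++ string; it is generic and hypothetical, not the grammar actually produced by the generator in this PR, which is built per model and per tool schema.
// Hand-written GBNF sketch for a generic XML-style tool call (illustrative only).
static const char * xml_toolcall_grammar_sketch = R"GBNF(
root   ::= "<tool_call>" ws invoke ws "</tool_call>"
invoke ::= "<invoke name=\"" name "\">" ws param* "</invoke>"
param  ::= "<parameter name=\"" name "\">" value "</parameter>" ws
name   ::= [a-zA-Z_] [a-zA-Z0-9_-]*
value  ::= [^<]*
ws     ::= [ \t\n]*
)GBNF";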
Streaming support for tool-call parsing
The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.
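Conceptually, streaming support means the parser can be fed newly generated text piece by piece and report a delta after each piece, rather than waiting for the full message. A rough usage sketch with hypothetical type and function names (the real API in this PR differs):
#include <string>
#include <vector>

// What changed since the previous feed() call: new reasoning text, new visible
// content, or a completed tool call serialized as JSON. Hypothetical types.
struct parse_delta { std::string reasoning, content, tool_call_json; };

struct streaming_parser_sketch {
    parse_delta feed(const std::string & piece) {
        // A real implementation tracks partial markers across calls; this stub
        // just forwards the piece as content to keep the sketch self-contained.
        return { "", piece, "" };
    }
};

void stream_to_client(streaming_parser_sketch & parser, const std::vector<std::string> & pieces) {
    for (const auto & piece : pieces) {
        parse_delta d = parser.feed(piece);  // emitted immediately, not at end of turn
        (void) d;                            // forward d to the client as an SSE chunk here
    }
}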
Automatic chat-template fixing
A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.
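The patcher's idea is straightforward: apply a short list of textual fixes to the official template before it is handed to the Jinja engine. A minimal sketch of that shape follows; the example fix is hypothetical, and the fixes actually applied by this PR are model-specific and not shown here.
#include <string>
#include <utility>
#include <vector>

// Minimal sketch of automatic chat-template patching: run (pattern -> replacement)
// fixes over the official template text before rendering. The fix below is hypothetical.
std::string fix_chat_template_sketch(std::string tmpl) {
    const std::vector<std::pair<std::string, std::string>> fixes = {
        { "{{ tool.arguments }}", "{{ tool.arguments | tojson }}" },  // hypothetical fix
    };
    for (const auto & [from, to] : fixes) {
        for (size_t pos = tmpl.find(from); pos != std::string::npos; pos = tmpl.find(from, pos + to.size())) {
            tmpl.replace(pos, from.size(), to);
        }
    }
    return tmpl;
}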
In-context reasoning
The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.
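For example, a single generation might look like the following (tag names are illustrative and depend on the model's template; get_time and get_weather are made-up tools). The parser keeps both reasoning blocks as reasoning_content and extracts both tool calls:
<think>I should check the time first.</think>
<tool_call><invoke name="get_time"></invoke></tool_call>
<think>The user also asked about the weather, so a second call is needed.</think>
<tool_call><invoke name="get_weather"><parameter name="city">Paris</parameter></invoke></tool_call>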
Enhanced unit tests
Adds a unit test for the streaming-mode parser. It simulates the generation phase by feeding content character by character, comparing the parsed results and verifying that streaming and non-streaming modes reach the same final state.
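A sketch of what such a test can look like, using a trivial stand-in parser so the example stays self-contained (the real test exercises the actual parser in the repository):
#include <cassert>
#include <string>

// Stand-in parser that just accumulates text; the real test uses the actual
// streaming tool-call parser.
struct tiny_parser_sketch {
    std::string state;
    void feed(const std::string & piece) { state += piece; }
};

void test_streaming_matches_non_streaming(const std::string & input) {
    tiny_parser_sketch streamed;
    for (char c : input) {
        streamed.feed(std::string(1, c));   // simulate generation character by character
    }

    tiny_parser_sketch whole;
    whole.feed(input);                      // non-streaming: the whole message at once

    assert(streamed.state == whole.state);  // both modes must reach the same final state
}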
Additional Notes
--reasoning-format none can be passed if needed. Use -lv 1 in the command line to enable more detailed logging.