common: Generalized XML-style tool-call parsing with streaming support #958
Conversation
|
I tested this patch with https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/tree/main/Q4_X and got an error. Qwen3 Coder works great. |
|
@calvin2021y Use the template provided in this patch. |
|
Oh! I just realized I forgot to include the templates in this PR. I’ll add them shortly. |
|
I tried running it, but it does not work in Zed either. The diff I built from:
common/src/jinja/{gm.jinja2 => Kimi-K2.jinja2} | 28 ++++++++++++++++++-------
1 file changed, 20 insertions(+), 8 deletions(-)
rename common/src/jinja/{gm.jinja2 => Kimi-K2.jinja2} (81%)
diff --git a/common/src/jinja/gm.jinja2 b/common/src
I will rebuild with your new commit and test again. |
|
@calvin2021y The new commit doesn't change any source code. Could you provide more logs or a screenshot? That would help me figure out what's going on. |
|
I think it would be best to ask @ikawrakow to help confirm whether this issue is caused by something in my PR or by a misconfiguration elsewhere. I’m not fully certain which side the problem originates from, so a second opinion would be very helpful. |
|
The model responding with the first line of the Iliad in ancient Greek to "hi" does not seem right. It is probably best to first establish that the model is working (no tool calling and such) on the current main branch before trying to diagnose if there are bugs in this PR, or perhaps in PR #954 that appears to also have been merged for this test. |
|
I removed PR #954 and tested without it, then added it back. I guess the template feeds bad input into the model; Kimi K2 Thinking may need something different. I will test mainline as well. |
|
|
|
Is there a way to display the prompt just rendered by Minja? |
|
@calvin2021y Wait, should the |
I tried it; Zed shows: @ikawrakow |
Thanks! Fixed now. |
|
@calvin2021y Would you be able to share a sample response from the model so I can better understand the issue? |
|
@calvin2021y Try |
|
I've compiled the latest repo with this PR; it doesn't quite work with the Kimi K2 template included in the PR. Here is the command string and a log: The generated output is the following: If I run it without the However, it applies to the RooCode plugin in architect or code modes. I haven't tested the web UI, though. UPD: |
|
Hi @moooV252, you need to use |
@calvin2021y The log looks as expected. What is Zed editor complaining about? |
|
@calvin2021y Does sending a request with curl work for you? curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"model","messages":[{"role":"user","content":"check what time it is"}],"tools":[{"type":"function","function":{"name":"foobar","description":"gets the current time","parameters":{"type":"object","properties":{},"additionalProperties":false},"strict":true}}]}' |
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"model","messages":[{"role":"user","content":"check what time it is"}],"tools":[{"type":"function","function":{"name":"foobar","description":"gets the current time","parameters":{"type":"object","properties":{},"additionalProperties":false},"strict":true}}]}' |jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1150 0 883 100 267 108 32 0:00:08 0:00:08 --:--:-- 226
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "The user wants to know the current time. I have a function called \"foobar\" that is described as \"gets the current time\". I should call this function to get the current time and provide it to the user.",
"content": "<|im_end|>",
"tool_calls": [
{
"type": "function",
"function": {
"name": "foobar",
"arguments": "{}"
},
"id": "579vb9nfc5QHdRfBt2hQ8UQCh1aYu1Ke"
}
]
}
}
],
"created": 1763128240,
"model": "model",
"object": "chat.completion",
"usage": {
"completion_tokens": 58,
"prompt_tokens": 80,
"total_tokens": 138
},
"id": "chatcmpl-DYNwF6mohBuG6qNaOS1k6bJH7xsAqrTs",
"timings": {
"prompt_n": 80,
"prompt_ms": 1986.269,
"prompt_per_token_ms": 24.8283625,
"prompt_per_second": 40.2765184373315,
"predicted_n": 58,
"predicted_ms": 5878.277,
"predicted_per_token_ms": 101.34960344827586,
"predicted_per_second": 9.866836829907811
}
}
|
This result is still as expected. Could you try another model, such as Qwen3-Coder-30B, to rule out the Zed editor as the cause? |
|
@hksdpc255 The problem starts when it tries to invoke a tool inside the thinking block, which means the block wasn't properly closed. I don't know whether it's a parser issue or the LLM itself doesn't generate the closing token. Would it be possible for the parsing engine to add it when a tool call is detected and the thinking context is not closed yet? |
Qwen3-Coder-30B-UD8 works very well for a lot of tasks for me. I will retry ik_llama.cpp with Kimi K2 Thinking. I am not sure how to recreate the case @moooV252 described here (a tool call from inside the think block); if I can test this with curl it will be much easier to confirm. |
|
For the current implementation, when the model generates a tool-call scope start followed by a tool-call function start, the grammar forces it to produce a complete and valid tool-call message. If this happens during reasoning, the parser will simply ignore it. In llama.cpp, the grammar system and the parser live in separate modules. This separation complicates the implementation and makes it difficult to keep their behaviors aligned. |
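To illustrate the behavior described above, here is a minimal C++ sketch of a parser that treats tool-call syntax encountered inside an open reasoning block as plain reasoning text; all type names and marker strings are hypothetical and do not correspond to the actual implementation in this PR.
#include <string>

// Minimal sketch: tool-call markers seen while a reasoning block is open are
// kept as reasoning text instead of starting a tool call. Hypothetical names.
struct reasoning_aware_parser_sketch {
    bool in_reasoning = false;
    std::string reasoning, content;

    void feed(const std::string & piece) {
        // The real parser matches markers incrementally across pieces;
        // this sketch only handles whole markers for brevity.
        if (piece == "<think>")  { in_reasoning = true;  return; }
        if (piece == "</think>") { in_reasoning = false; return; }
        if (in_reasoning) {
            reasoning += piece;  // "<tool_call>" here stays inside the reasoning text
        } else {
            content += piece;    // outside reasoning, tool-call markers would be parsed
        }
    }
};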
|
Hi @hksdpc255, sometimes Kimi responds with "content":null; maybe this is the issue? {
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "The user is asking about disk space left on their system. This is a system information query, not related to the codebase. I should use the terminal tool to check disk space. I'll use the `df` command which is standard for checking disk space on Unix-like systems (which macOS is). I'll make it human-readable with the `-h` flag.",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "terminal",
"arguments": "{\"command\":\"df -h\",\"cd\":\"/\"}"
},
"id": "WVZYS6czSGcckk3o6YFELdIW8RRY4pWX"
}
]
}
}
],
"created": 1763196056,
"model": "a",
"object": "chat.completion",
"usage": {
"completion_tokens": 97,
"prompt_tokens": 3257,
"total_tokens": 3354
},
"id": "chatcmpl-kgVZ8PxGli2rCPBd5nzWCdy5Cf7xXRDt",
"timings": {
"prompt_n": 3257,
"prompt_ms": 26644.305,
"prompt_per_token_ms": 8.18062787841572,
"prompt_per_second": 122.24000588493487,
"predicted_n": 97,
"predicted_ms": 9068.085,
"predicted_per_token_ms": 93.48541237113402,
"predicted_per_second": 10.696856061671236
}
} |
Nope. This is because the content is an empty string. |
|
I mean, maybe Zed is not able to handle "content":null. Can we change this into "content":""? |
|
I can confirm that the Zed editor handles null content correctly. Changing null to "" would cause massive unit-test failures. I used to change null to "", but the maintainers rejected it. |
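If a client really cannot cope with a null content field, one option is to normalize it on the client side rather than in the server; a small sketch using nlohmann::json (an assumption about the client's JSON library, not a change proposed here):
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch of a client-side workaround: rewrite "content": null to "" before the
// response reaches code that cannot handle null content.
void normalize_null_content(json & response) {
    for (auto & choice : response["choices"]) {
        auto & msg = choice["message"];
        if (msg.contains("content") && msg["content"].is_null()) {
            msg["content"] = "";
        }
    }
}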
|
I've run a test session. I still get the same result (a tool call inside the thinking block), but now with a log trace - maybe it will help narrow the problem down.
And here's the relevant portion of the logs: And the incorrect one: I've noticed that when a tool use is wrongly invoked inside a thinking block, the block is actually empty - no thoughts are given out by the LLM - but when it outputs any text at all, the block is closed correctly. So it narrows down to tracking whether the opening of the thinking block is immediately followed by a tool call; if so, insert the missing closing token. |
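The workaround suggested here could look roughly like the following sketch; the marker strings and the injection point are assumptions for illustration, not code from this PR.
#include <string>

// Sketch of the suggested heuristic: if a tool-call opener arrives while the
// reasoning block is still open, inject the missing close marker first.
std::string autoclose_reasoning(bool & reasoning_open, const std::string & marker) {
    if (marker == "<tool_call>" && reasoning_open) {
        reasoning_open = false;
        return "</think><tool_call>";   // close the dangling think block, then continue
    }
    if (marker == "<think>")  reasoning_open = true;
    if (marker == "</think>") reasoning_open = false;
    return marker;
}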
|
@moooV252 Thanks for the detailed log, that makes the issue clear. Kimi-K2 does, in fact, emit two different tool-call formats:
It’s unclear why Kimi-K2 uses two incompatible formats, but this behavior is model-side rather than parser-side. The current implementation only supports a single tool-call syntax, so the second form is parsed as plain text. Supporting both formats simultaneously would require additional special-case handling, since the two syntaxes differ structurally. This issue documents the root cause. Whether or how to support the second format is a separate design question, as it would involve adding non-standard hacks to accommodate Kimi-K2’s inconsistent behavior. Now we should ask the maintainer @ikawrakow whether partial implementation for a model is acceptable. |
|
@moooV252 See: ggml-org/llama.cpp#16932 (comment). One of the maintainers confirmed that this problem is caused by Roo Code. |
|
The upstream PR is now ready to merge, and all relevant changes have already been synced into this PR. |
|
Kimi K2 Thinking with the new template: Zed still shows "Tool call not found".
I tested with GLM-4.5 Air and MiniMax M2. LGTM
|
@calvin2021y Can you try with mainline's PR and see if it behaves the same? If so, it can be fixed later. |
Rebuilt with the mainline patch and the new template; Zed still shows:
In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
|
Kimi-K2 is too large for my hardware to run, even with the most aggressive quantization. Unfortunately, I'm unable to test it myself; I can only run models smaller than 120 GB.
The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in #723. Then, after #958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off. This commit reverts the logic from ggml-org/llama.cpp#2849, such that the logprobs of the stop token will always be included in the response, when logprobs is enabled. From testing, this matches with the behavior of Fireworks inference server, for both chat completions and text completions endpoints. Also fix logprobs param handling for the text completion endpoint.





This patch is ported from upstream PR #16932 and additionally incorporates the most recent changes from minja to ensure compatibility.
Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.
Introduces a generalized implementation for almost all XML-style tool-call formats.
Supported models
Grammar-constrained tool-call outputs
Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.
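As an illustration of what a grammar for an XML-style tool call can look like, here is a hand-written GBNF sketch embedded as a C++ string; it is generic and hypothetical, not the grammar actually produced by the generator in this PR, which is built per model and per tool schema.
// Hand-written GBNF sketch for a generic XML-style tool call (illustrative only).
static const char * xml_toolcall_grammar_sketch = R"GBNF(
root   ::= "<tool_call>" ws invoke ws "</tool_call>"
invoke ::= "<invoke name=\"" name "\">" ws param* "</invoke>"
param  ::= "<parameter name=\"" name "\">" value "</parameter>" ws
name   ::= [a-zA-Z_] [a-zA-Z0-9_-]*
value  ::= [^<]*
ws     ::= [ \t\n]*
)GBNF";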
Streaming support for tool-call parsing
The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.
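Conceptually, streaming support means the parser can be fed newly generated text piece by piece and report a delta after each piece, rather than waiting for the full message. A rough usage sketch with hypothetical type and function names (the real API in this PR differs):
#include <string>
#include <vector>

// What changed since the previous feed() call: new reasoning text, new visible
// content, or a completed tool call serialized as JSON. Hypothetical types.
struct parse_delta { std::string reasoning, content, tool_call_json; };

struct streaming_parser_sketch {
    parse_delta feed(const std::string & piece) {
        // A real implementation tracks partial markers across calls; this stub
        // just forwards the piece as content to keep the sketch self-contained.
        return { "", piece, "" };
    }
};

void stream_to_client(streaming_parser_sketch & parser, const std::vector<std::string> & pieces) {
    for (const auto & piece : pieces) {
        parse_delta d = parser.feed(piece);  // emitted immediately, not at end of turn
        (void) d;                            // forward d to the client as an SSE chunk here
    }
}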
Automatic chat-template fixing
A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.
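The patcher's idea is straightforward: apply a short list of textual fixes to the official template before it is handed to the Jinja engine. A minimal sketch of that shape follows; the example fix is hypothetical, and the fixes actually applied by this PR are model-specific and not shown here.
#include <string>
#include <utility>
#include <vector>

// Minimal sketch of automatic chat-template patching: run (pattern -> replacement)
// fixes over the official template text before rendering. The fix below is hypothetical.
std::string fix_chat_template_sketch(std::string tmpl) {
    const std::vector<std::pair<std::string, std::string>> fixes = {
        { "{{ tool.arguments }}", "{{ tool.arguments | tojson }}" },  // hypothetical fix
    };
    for (const auto & [from, to] : fixes) {
        for (size_t pos = tmpl.find(from); pos != std::string::npos; pos = tmpl.find(from, pos + to.size())) {
            tmpl.replace(pos, from.size(), to);
        }
    }
    return tmpl;
}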
In-context reasoning
The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.
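For example, a single generation might look like the following (tag names are illustrative and depend on the model's template; get_time and get_weather are made-up tools). The parser keeps both reasoning blocks as reasoning_content and extracts both tool calls:
<think>I should check the time first.</think>
<tool_call><invoke name="get_time"></invoke></tool_call>
<think>The user also asked about the weather, so a second call is needed.</think>
<tool_call><invoke name="get_weather"><parameter name="city">Paris</parameter></invoke></tool_call>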
Enhanced unit tests
Adds a unit test for the streaming-mode parser. It simulates the generation phase by feeding content character by character, comparing the parsed results and verifying that streaming and non-streaming modes reach the same final state.
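A sketch of what such a test can look like, using a trivial stand-in parser so the example stays self-contained (the real test exercises the actual parser in the repository):
#include <cassert>
#include <string>

// Stand-in parser that just accumulates text; the real test uses the actual
// streaming tool-call parser.
struct tiny_parser_sketch {
    std::string state;
    void feed(const std::string & piece) { state += piece; }
};

void test_streaming_matches_non_streaming(const std::string & input) {
    tiny_parser_sketch streamed;
    for (char c : input) {
        streamed.feed(std::string(1, c));   // simulate generation character by character
    }

    tiny_parser_sketch whole;
    whole.feed(input);                      // non-streaming: the whole message at once

    assert(streamed.state == whole.state);  // both modes must reach the same final state
}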
Additional Notes
--reasoning-format none can be passed if needed. Use -lv 1 in the command line to enable more detailed logging.