
Conversation

@hksdpc255
Contributor

@hksdpc255 hksdpc255 commented Nov 14, 2025

This patch is ported from upstream PR #16932 and additionally incorporates the most recent changes from minja to ensure compatibility.


Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.

Introduces a generalized implementation for almost all XML-style tool-call formats.

Supported models

  • GLM 4.5/4.6
  • MiniMax M2
  • SeedOSS
  • Kimi-K2 (Thinking and non-thinking)
  • Qwen3-Coder (Thinking and non-thinking)
  • Apriel-1.5
  • Xiaomi-MiMo

Grammar-constrained tool-call outputs

Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.
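For illustration, a grammar for a hypothetical XML-style tool-call format could look like the following GBNF sketch (the tags and rule names here are invented for the example; the generator in this PR derives the actual rules from each model's template and tool schemas):

```gbnf
# Illustrative sketch only, not the generator's real output.
root  ::= "<tool_call>" ws func ws "</tool_call>"
func  ::= "<function=" name ">" ws param* "</function>"
param ::= "<parameter=" name ">" [^<]* "</parameter>" ws
name  ::= [a-zA-Z_] [a-zA-Z0-9_-]*
ws    ::= [ \t\n]*
```

Constraining sampling to such a grammar guarantees the tags are balanced and the argument structure matches the tool schema, which is what makes downstream parsing reliable.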

Streaming support for tool-call parsing

The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.
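A minimal sketch of the core streaming idea, with hypothetical names (the real parser is grammar-aware and far more complete; this only shows how a partial tag is held back so that clean content deltas can be emitted):

```cpp
#include <cstdio>
#include <string>

struct parsed_msg {
    std::string content;    // plain content confirmed so far
    std::string tool_args;  // tool-call payload accumulated so far
};

// Hypothetical incremental parse: returns only state that later tokens
// cannot revise. Text that might still be the start of "<tool_call>"
// (e.g. a trailing "<tool_") is held back instead of emitted.
static parsed_msg parse_partial(const std::string & buffer) {
    parsed_msg out;
    const std::string tag = "<tool_call>";
    const size_t pos = buffer.find(tag);
    if (pos == std::string::npos) {
        size_t safe = buffer.size();
        for (size_t k = tag.size() - 1; k > 0; --k) {  // longest partial-tag suffix
            if (buffer.size() >= k && tag.compare(0, k, buffer, buffer.size() - k, k) == 0) {
                safe = buffer.size() - k;
                break;
            }
        }
        out.content = buffer.substr(0, safe);
    } else {
        out.content   = buffer.substr(0, pos);
        out.tool_args = buffer.substr(pos + tag.size());  // still incomplete
    }
    return out;
}

int main() {
    const std::string gen = "Checking disk space. <tool_call>{\"command\":\"df -h\"}";
    std::string buffer, emitted;
    for (char c : gen) {                          // simulate token-by-token arrival
        buffer += c;
        const parsed_msg m = parse_partial(buffer);
        if (m.content.size() > emitted.size()) {  // emit only the new delta
            std::printf("%s", m.content.c_str() + emitted.size());
            emitted = m.content;
        }
    }
    std::printf("\n[tool args so far: %s]\n", parse_partial(buffer).tool_args.c_str());
    return 0;
}
```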

Automatic chat-template fixing

A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.

In-context reasoning

The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.
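For example, a single generation may now interleave several reasoning blocks with tool calls, all of which survive parsing (illustrative only; the exact tag syntax varies per model):

```
<think>plan the first call</think>
<tool_call>...</tool_call>
<think>inspect the result and plan the next step</think>
final answer text
```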

Enhanced unit tests

Adds a unit test for the streaming-mode parser. It simulates the generation phase by feeding content character-by-character, comparing the parsed results, and verifying that streaming and non-streaming modes reach the same final state.
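The shape of that check, with a trivial stand-in parser (the real test drives this PR's grammar-aware streaming parser):

```cpp
#include <cassert>
#include <string>

// Stand-in one-shot parse: strip <think>...</think> spans, keep the rest.
static std::string parse_all(const std::string & s) {
    std::string out;
    bool in_think = false;
    for (size_t i = 0; i < s.size(); ) {
        if (!in_think && s.compare(i, 7, "<think>") == 0)      { in_think = true;  i += 7; }
        else if (in_think && s.compare(i, 8, "</think>") == 0) { in_think = false; i += 8; }
        else { if (!in_think) out += s[i]; ++i; }
    }
    return out;
}

int main() {
    const std::string input = "<think>plan</think>hello <think>more</think>world";
    std::string buffer, streamed;
    for (char c : input) {             // feed character-by-character, as in the test
        buffer += c;
        streamed = parse_all(buffer);  // stand-in for the streaming parser's state
    }
    assert(streamed == parse_all(input));  // both modes must agree at the end
    return 0;
}
```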

Additional Notes

  • All unit tests have passed.
  • Community testing is welcome! Please try it out with your model integrations.
  • If your OpenAI-compatible client does not support sending reasoning_content back to the server, use the option --reasoning-format none.
  • When reporting issues, it's recommended to add -lv 1 to the command line to enable more detailed logging.

@hksdpc255 hksdpc255 changed the title from "port for common: Generalized XML-style tool-call parsing with streaming support" to "common: Generalized XML-style tool-call parsing with streaming support" Nov 14, 2025
@hksdpc255
Contributor Author

hksdpc255 commented Nov 14, 2025

Screenshot for Zed editor using Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL:


Screenshot for Zed editor using MiniMax-M2:


@calvin2021y

I tested this patch with https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/tree/main/Q4_X and got this error:
[screenshot of the error]

Qwen3-Coder works great.

@hksdpc255
Contributor Author

@calvin2021y Use the template provided in this patch.

@calvin2021y

@hksdpc255
Contributor Author

Oh! I just realized I forgot to include the templates in this PR. I’ll add them shortly.

@calvin2021y

calvin2021y commented Nov 14, 2025

I tried running with --chat-template Kimi-K2.jinja, downloaded from https://github.com/ggml-org/llama.cpp/blob/374c06199910ab5d7c9d83311c07513eb0220927/models/templates/Kimi-K2.jinja, and the response in the web UI looks like this:

2} | 28 ++++++++++++++++++-------
1 file changed, 20 insertions(+), 8 deletions(-)
rename common/src/jinja/{gm.jinja2 => Kimi-K2.jinja2} (81%)

diff --git a/common/src/jinja/gm.jinja2 b/common/src

It does not work in Zed either.

I tried with --special and without it; neither works.

I will rebuild with your new commit and test again.

@hksdpc255
Contributor Author

@calvin2021y The new commit doesn't change any source code.

Could you provide more logs or a screenshot? That will help me figure out what's going on.

@hksdpc255
Contributor Author

I think it would be best to ask @ikawrakow to help confirm whether this issue is caused by something in my PR or by a misconfiguration elsewhere. I’m not fully certain which side the problem originates from, so a second opinion would be very helpful.

@ikawrakow
Owner

The model responding with the first line of the Iliad in ancient Greek to "hi" does not seem right. It is probably best to first establish that the model is working (no tool calling and such) on the current main branch before trying to diagnose if there are bugs in this PR, or perhaps in PR #954 that appears to also have been merged for this test.

@calvin2021y

calvin2021y commented Nov 14, 2025

I removed PR #954 and tested without --chat-template Kimi-K2.jinja; I get the Zed template error like before, but the built-in web UI works as expected.

Then I added --chat-template Kimi-K2.jinja to test with the built-in web UI; the response is random for the simple input hi.

I guess the template feeds bad input into the model.

Kimi K2 Thinking needs --special to show think tokens; maybe this is related?

I will test mainline with the -lv 1 argument.

@calvin2021y

With --chat-template Kimi-K2.jinja, the response content includes the keyword Kimi-K2.jinja for the simple input hi.

@hksdpc255
Contributor Author

Is there a way to display the prompt just rendered by Minja?

@hksdpc255
Contributor Author

@calvin2021y Wait, should the --chat-template option actually be --chat-template-file? I just realized you might be using the wrong argument.

@calvin2021y

calvin2021y commented Nov 14, 2025

@calvin2021y Wait, should the --chat-template option actually be --chat-template-file? I just realized you might be using the wrong argument.

I tried with --chat-template-file and your jinja file; the built-in web UI works with Kimi K2 Thinking.

Zed shows: Tool call not found

@ikawrakow ik_llama --help does not show any --chat-template-file information.

@ikawrakow
Owner

ik_llama --help does not show any --chat-template-file information.

Thanks! Fixed now.

@hksdpc255
Contributor Author

@calvin2021y Would you be able to share a sample response from the model so I can better understand the issue?

@hksdpc255
Contributor Author

@calvin2021y Could you try --log-enable and grab llama.log from the CWD?

@moooV252

moooV252 commented Nov 14, 2025

I've compiled the latest repo with this PR; it doesn't quite work with the Kimi K2 template included in the PR.
I'm using the unsloth Kimi K2 Thinking UD-Q4_K_XL quant with this.

Here is the command string and a log:

llama-server.exe -m \llamacpp_models\UD-Q4_K_XL\Kimi-K2-Thinking-UD-Q4_K_XL-00001-of-00014.gguf --port 11434 --host 0.0.0.0 --ctx-size 204800 --temp 1.0 --min-p 0.01 --jinja --numa distribute --threads 96 -ctk q8_0 -ctv q8_0 -amb 512 -mla 3 -ngl 42 -ot exps=CPU --parallel 1 --timeout 3600 --chat-template Kimi-K2.jinja
.
startup log omitted
.
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
======================================= HAVE_FANCY_SIMD is defined
Failed to infer a tool call example (possible template bug)
INFO [                    init] initializing slots | tid="10840" timestamp=1763107735 n_slots=1
INFO [                    init] new slot | tid="10840" timestamp=1763107735 id_slot=0 n_ctx_slot=204800
INFO [                    main] model loaded | tid="10840" timestamp=1763107735
INFO [                    main] chat template | tid="10840" timestamp=1763107735 chat_template="Kimi-K2.jinja"
INFO [                    main] chat template | tid="10840" timestamp=1763107735 chat_example="Kimi-K2.jinja" built_in=false
INFO [                    main] HTTP server listening | tid="10840" timestamp=1763107735 hostname="0.0.0.0" port="11434" n_threads_http="47"
INFO [            update_slots] all slots are idle | tid="10840" timestamp=1763107735
INFO [      log_server_request] request | tid="12436" timestamp=1763107921 remote_addr="192.168.10.112" remote_port=62449 status=200 method="GET" path="/v1/models" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="10840" timestamp=1763107944 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="10840" timestamp=1763107944 id_slot=0 id_task=0 p0=0
INFO [      log_server_request] request | tid="13876" timestamp=1763107982 remote_addr="192.168.10.112" remote_port=62459 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] slot released | tid="10840" timestamp=1763107982 id_slot=0 id_task=0 n_ctx=204800 n_past=107 n_system_tokens=0 n_cache_tokens=107 truncated=false
INFO [            update_slots] all slots are idle | tid="10840" timestamp=1763107982
INFO [            update_slots] all slots are idle | tid="10840" timestamp=1763107983

The generated output is the following:

2", "r") as f:
template = Template(f.read())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())
# print(template.render())

If I run it without the --chat-template argument, using unsloth's bundled template, I get an ordinary log without any errors, but infinite sequences of question marks or exclamation marks in the chat output, like "???????????????????.... many more" or "!!!!!!!!!!!!!!!!!!!!!!!!!!... many more"

However, this applies to the RooCode plugin in architect or code modes. I haven't tested the web UI, though.

UPD:
This behavior happens only if I continue an already existing chat with 80k+ tokens (files included) and the context is recalculated from scratch.
I've launched it without a custom template (using the unsloth one) and it looks like it's working OK, with occasional <|im_end|> tokens popping up in chat. However, the thinking is now properly recognized by RooCode without pouring everything into the chat itself.

@calvin2021y

hi @moooV252

you need to use --chat-template-file

@hksdpc255
Contributor Author

hi @hksdpc255

this is the log:

[1763117581] Matched tool start: "<|tool_calls_section_begin|><|tool_call_begin|>"
[1763117581] Partial parse: incomplete tool calls
[1763117581] Parsing input with format Kimi K2: <think>The user is asking "how many space left on my disk" which means they want to know their disk usage/available space. This is a system query that I need to run using the terminal tool.

Let me use the terminal tool to check disk space. On macOS (which is the operating system mentioned in the system info), I can use the `df` command to check disk space.

The correct format for the terminal tool is:
- command: a shell one-liner
- cd: a working directory (must be one of the root directories)

I'll use `df -h` to get human-readable disk space information. I need to specify a cd parameter that is one of the root directories in the project.

Looking at the tool schema:
- command: "The one-liner command to execute"
- cd: "Working directory for the command. This must be one of the root directories of the project."

I need to provide both parameters. Since I don't know the root directories of the project (they weren't listed in the context), I need to be careful. Let me just pick a common directory or use the current directory approach.

Actually, looking at the system info again, it says "Operating System: macos" and "Default Shell: sh". But I don't see any root directories listed in the context section as mentioned in the instructions. The instructions say "the following root directories:" but then nothing is listed.

Let me try using "/" as the cd parameter or maybe just "." but I need to check what's valid. Since no root directories were provided in the context, I should probably use a generic approach.</think><|tool_calls_section_begin|><|tool_call_begin|>functions.terminal:0<|tool_call_argument_begin|>{"command": "df -h", "cd": "/"}<|tool_call_end|><|tool_calls_section_end|>
[1763117581] Matched tool start: "<|tool_calls_section_begin|><|tool_call_begin|>"
[1763117581] Parsing input with format Kimi K2: <think>The user is asking "how many space left on my disk" which means they want to know their disk usage/available space. This is a system query that I need to run using the terminal tool.

Let me use the terminal tool to check disk space. On macOS (which is the operating system mentioned in the system info), I can use the `df` command to check disk space.

The correct format for the terminal tool is:
- command: a shell one-liner
- cd: a working directory (must be one of the root directories)

I'll use `df -h` to get human-readable disk space information. I need to specify a cd parameter that is one of the root directories in the project.

Looking at the tool schema:
- command: "The one-liner command to execute"
- cd: "Working directory for the command. This must be one of the root directories of the project."

I need to provide both parameters. Since I don't know the root directories of the project (they weren't listed in the context), I need to be careful. Let me just pick a common directory or use the current directory approach.

Actually, looking at the system info again, it says "Operating System: macos" and "Default Shell: sh". But I don't see any root directories listed in the context section as mentioned in the instructions. The instructions say "the following root directories:" but then nothing is listed.

Let me try using "/" as the cd parameter or maybe just "." but I need to check what's valid. Since no root directories were provided in the context, I should probably use a generic approach.</think><|tool_calls_section_begin|><|tool_call_begin|>functions.terminal:0<|tool_call_argument_begin|>{"command": "df -h", "cd": "/"}<|tool_call_end|><|tool_calls_section_end|><|im_end|>

The size is huge; let me know if you need the full data.

@calvin2021y The log looks as expected. What is the Zed editor complaining about?

@calvin2021y

Here are the logs from my test with a Q1 quant (the Q4 is too slow):

[1763123667] Parsing input with format Kimi K2: <think>The user wants a title for this conversation. The conversation appears to be about checking disk space. I need to generate a concise title of 3-7 words, omitting punctuation.

Key points:
- The user asked "how many space left on my disk"
- I attempted to check disk space (though there was a JSON parsing error)
- The subject is clearly about disk space

Possible titles:
- "Check disk space left" (4 words)
- "Disk space available" (3 words)
- "How much disk space" (4 words)
- "Check available disk space" (4 words)
- "Disk space remaining" (3 words)
- "Available disk space" (3 words)

.........................

I need to provide both the command and the cd parameter. Looking at the System Information, this is macOS
[1763123777] Parsing input with format Kimi K2: <think>The user is asking "how many space left on my disk" which means they want to know their disk space usage/available space. I need to use the terminal tool to run a command to check disk space. The terminal tool requires JSON with "command" and "cd" parameters.

Let me check what commands are available on macOS to check disk space:
- `df -h` - shows disk space in human-readable format
- `df -H` - shows disk space in human-readable format (with different units)
- `du` - shows disk usage

The proper tool call should be:
```json
{
  "command": "df -h",
  "cd": "backend"  // or whichever root directory is available
}

Zed shows an error loop: Tool call not found

[screenshot of the Zed error]

@hksdpc255
Contributor Author

hksdpc255 commented Nov 14, 2025

@calvin2021y Would sending requests using curl work for you?

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"model","messages":[{"role":"user","content":"check what time it is"}],"tools":[{"type":"function","function":{"name":"foobar","description":"gets the current time","parameters":{"type":"object","properties":{},"additionalProperties":false},"strict":true}}]}'

@calvin2021y

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"model","messages":[{"role":"user","content":"check what time it is"}],"tools":[{"type":"function","function":{"name":"foobar","description":"gets the current time","parameters":{"type":"object","properties":{},"additionalProperties":false},"strict":true}}]}' |jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1150    0   883  100   267    108     32  0:00:08  0:00:08 --:--:--   226
{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user wants to know the current time. I have a function called \"foobar\" that is described as \"gets the current time\". I should call this function to get the current time and provide it to the user.",
        "content": "<|im_end|>",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "foobar",
              "arguments": "{}"
            },
            "id": "579vb9nfc5QHdRfBt2hQ8UQCh1aYu1Ke"
          }
        ]
      }
    }
  ],
  "created": 1763128240,
  "model": "model",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 58,
    "prompt_tokens": 80,
    "total_tokens": 138
  },
  "id": "chatcmpl-DYNwF6mohBuG6qNaOS1k6bJH7xsAqrTs",
  "timings": {
    "prompt_n": 80,
    "prompt_ms": 1986.269,
    "prompt_per_token_ms": 24.8283625,
    "prompt_per_second": 40.2765184373315,
    "predicted_n": 58,
    "predicted_ms": 5878.277,
    "predicted_per_token_ms": 101.34960344827586,
    "predicted_per_second": 9.866836829907811
  }
}

@hksdpc255
Contributor Author

hi @hksdpc255

I am using curl to make the requests, since Zed sends a lot of requests and slows down the process. Hope it works for you.

Parsed message: {"role":"assistant","content":"I'll check your disk space usage for you.","reasoning_content":"The user is asking about disk space left on their system. I should use the terminal tool to check disk space. On macOS (which is mentioned in the system info), I can use commands like `df -h` to check disk space in a human-readable format.\n\nLet me execute this command to see the disk space usage.","tool_calls":[{"type":"function","function":{"name":"terminal","arguments":"{\"command\":\"df -h\",\"cd\":\".\"}"}}]}

response:

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking about disk space left on their system. I should use the terminal tool to check disk space. On macOS (which is mentioned in the system info), I can use commands like `df -h` to check disk space in a human-readable format.\n\nLet me execute this command to see the disk space usage.",
        "content": "I'll check your disk space usage for you.",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "terminal",
              "arguments": "{\"command\":\"df -h\",\"cd\":\".\"}"
            },
            "id": "bwWDN3cpUVEQWVKPmlZz7KnP7JkqDlBk"
          }
        ]
      }
    }
  ],
  "created": 1763194914,
  "model": "a",
  "system_fingerprint": "b7062-9b17d74ab",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 101,
    "prompt_tokens": 6494,
    "total_tokens": 6595
  },
  "id": "chatcmpl-UJSZcwuRwYYz8RSIPKUzkT4YtCuYYma8",
  "timings": {
    "cache_n": 0,
    "prompt_n": 6494,
    "prompt_ms": 353355.188,
    "prompt_per_token_ms": 54.41256359716662,
    "prompt_per_second": 18.378108544991843,
    "predicted_n": 101,
    "predicted_ms": 19671.08,
    "predicted_per_token_ms": 194.7631683168317,
    "predicted_per_second": 5.134441016965006
  }
}

This result is also as expected. Could you try another model, such as Qwen3-Coder-30B, to check whether the issue is caused by the Zed editor?

@moooV252

@hksdpc255
I don't know how to get the underlying messages actually being sent to and from the server, but before that it was able to successfully make multiple different tool calls, including writing to files (I've rolled that back, so it's not included in the screenshot), just outside the thinking block.

The problem starts when it tries to invoke a tool inside the thinking block, which means the block wasn't properly closed. I don't know whether it's a parser issue or whether the LLM itself doesn't generate the closing token. Would it be possible for the parsing engine to add it when a tool call is detected and the thinking context is not yet closed?

@calvin2021y

Qwen3-Coder-30B

Qwen3-Coder-30B-UD8 works very well for a lot of tasks for me.

I will retry ik_llama.cpp with Kimi K2 Thinking.

I am not sure how to recreate the case @moooV252 described here, a tool call from the think block. If I could test this with curl, it would be much easier to confirm.

@hksdpc255
Contributor Author

For the current implementation, when the model generates a tool-call scope start followed by a tool-call function start, the grammar forces it to produce a complete and valid tool-call message. If this happens during reasoning, the parser will simply ignore it.

In llama.cpp, the grammar system and the parser live in separate modules. This separation complicates the implementation and makes it difficult to keep their behaviors aligned.

@calvin2021y

hi @hksdpc255

Sometimes Kimi responds with "content":null; maybe this is the issue?

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking about disk space left on their system. This is a system information query, not related to the codebase. I should use the terminal tool to check disk space. I'll use the `df` command which is standard for checking disk space on Unix-like systems (which macOS is). I'll make it human-readable with the `-h` flag.",
        "content": null,
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "terminal",
              "arguments": "{\"command\":\"df -h\",\"cd\":\"/\"}"
            },
            "id": "WVZYS6czSGcckk3o6YFELdIW8RRY4pWX"
          }
        ]
      }
    }
  ],
  "created": 1763196056,
  "model": "a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 97,
    "prompt_tokens": 3257,
    "total_tokens": 3354
  },
  "id": "chatcmpl-kgVZ8PxGli2rCPBd5nzWCdy5Cf7xXRDt",
  "timings": {
    "prompt_n": 3257,
    "prompt_ms": 26644.305,
    "prompt_per_token_ms": 8.18062787841572,
    "prompt_per_second": 122.24000588493487,
    "predicted_n": 97,
    "predicted_ms": 9068.085,
    "predicted_per_token_ms": 93.48541237113402,
    "predicted_per_second": 10.696856061671236
  }
}

@hksdpc255
Contributor Author

hi @hksdpc255

Sometimes Kimi responds with "content":null; maybe this is the issue?

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking about disk space left on their system. This is a system information query, not related to the codebase. I should use the terminal tool to check disk space. I'll use the `df` command which is standard for checking disk space on Unix-like systems (which macOS is). I'll make it human-readable with the `-h` flag.",
        "content": null,
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "terminal",
              "arguments": "{\"command\":\"df -h\",\"cd\":\"/\"}"
            },
            "id": "WVZYS6czSGcckk3o6YFELdIW8RRY4pWX"
          }
        ]
      }
    }
  ],
  "created": 1763196056,
  "model": "a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 97,
    "prompt_tokens": 3257,
    "total_tokens": 3354
  },
  "id": "chatcmpl-kgVZ8PxGli2rCPBd5nzWCdy5Cf7xXRDt",
  "timings": {
    "prompt_n": 3257,
    "prompt_ms": 26644.305,
    "prompt_per_token_ms": 8.18062787841572,
    "prompt_per_second": 122.24000588493487,
    "predicted_n": 97,
    "predicted_ms": 9068.085,
    "predicted_per_token_ms": 93.48541237113402,
    "predicted_per_second": 10.696856061671236
  }
}

Nope. This is because the content is an empty string.

@calvin2021y

I mean, maybe Zed is not able to handle "content":null.

Can we change this into "content":"" ?

@hksdpc255
Contributor Author

hksdpc255 commented Nov 15, 2025

I can confirm that the Zed editor correctly handles null content. Changing null to "" would cause massive unit-test failures. I changed null to "" in the past, but the maintainers rejected it.
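For what it's worth, a client that cannot handle null could coalesce it on its own side. A minimal sketch using nlohmann::json (illustrative only, not part of this PR):

```cpp
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Treat a missing or null "content" field as an empty string client-side.
static std::string content_or_empty(const json & message) {
    const auto it = message.find("content");
    return (it == message.end() || it->is_null()) ? std::string() : it->get<std::string>();
}
```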

@moooV252

moooV252 commented Nov 16, 2025

I've run a test session with the --log-enable --log file ... arguments on the next version of the template (k2t) from this comment in the related issue: #955 (comment)

I still get the same result (a tool call inside the thinking block), but now with a log trace; maybe it will help narrow the problem down.

[screenshot of the tool call appearing inside the thinking block]

And here's the relevant portion of the logs.
Correct think-block closure with a tool use afterwards:

[1763320891] Parsing input with format Kimi K2: <think>Let me start by gathering information about the Binance API and then create a comprehensive plan for the trading simulator. I need to use the context7 MCP server to get documentation about Binance API and Python async trading libraries.</think><use_mcp_tool>
<server_name>context7</server_name>
<tool_name>resolve-library-id</tool_name>
<arguments>
{
  "libraryName": "binance python api"
}
</arguments>
</use_mcp_tool>

And an incorrect one:

[1763321121] Parsing input with format Kimi K2: <think><use_mcp_tool>
<server_name>context7</server_name>
<tool_name>get-library-docs</tool_name>
<arguments>
{
  "context7CompatibleLibraryID": "/binance/binance-connector-python",
  "topic": "async trading",
  "tokens": 3000
}
</arguments>
</use_mcp_tool>

I've noticed that when a tool use is wrongly invoked inside a thinking block, the block is actually empty: no thoughts are given out by the LLM. But when it outputs any text at all, the </think> token gets inserted correctly.

So it narrows down to tracking whether the opening of the thinking block is immediately followed by a tool call; if so, insert the </think> token manually in the parser to close the thinking block, or even discard the first thinking token entirely (this would require postponing the output by at least one token). A sketch of that heuristic follows.
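A minimal sketch of the idea, using the tags from the logs above (hypothetical helper; the real parser works incrementally on token streams rather than whole strings):

```cpp
#include <cstdio>
#include <string>

// If a tool call opens before the reasoning block is closed, inject the
// missing </think> so the tool call is parsed outside the block.
static std::string close_dangling_think(std::string s) {
    const std::string open  = "<think>";
    const std::string close = "</think>";
    const std::string tool  = "<use_mcp_tool>";
    const size_t o = s.find(open);
    if (o == std::string::npos) return s;
    const size_t c = s.find(close, o);
    const size_t t = s.find(tool, o);
    if (t != std::string::npos && (c == std::string::npos || t < c)) {
        s.insert(t, close);
    }
    return s;
}

int main() {
    const std::string fixed = close_dangling_think("<think><use_mcp_tool>...</use_mcp_tool>");
    std::printf("%s\n", fixed.c_str());  // <think></think><use_mcp_tool>...</use_mcp_tool>
    return 0;
}
```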

@hksdpc255
Contributor Author

@moooV252 Thanks for the detailed log, that makes the issue clear.

Kimi-K2 does, in fact, emit two different tool-call formats:

  1. the regular documented tool-call format, and
  2. a second, undocumented format wrapped inside <use_mcp_tool>...</use_mcp_tool> blocks.

It’s unclear why Kimi-K2 uses two incompatible formats, but this behavior is model-side rather than parser-side. The current implementation only supports a single tool-call syntax, so the second form is parsed as plain text. Supporting both formats simultaneously would require additional special-case handling, since the two syntaxes differ structurally.

This issue documents the root cause. Whether or how to support the second format is a separate design question, as it would involve adding non-standard hacks to accommodate Kimi-K2’s inconsistent behavior.

Now we should ask the maintainer @ikawrakow whether a partial implementation for a model is acceptable.
If not, we may need to remove the current Kimi-K2 support and wait for a more robust implementation.

@hksdpc255
Contributor Author

@moooV252 See: ggml-org/llama.cpp#16932 (comment)

One of the maintainers confirmed that this problem is caused by Roo Code.

@hksdpc255
Contributor Author

The upstream PR is now ready to merge, and all relevant changes have already been synced into this PR.

@calvin2021y

calvin2021y commented Nov 17, 2025

Kimi K2 Thinking with the new template: Zed still shows Tool call not found

Collaborator

@firecoperana firecoperana left a comment


I tested with GLM-4.5 Air and MiniMax M2. LGTM

@firecoperana
Collaborator

@calvin2021y Can you try with mainline's PR and see if it behaves the same? If so, it can be fixed later.

@calvin2021y

@calvin2021y Can you try with mainline's PR and see if it behaves the same? If so, it can be fixed later.

I rebuilt with the mainline patch and the new template; Zed still shows: Tool call not found

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
@hksdpc255
Copy link
Contributor Author

Kimi-K2 is too large for my hardware to run, even with the most aggressive quantization. Unfortunately, I'm unable to test it myself; I can only run models smaller than 120 GB.

@ikawrakow ikawrakow merged commit da5de88 into ikawrakow:main Nov 18, 2025
@hksdpc255 hksdpc255 deleted the xml_toolcall branch November 19, 2025 03:05
sayap added a commit to sayap/ik_llama.cpp that referenced this pull request Nov 22, 2025
The logic to skip the logprobs of the stop token was originally from
ggml-org/llama.cpp#2849, and was later modified as part of
ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD.

The latter change wasn't included in ikawrakow#723. Then, after ikawrakow#958 got merged,
the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2,
resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, such that
the logprobs of the stop token will always be included in the response,
when logprobs is enabled. From testing, this matches with the behavior
of Fireworks inference server, for both chat completions and text
completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
ikawrakow pushed a commit that referenced this pull request Nov 24, 2025
(same commit message as above)
