
[Tool Parser] Kimi K2: guided decoding for tool_choice="auto" — 75% → 100% schema accuracy #36891

Open

ZhanqiuHu wants to merge 2 commits into vllm-project:main from ZhanqiuHu:kimi-k2-guided-tool-choice-auto

Conversation


@ZhanqiuHu ZhanqiuHu commented Mar 12, 2026

Co-authored with @yzong-rh

Purpose

The Kimi K2 tool parser currently relies on post-hoc parsing for tool_choice="auto" — the model generates freely and vLLM extracts tool calls afterward. This works most of the time, but the model can hallucinate tool names not in the user's schema (e.g., calling img_gen when only search is available), causing schema validation failures.

This PR adds generation-time enforcement via xgrammar's structural tag mechanism, ensuring that once the model decides to make a tool call, it can only produce tool names and arguments that conform to the provided schema. This is the first tool parser in vLLM to use guided decoding for tool_choice="auto".

For background on Kimi K2 tool calling on vLLM, see: Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM.

Key benefits:

  • 100% schema accuracy on the K2-Vendor-Verifier benchmark (up from 75.4%), eliminating all tool name hallucination
  • Zero overhead for non-tool-call tokens — the grammar only activates after the <|tool_call_begin|> trigger, so free-text generation is unconstrained
  • Composable with existing behavior: tool_choice="required" and forced function calls still use the base class JSON schema path; this only fills the gap for "auto"
  • Generalizable pattern — the same TriggeredTagsFormat approach can be applied to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination issues

Summary

  • This is the first tool parser in vLLM to apply guided decoding for tool_choice="auto", and the approach generalizes to other parsers
  • Add xgrammar structural tag guided decoding to the Kimi K2 tool parser when tool_choice is "auto" or unset
  • Eliminates tool name hallucination (e.g., model calling img_gen when only search/urls_fetch_tool are available) by constraining generation at the token level
  • No change to tool_choice="required" or forced function behavior (handled by base class)

Approach

Override adjust_request() in KimiK2ToolParser to build a TriggeredTagsFormat structural tag from the request's tool definitions:

  • Trigger: <|tool_call_begin|> — free text allowed until this token
  • Per-tool tag: <|tool_call_begin|>{name}:\d+<|tool_call_argument_begin|>{json}<|tool_call_end|>
  • Composable content: sequence of regex (call ID) + const_string (argument marker) + json_schema (parameters)
  • Supports multiple tool calls per response (stop_after_first=False)
  • Respects existing structured_outputs if already set (e.g., by tool_choice="required")
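The bullets above can be sketched as a helper that assembles the structural-tag JSON. The function name and the exact field layout (triggered_tags, tag, sequence, and their keys) are assumptions inferred from this description, not the actual vLLM or xgrammar code:

```python
import json
from typing import Any, Optional

def build_structural_tag(tools: list[dict[str, Any]]) -> Optional[str]:
    """Hypothetical sketch of the structural tag described above.

    One tag per tool; free text is allowed until the trigger token,
    after which generation is constrained to a valid tool call.
    """
    tags = []
    for tool in tools:
        fn = tool.get("function", {})
        name = fn.get("name")
        if not name:
            continue
        tags.append({
            "type": "tag",
            # Tool name (with the functions. prefix) follows the trigger token.
            "begin": f"<|tool_call_begin|>functions.{name}",
            "content": {
                "type": "sequence",
                "elements": [
                    # Call ID suffix, e.g. ":0" in functions.get_weather:0
                    {"type": "regex", "pattern": r":\d+"},
                    {"type": "const_string", "value": "<|tool_call_argument_begin|>"},
                    {"type": "json_schema", "json_schema": fn.get("parameters") or {}},
                ],
            },
            "end": "<|tool_call_end|>",
        })
    if not tags:
        return None
    return json.dumps({
        "type": "triggered_tags",
        "triggers": ["<|tool_call_begin|>"],
        "tags": tags,
        "stop_after_first": False,  # allow multiple tool calls per response
    })
```

The resulting JSON string would be passed to StructuredOutputsParams(structural_tag=...) as described in the PR.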

Evaluation

K2-Vendor-Verifier benchmark, 2000 samples, moonshotai/Kimi-K2-Instruct-0905 (revision 94a4053eb8863059dd8afc00937f054e1365abbd):

| | Tool Calls | Schema Errors | Accuracy |
| --- | --- | --- | --- |
| Baseline (no guided decoding) | 678 | 167 | 75.4% |
| This PR | 677 | 0 | 100% |

The dominant failure mode in the baseline was tool name hallucination — the model generating calls to tools not in the provided schema (e.g., img_gen). With structural tag enforcement, the grammar only allows tokens that match valid tool names after the <|tool_call_begin|> trigger.

Reproduction:

# Server
# Note: changing the pinned revision might result in a regression (still verifying)
vllm serve moonshotai/Kimi-K2-Instruct-0905 \
  --revision 94a4053eb8863059dd8afc00937f054e1365abbd \
  --tensor-parallel-size 8 --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser kimi_k2

# Eval (using K2-Vendor-Verifier)
python tool_calls_eval.py downloads/tool-calls/samples.jsonl \
  --model moonshotai/Kimi-K2-Instruct-0905 \
  --base-url http://localhost:8000/v1 --api-key dummy \
  --concurrency 8 --temperature 0.6 --max-tokens 64000 \
  --output results.jsonl --summary summary.json

Caveats/Limitations

  • Performance not benchmarked — throughput/latency overhead of structural tag guided decoding has not been measured. The grammar only constrains tokens inside tool calls (not free text), so overhead should be minimal, but this needs validation.

Future work

  • Integrate per-function strict parameter for argument schema guidance (add strict to FunctionDefinition in vLLM's protocol layer first).
  • Generalize this approach to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination in tool_choice="auto"
  • Validate tool_choice='required' path.
  • Benchmark throughput/latency overhead of structural tag guided decoding vs. unconstrained generation

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces guided decoding for the Kimi K2 tool parser when tool_choice is 'auto', significantly improving schema accuracy by preventing tool name hallucination. The implementation uses xgrammar's structural tags to constrain generation. The changes are well-targeted for the 'auto' use case. However, I've identified a critical issue where tool_choice='required' is likely non-functional due to an incompatibility between the base class's guidance mechanism and this parser's expectation of special tokens. I've left a comment with details on the issue and a suggestion for a fix.

Comment on lines +109 to +127
def adjust_request(
    self, request: ChatCompletionRequest
) -> ChatCompletionRequest:
    request = super().adjust_request(request)

    if request.structured_outputs is not None:
        return request

    if request.tools and request.tool_choice in ("auto", None):
        tag_json = self._build_structural_tag(request.tools)
        if tag_json is not None:
            request.structured_outputs = StructuredOutputsParams(
                structural_tag=tag_json
            )

    if request.tools and request.tool_choice != "none":
        request.skip_special_tokens = False

    return request
Contributor


critical

While this implementation correctly handles tool_choice='auto', it appears that tool_choice='required' may be broken. For required mode, super().adjust_request() is called, which sets a plain JSON schema constraint. This causes the model to generate raw JSON, without the special tokens (<|tool_call_begin|>, etc.) that this parser's extract_tool_calls method expects. The early return if request.structured_outputs is not None: prevents the new structural tag logic from being applied.

This likely results in required tool calls failing to be parsed. To fix this, you could handle required mode here using structural tags, similar to how you've handled auto. This would involve modifying _build_structural_tag to set "at_least_one": True when tool_choice is "required".

Contributor Author


This PR should not touch the original path, but I haven't verified the tool_choice="required" path. I will note that in the description as future work.

Contributor


I agree that there's no need to handle this as part of this PR. The general problem Gemini is pointing out is that for tool_choice='required' we always guide the model to produce JSON as opposed to producing the model-specific tool calling format. That's a more general problem we need to solve, larger in scope than just this change.


mergify bot commented Mar 12, 2026

Hi @ZhanqiuHu, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@yzong-rh

Great work! Glad you made structured outputs work.
cc @sfeng33


@bbrowning bbrowning left a comment


One note about handling tool call names where we may be guiding in a way that conflicts with the model's training, but otherwise this looks like a great overall improvement to guiding tool call output in auto mode. I'd like to get a few of these merged and in the wild so we can get real-world feedback on applying this type of guiding in auto mode across the board.

}
tags.append({
    "type": "tag",
    "begin": f"<|tool_call_begin|>{name}",
Contributor


Don't we need to also handle functions.{name} here? From https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/docs/tool_call_guidance.md - "The tool ID and arguments are separated by <|tool_call_argument_begin|>. The format of the tool ID is functions.{func_name}:{idx}, from which we can parse the function name."

And in the example tool call parsing code given there:

        # function_id: functions.get_weather:0
        function_name = function_id.split('.')[1].split(':')[0]

You said this passes the K2 vendor verifier at 100%, so perhaps this doesn't matter. But, if we're forcing the model to omit the functions. prefix and it was trained to use that, then it would be better to follow exactly how the model was trained to output to minimize the overall impact of guiding.
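The parsing snippet from the model card can be rounded out into a small self-contained helper (parse_function_id is a hypothetical name, and the IDs below are illustrative):

```python
def parse_function_id(function_id: str) -> tuple[str, int]:
    """Split a Kimi K2 tool call ID of the form functions.{func_name}:{idx}."""
    prefixed_name, idx = function_id.rsplit(":", 1)
    # Strip the leading "functions." namespace if present; this is why a
    # missing prefix went unnoticed during parsing.
    name = prefixed_name.split(".", 1)[1] if "." in prefixed_name else prefixed_name
    return name, int(idx)
```

Because the fallback branch accepts an un-prefixed name, guiding the model to emit search:0 instead of functions.search:0 would still parse, which matches the observation that the verifier passed anyway.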

Contributor Author


Good catch! It should include functions.{name}. I think the parsing code doesn't enforce the functions. prefix, so I didn't run into any issue. I will update the code and rerun.


@ZhanqiuHu ZhanqiuHu Mar 13, 2026


Hey @bbrowning, got the full 2000-sample eval results back on latest main (rebased) with the functions. prefix added. With the structural tag guided decoding enabled:

  • 668 tool calls, all 668 valid — 100% schema accuracy (up from 75.4% baseline)

Full results (summary + per-request JSONL): https://gist.github.com/ZhanqiuHu/63bf52dc445dee053e3ea9602bbda60e


@sfeng33 sfeng33 left a comment


Note - the structural tag is not supported in all guided decoding backends; it works now because it's using the default option, xgrammar. In other cases, will it error out? If so, we should have a good way to handle it.

@ZhanqiuHu force-pushed the kimi-k2-guided-tool-choice-auto branch from 7255371 to ae2885b on March 12, 2026 at 18:31

@bbrowning

If we were following the Chat Completions and Responses APIs exactly, we'd only enable guided decoding for the json schema of a given function when its strict=True parameter is set in the function definition of the tools field in the Chat Completion / Responses request. I'm not sure if we want to handle that nuance here or not. I think that would mean always guiding the function name, but conditionally guiding the function call parameters only when strict is set to true.

This will be important if we want to consider doing this more broadly, but it's nothing I'd consider a blocker to merging this PR. This just felt like the right audience to raise this awareness, as the default in these APIs is to let the user control whether we use structured outputs for function call generation.

@yzong-rh

Some details of how to evaluate vLLM with Kimi K2 Vendor.
run.md in the fork contains how to get the baselines with vLLM v.0.17.1 and the most recent Kimi-K2-Instruct-0905.


@ZhanqiuHu

> If we were following the Chat Completions and Responses APIs exactly, we'd only enable guided decoding for the json schema of a given function when its strict=True parameter is set in the function definition of the tools field in the Chat Completion / Responses request. I'm not sure if we want to handle that nuance here or not. I think that would mean always guiding the function name, but conditionally guiding the function call parameters only when strict is set to true.
>
> This will be important if we want to consider doing this more broadly, but nothing I'd consider a blocker to merge this PR. This just felt like the audience to raise this awareness, as the defaults in these APIs is to let the user control whether we use structured outputs or not for function call generation.

@bbrowning Good point! Noted in the PR description.

I looked into the relevant code. It seems the OpenAI Python client's FunctionDefinition (source) includes strict:

class FunctionDefinition(BaseModel):
    name: str
    description: Optional[str] = None
    parameters: Optional[FunctionParameters] = None
    strict: Optional[bool] = None
    """Whether to enable strict schema adherence when generating the function call."""

While vLLM's FunctionDefinition in vllm/entrypoints/openai/engine/protocol.py currently omits it:

class FunctionDefinition(OpenAIBaseModel):
    name: str
    description: str | None = None
    parameters: dict[str, Any] | None = None

Later we will probably want to update protocol.py as well and add the control flow with well-defined default behavior.
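A rough sketch of what that control flow could look like, assuming a strict field is added to vLLM's protocol layer. FunctionDefinitionSketch and argument_element are hypothetical names, and the free-form fallback schema is an assumption about the default behavior, not verified against vLLM:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FunctionDefinitionSketch:
    """Hypothetical extension of vLLM's FunctionDefinition with strict."""
    name: str
    description: Optional[str] = None
    parameters: Optional[dict[str, Any]] = None
    strict: Optional[bool] = None

def argument_element(fn: FunctionDefinitionSketch) -> dict[str, Any]:
    """Pick the grammar element for a function's arguments.

    The function name would always be guided; arguments are
    schema-constrained only when strict=True, matching the API default
    of leaving structured outputs under user control.
    """
    if fn.strict and fn.parameters is not None:
        return {"type": "json_schema", "json_schema": fn.parameters}
    # Hypothetical fallback: accept any JSON object, unconstrained.
    return {"type": "json_schema", "json_schema": {"type": "object"}}
```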


Enable xgrammar structural tag enforcement for the Kimi K2 tool parser
when tool_choice is "auto" or unset. This prevents tool name
hallucination (e.g., model calling img_gen when only search/urls_fetch_tool
are available) by constraining generation to only produce valid tool names
and schema-compliant arguments.

The structural tag uses xgrammar's TriggeredTagsFormat:
- Free text until <|tool_call_begin|> trigger
- Then constrained to: {tool_name}:\d+<|tool_call_argument_begin|>{json}<|tool_call_end|>
- One tag per tool in the request, with JSON schema from tool parameters
- Supports multiple tool calls per response

Evaluation on K2-Vendor-Verifier (2000 samples):
- Baseline: 167/678 schema errors (75.4% tool call accuracy)
- With this change: 0/677 schema errors (100% tool call accuracy)

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>

gaby commented Mar 16, 2026

@ZhanqiuHu What about the hardcoded 1024 and 8192 bytes in the parser? When used with Claude, we get warnings about the 1024-byte buffer all the time.


ehfd commented Mar 17, 2026

Does this fix #33654?

@chaunceyjiang chaunceyjiang self-assigned this Mar 17, 2026
"begin": f"<|tool_call_begin|>functions.{name}",
"content": {
    "type": "sequence",
    "elements": [
Collaborator


Thanks for this PR, @ZhanqiuHu, regarding the tool calling implementation.

structural_tag has been on the roadmap for quite some time. The main reason we haven't started working on it yet is that it currently only works well within xgrammar. Additionally, tool formats vary significantly across different models, which has slowed down progress.

Could you elaborate on your rationale for using regex, const_string, and json_schema simultaneously in this implementation?


Csrayz commented Mar 23, 2026

We also need to consider speculative decoding scenarios, especially those involving the simultaneous generation of multiple tokens. In such cases, it is unclear whether the tokens from the first round will be constrained by the tags. @ZhanqiuHu


ZhanqiuHu commented Mar 23, 2026

Thanks for the comments! I'm a bit tied up with other work right now, will take a look when I get a chance. Meanwhile, feel free to edit this PR or open a new one!

cc @yzong-rh in case you'd like to take a look or follow up on this. Thanks!

saifmb0 added a commit to saifmb0/vllm that referenced this pull request Mar 28, 2026
…ewline in tool call ID (vllm-project#38441)

The model occasionally emits a stray \n between <|tool_call_begin|>
and the function name, e.g.:

    <|tool_call_begin|>
    functions.edit:15<|tool_call_argument_begin|>{...}

Because Python regex does not match \n with . by default, both
stream_tool_call_portion_regex and stream_tool_call_name_regex
silently failed to match, causing the entire tool call to be dropped
during streaming.

Fix:
- Add a leading \s* to both streaming regexes so any leading
  whitespace/newlines before the tool_call_id are consumed.
- Compile both regexes with re.DOTALL so . inside the capture group
  spans newlines.

This is distinct from PR vllm-project#37384 which only adds re.DOTALL (without
leading \s*) to the portion regex and does not fix stream_tool_call_name_regex.

Tests added:
- test_stream_tool_call_portion_regex_handles_leading_newline: unit
  test that both regexes match inputs with a leading \n.
- test_streaming_tool_call_with_newline_after_begin_token: end-to-end
  streaming simulation reproducing the exact scenario in the issue.

Why this is not a duplicate: checked open PRs vllm-project#37384, vllm-project#37445, vllm-project#32504,
vllm-project#24847, vllm-project#26918, vllm-project#36891. None add the leading \s* prefix to handle
whitespace/newlines preceding the tool_call_id capture group, and none
fix stream_tool_call_name_regex with re.DOTALL.

Co-authored-by: GitHub Copilot
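The fix described in this commit can be illustrated with simplified stand-in regexes; these are not the parser's exact patterns, just a minimal demonstration of why the leading \s* and re.DOTALL both matter:

```python
import re

# Simplified stand-ins for the streaming regexes described above.
# OLD: no tolerance for whitespace after the begin token, "." stops at \n.
OLD = re.compile(
    r"<\|tool_call_begin\|>(?P<id>[\w.]+:\d+)<\|tool_call_argument_begin\|>(?P<args>.*)"
)
# NEW: \s* consumes a stray newline before the tool call ID, and
# re.DOTALL lets the args capture group span newlines.
NEW = re.compile(
    r"<\|tool_call_begin\|>\s*(?P<id>[\w.]+:\d+)<\|tool_call_argument_begin\|>(?P<args>.*)",
    re.DOTALL,
)

# Reproduces the failure: a stray \n between the begin token and the name.
text = '<|tool_call_begin|>\nfunctions.edit:15<|tool_call_argument_begin|>{"a":\n1}'

assert OLD.search(text) is None  # old pattern silently drops the call
match = NEW.search(text)         # new pattern recovers it
```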