
[Tool Parser] Kimi K2: guided decoding for tool_choice="auto" — 75% → 100% schema accuracy #36891

Open

ZhanqiuHu wants to merge 2 commits into vllm-project:main from ZhanqiuHu:kimi-k2-guided-tool-choice-auto

Conversation


@ZhanqiuHu ZhanqiuHu commented Mar 12, 2026

Co-authored with @yzong-rh

Purpose

The Kimi K2 tool parser currently relies on post-hoc parsing for tool_choice="auto" — the model generates freely and vLLM extracts tool calls afterward. This works most of the time, but the model can hallucinate tool names not in the user's schema (e.g., calling img_gen when only search is available), causing schema validation failures.

This PR adds generation-time enforcement via xgrammar's structural tag mechanism, ensuring that once the model decides to make a tool call, it can only produce tool names and arguments that conform to the provided schema. This is the first tool parser in vLLM to use guided decoding for tool_choice="auto".

For background on Kimi K2 tool calling on vLLM, see: Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM.

Key benefits:

  • 100% schema accuracy on the K2-Vendor-Verifier benchmark (up from 75.4%), eliminating all tool name hallucination
  • Zero overhead for non-tool-call tokens — the grammar only activates after the <|tool_call_begin|> trigger, so free-text generation is unconstrained
  • Composable with existing behavior: tool_choice="required" and forced function calls still use the base class JSON schema path; this only fills the gap for "auto"
  • Generalizable pattern — the same TriggeredTagsFormat approach can be applied to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination issues

Summary

  • This is the first tool parser in vLLM to apply guided decoding for tool_choice="auto", and the approach generalizes to other parsers
  • Add xgrammar structural tag guided decoding to the Kimi K2 tool parser when tool_choice is "auto" or unset
  • Eliminates tool name hallucination (e.g., model calling img_gen when only search/urls_fetch_tool are available) by constraining generation at the token level
  • No change to tool_choice="required" or forced function behavior (handled by base class)

Approach

Override adjust_request() in KimiK2ToolParser to build a TriggeredTagsFormat structural tag from the request's tool definitions:

  • Trigger: <|tool_call_begin|> — free text allowed until this token
  • Per-tool tag: <|tool_call_begin|>{name}:\d+<|tool_call_argument_begin|>{json}<|tool_call_end|>
  • Composable content: sequence of regex (call ID) + const_string (argument marker) + json_schema (parameters)
  • Supports multiple tool calls per response (stop_after_first=False)
  • Respects existing structured_outputs if already set (e.g., by tool_choice="required")
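The bullets above can be sketched as a helper that assembles the structural-tag JSON. The function name and the exact field layout (triggered_tags, tag, sequence, and their keys) are assumptions inferred from this description, not the actual vLLM or xgrammar code:

```python
import json
from typing import Any, Optional

def build_structural_tag(tools: list[dict[str, Any]]) -> Optional[str]:
    """Hypothetical sketch of the structural tag described above.

    One tag per tool; free text is allowed until the trigger token,
    after which generation is constrained to a valid tool call.
    """
    tags = []
    for tool in tools:
        fn = tool.get("function", {})
        name = fn.get("name")
        if not name:
            continue
        tags.append({
            "type": "tag",
            # Tool name (with the functions. prefix) follows the trigger token.
            "begin": f"<|tool_call_begin|>functions.{name}",
            "content": {
                "type": "sequence",
                "elements": [
                    # Call ID suffix, e.g. ":0" in functions.get_weather:0
                    {"type": "regex", "pattern": r":\d+"},
                    {"type": "const_string", "value": "<|tool_call_argument_begin|>"},
                    {"type": "json_schema", "json_schema": fn.get("parameters") or {}},
                ],
            },
            "end": "<|tool_call_end|>",
        })
    if not tags:
        return None
    return json.dumps({
        "type": "triggered_tags",
        "triggers": ["<|tool_call_begin|>"],
        "tags": tags,
        "stop_after_first": False,  # allow multiple tool calls per response
    })
```

The resulting JSON string would be passed to StructuredOutputsParams(structural_tag=...) as described in the PR.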

Evaluation

K2-Vendor-Verifier benchmark, 2000 samples, moonshotai/Kimi-K2-Instruct-0905 (revision 94a4053eb8863059dd8afc00937f054e1365abbd):

| | Tool Calls | Schema Errors | Accuracy |
| --- | --- | --- | --- |
| Baseline (no guided decoding) | 678 | 167 | 75.4% |
| This PR | 677 | 0 | 100% |

The dominant failure mode in the baseline was tool name hallucination — the model generating calls to tools not in the provided schema (e.g., img_gen). With structural tag enforcement, the grammar only allows tokens that match valid tool names after the <|tool_call_begin|> trigger.

Reproduction:

# Server
# Note: changing the pinned revision might result in a regression (still verifying)
vllm serve moonshotai/Kimi-K2-Instruct-0905 \
  --revision 94a4053eb8863059dd8afc00937f054e1365abbd \
  --tensor-parallel-size 8 --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser kimi_k2

# Eval (using K2-Vendor-Verifier)
python tool_calls_eval.py downloads/tool-calls/samples.jsonl \
  --model moonshotai/Kimi-K2-Instruct-0905 \
  --base-url http://localhost:8000/v1 --api-key dummy \
  --concurrency 8 --temperature 0.6 --max-tokens 64000 \
  --output results.jsonl --summary summary.json

Caveats/Limitations

  • Performance not benchmarked — throughput/latency overhead of structural tag guided decoding has not been measured. The grammar only constrains tokens inside tool calls (not free text), so overhead should be minimal, but this needs validation.

Future work

  • Integrate per-function strict parameter for argument schema guidance (add strict to FunctionDefinition in vLLM's protocol layer first).
  • Generalize this approach to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination in tool_choice="auto"
  • Validate tool_choice='required' path.
  • Benchmark throughput/latency overhead of structural tag guided decoding vs. unconstrained generation

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces guided decoding for the Kimi K2 tool parser when tool_choice is 'auto', significantly improving schema accuracy by preventing tool name hallucination. The implementation uses xgrammar's structural tags to constrain generation. The changes are well-targeted for the 'auto' use case. However, I've identified a critical issue where tool_choice='required' is likely non-functional due to an incompatibility between the base class's guidance mechanism and this parser's expectation of special tokens. I've left a comment with details on the issue and a suggestion for a fix.

Comment on lines +109 to +127
def adjust_request(
    self, request: ChatCompletionRequest
) -> ChatCompletionRequest:
    request = super().adjust_request(request)

    if request.structured_outputs is not None:
        return request

    if request.tools and request.tool_choice in ("auto", None):
        tag_json = self._build_structural_tag(request.tools)
        if tag_json is not None:
            request.structured_outputs = StructuredOutputsParams(
                structural_tag=tag_json
            )

    if request.tools and request.tool_choice != "none":
        request.skip_special_tokens = False

    return request
Contributor


critical

While this implementation correctly handles tool_choice='auto', it appears that tool_choice='required' may be broken. For required mode, super().adjust_request() is called, which sets a plain JSON schema constraint. This causes the model to generate raw JSON, without the special tokens (<|tool_call_begin|>, etc.) that this parser's extract_tool_calls method expects. The early return if request.structured_outputs is not None: prevents the new structural tag logic from being applied.

This likely results in required tool calls failing to be parsed. To fix this, you could handle required mode here using structural tags, similar to how you've handled auto. This would involve modifying _build_structural_tag to set "at_least_one": True when tool_choice is "required".

Contributor Author


This PR should not touch the original path, but I haven't verified the tool_choice="required" path. I will note that in the description as future work.

Contributor


I agree that there's no need to handle this as part of this PR. The general problem Gemini is pointing out is that for tool_choice='required' we always guide the model to produce JSON as opposed to producing the model-specific tool calling format. That's a more general problem we need to solve, larger in scope than just this change.


mergify bot commented Mar 12, 2026

Hi @ZhanqiuHu, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@yzong-rh

Great work! Glad you made structured outputs work.
cc @sfeng33


@bbrowning bbrowning left a comment


One note about handling tool call names where we may be guiding in a way that conflicts with the model's training, but otherwise this looks like a great overall improvement to guiding tool call output in auto mode. I'd like to get a few of these merged and in the wild so we can get real-world feedback on applying this type of guiding in auto mode across the board.

}
tags.append({
    "type": "tag",
    "begin": f"<|tool_call_begin|>{name}",
Contributor


Don't we need to also handle functions.{name} here? From https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/docs/tool_call_guidance.md - "The tool ID and arguments are separated by <|tool_call_argument_begin|>. The format of the tool ID is functions.{func_name}:{idx}, from which we can parse the function name."

And in the example tool call parsing code given there:

        # function_id: functions.get_weather:0
        function_name = function_id.split('.')[1].split(':')[0]

You said this passes the K2 vendor verifier at 100%, so perhaps this doesn't matter. But, if we're forcing the model to omit the functions. prefix and it was trained to use that, then it would be better to follow exactly how the model was trained to output to minimize the overall impact of guiding.
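The parsing snippet from the model card can be rounded out into a small self-contained helper (parse_function_id is a hypothetical name, and the IDs below are illustrative):

```python
def parse_function_id(function_id: str) -> tuple[str, int]:
    """Split a Kimi K2 tool call ID of the form functions.{func_name}:{idx}."""
    prefixed_name, idx = function_id.rsplit(":", 1)
    # Strip the leading "functions." namespace if present; this is why a
    # missing prefix went unnoticed during parsing.
    name = prefixed_name.split(".", 1)[1] if "." in prefixed_name else prefixed_name
    return name, int(idx)
```

Because the fallback branch accepts an un-prefixed name, guiding the model to emit search:0 instead of functions.search:0 would still parse, which matches the observation that the verifier passed anyway.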

Contributor Author


Good catch! It should include functions.{name}. I think the parsing code doesn't enforce the functions. prefix, so I didn't run into any issue. I will update the code and rerun.


@ZhanqiuHu ZhanqiuHu Mar 13, 2026


Hey @bbrowning, got the full 2000-sample eval results back on latest main (rebased) with the functions. prefix added. With the structural tag guided decoding enabled:

  • 668 tool calls, all 668 valid — 100% schema accuracy (up from 75.4% baseline)

Full results (summary + per-request JSONL): https://gist.github.com/ZhanqiuHu/63bf52dc445dee053e3ea9602bbda60e


@sfeng33 sfeng33 left a comment


Note - the structural tag is not supported in all guided decoding backends; it works now because it's using the default option, xgrammar. In other cases, will it error out? If so, we should have a good way to handle it.

@ZhanqiuHu force-pushed the kimi-k2-guided-tool-choice-auto branch from 7255371 to ae2885b on March 12, 2026 at 18:31

@bbrowning

If we were following the Chat Completions and Responses APIs exactly, we'd only enable guided decoding for the json schema of a given function when its strict=True parameter is set in the function definition of the tools field in the Chat Completion / Responses request. I'm not sure if we want to handle that nuance here or not. I think that would mean always guiding the function name, but conditionally guiding the function call parameters only when strict is set to true.

This will be important if we want to consider doing this more broadly, but it's nothing I'd consider a blocker to merging this PR. This just felt like the right audience to raise this awareness, as the default in these APIs is to let the user control whether we use structured outputs for function call generation.

@yzong-rh

Some details of how to evaluate vLLM with Kimi K2 Vendor.
run.md in the fork contains how to get the baselines with vLLM v.0.17.1 and the most recent Kimi-K2-Instruct-0905.


@ZhanqiuHu

> If we were following the Chat Completions and Responses APIs exactly, we'd only enable guided decoding for the json schema of a given function when its strict=True parameter is set in the function definition of the tools field in the Chat Completion / Responses request. I'm not sure if we want to handle that nuance here or not. I think that would mean always guiding the function name, but conditionally guiding the function call parameters only when strict is set to true.
>
> This will be important if we want to consider doing this more broadly, but nothing I'd consider a blocker to merge this PR. This just felt like the audience to raise this awareness, as the defaults in these APIs is to let the user control whether we use structured outputs or not for function call generation.

@bbrowning Good point! Noted in the PR description.

I looked into the relevant code. It seems the OpenAI Python client's FunctionDefinition (source) includes strict:

class FunctionDefinition(BaseModel):
    name: str
    description: Optional[str] = None
    parameters: Optional[FunctionParameters] = None
    strict: Optional[bool] = None
    """Whether to enable strict schema adherence when generating the function call."""

While vLLM's FunctionDefinition in vllm/entrypoints/openai/engine/protocol.py currently omits it:

class FunctionDefinition(OpenAIBaseModel):
    name: str
    description: str | None = None
    parameters: dict[str, Any] | None = None

Later we will probably want to update protocol.py as well and add the control flow with well-defined default behavior.
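A rough sketch of what that control flow could look like, assuming a strict field is added to vLLM's protocol layer. FunctionDefinitionSketch and argument_element are hypothetical names, and the free-form fallback schema is an assumption about the default behavior, not verified against vLLM:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FunctionDefinitionSketch:
    """Hypothetical extension of vLLM's FunctionDefinition with strict."""
    name: str
    description: Optional[str] = None
    parameters: Optional[dict[str, Any]] = None
    strict: Optional[bool] = None

def argument_element(fn: FunctionDefinitionSketch) -> dict[str, Any]:
    """Pick the grammar element for a function's arguments.

    The function name would always be guided; arguments are
    schema-constrained only when strict=True, matching the API default
    of leaving structured outputs under user control.
    """
    if fn.strict and fn.parameters is not None:
        return {"type": "json_schema", "json_schema": fn.parameters}
    # Hypothetical fallback: accept any JSON object, unconstrained.
    return {"type": "json_schema", "json_schema": {"type": "object"}}
```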


Enable xgrammar structural tag enforcement for the Kimi K2 tool parser
when tool_choice is "auto" or unset. This prevents tool name
hallucination (e.g., model calling img_gen when only search/urls_fetch_tool
are available) by constraining generation to only produce valid tool names
and schema-compliant arguments.

The structural tag uses xgrammar's TriggeredTagsFormat:
- Free text until <|tool_call_begin|> trigger
- Then constrained to: {tool_name}:\d+<|tool_call_argument_begin|>{json}<|tool_call_end|>
- One tag per tool in the request, with JSON schema from tool parameters
- Supports multiple tool calls per response

Evaluation on K2-Vendor-Verifier (2000 samples):
- Baseline: 167/678 schema errors (75.4% tool call accuracy)
- With this change: 0/677 schema errors (100% tool call accuracy)

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>

gaby commented Mar 16, 2026

@ZhanqiuHu What about the hardcoded 1024 and 8192 bytes in the parser? When used with Claude, we get warnings about the 1024-byte buffer all the time.


ehfd commented Mar 17, 2026

Does this fix #33654?

@chaunceyjiang chaunceyjiang self-assigned this Mar 17, 2026
"begin": f"<|tool_call_begin|>functions.{name}",
"content": {
    "type": "sequence",
    "elements": [
Collaborator


Thanks for this PR, @ZhanqiuHu, regarding the tool calling implementation.

structural_tag has been on the roadmap for quite some time. The main reason we haven't started working on it yet is that it currently only works well within xgrammar. Additionally, tool formats vary significantly across different models, which has slowed down progress.

Could you elaborate on your rationale for using regex, const_string, and json_schema simultaneously in this implementation?


Csrayz commented Mar 23, 2026

We also need to consider speculative decoding scenarios, especially those involving the simultaneous generation of multiple tokens. In such cases, it is unclear whether the tokens from the first round will be constrained by the tags. @ZhanqiuHu


ZhanqiuHu commented Mar 23, 2026

Thanks for the comments! I'm a bit tied up with other work right now, will take a look when I get a chance. Meanwhile, feel free to edit this PR or open a new one!

cc @yzong-rh in case you'd like to take a look or follow up on this. Thanks!

saifmb0 added a commit to saifmb0/vllm that referenced this pull request Mar 28, 2026
…ewline in tool call ID (vllm-project#38441)

The model occasionally emits a stray \n between <|tool_call_begin|>
and the function name, e.g.:

    <|tool_call_begin|>
    functions.edit:15<|tool_call_argument_begin|>{...}

Because Python regex does not match \n with . by default, both
stream_tool_call_portion_regex and stream_tool_call_name_regex
silently failed to match, causing the entire tool call to be dropped
during streaming.

Fix:
- Add a leading \s* to both streaming regexes so any leading
  whitespace/newlines before the tool_call_id are consumed.
- Compile both regexes with re.DOTALL so . inside the capture group
  spans newlines.

This is distinct from PR vllm-project#37384 which only adds re.DOTALL (without
leading \s*) to the portion regex and does not fix stream_tool_call_name_regex.

Tests added:
- test_stream_tool_call_portion_regex_handles_leading_newline: unit
  test that both regexes match inputs with a leading \n.
- test_streaming_tool_call_with_newline_after_begin_token: end-to-end
  streaming simulation reproducing the exact scenario in the issue.

Why this is not a duplicate: checked open PRs vllm-project#37384, vllm-project#37445, vllm-project#32504,
vllm-project#24847, vllm-project#26918, vllm-project#36891. None add the leading \s* prefix to handle
whitespace/newlines preceding the tool_call_id capture group, and none
fix stream_tool_call_name_regex with re.DOTALL.

Co-authored-by: GitHub Copilot
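The fix described in this commit can be illustrated with simplified stand-in regexes; these are not the parser's exact patterns, just a minimal demonstration of why the leading \s* and re.DOTALL both matter:

```python
import re

# Simplified stand-ins for the streaming regexes described above.
# OLD: no tolerance for whitespace after the begin token, "." stops at \n.
OLD = re.compile(
    r"<\|tool_call_begin\|>(?P<id>[\w.]+:\d+)<\|tool_call_argument_begin\|>(?P<args>.*)"
)
# NEW: \s* consumes a stray newline before the tool call ID, and
# re.DOTALL lets the args capture group span newlines.
NEW = re.compile(
    r"<\|tool_call_begin\|>\s*(?P<id>[\w.]+:\d+)<\|tool_call_argument_begin\|>(?P<args>.*)",
    re.DOTALL,
)

# Reproduces the failure: a stray \n between the begin token and the name.
text = '<|tool_call_begin|>\nfunctions.edit:15<|tool_call_argument_begin|>{"a":\n1}'

assert OLD.search(text) is None  # old pattern silently drops the call
match = NEW.search(text)         # new pattern recovers it
```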