
[Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode#28543

Merged
chaunceyjiang merged 1 commit into vllm-project:main from jscaldwell55:fix/kimi-k2-tool-parser-token-leak
Nov 17, 2025

Conversation


@jscaldwell55 jscaldwell55 commented Nov 12, 2025

Summary

This PR fixes a bug in streaming output routing (MoonshotAI/Kimi-K2#89) where special tokens and intermediate text can leak into the reasoning_delta field during streaming mode (stream=True) before the tool section is fully detected.

As identified in the issue and comment, agent frameworks like LangChain and AutoGPT fail when internal markers (e.g., <|tool_calls_section_begin|>) appear in reasoning_delta. This fix ensures reasoning_delta contains only natural-language reasoning text, aligning with the expected format from the Kimi K2 paper (Appendix B) and downstream SDKs.


The Problem

The parser's extract_tool_calls_streaming method lacked section-level state management, causing text between <|tool_calls_section_begin|> and <|tool_call_begin|> to emit as reasoning content when it should be suppressed.

Specific Leak Scenario

When the model streams output like:

"Reasoning... <|tool_calls_section_begin|> spurious text <|tool_call_begin|>..."

The deltas are processed as:

  1. "Reasoning... " → DeltaMessage(content="Reasoning... ") (correct)
  2. "<|tool_calls_section_begin|>" → stripped, no output (correct)
  3. " spurious text " → DeltaMessage(content=" spurious text ") (LEAKED!)
  4. "<|tool_call_begin|>..." → starts tool call parsing (correct)
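The failed routing above can be reproduced with a minimal, self-contained sketch; the function name is ours, not the parser's actual API:

```python
def old_check_leaks(current_text: str) -> bool:
    """Sketch of the old routing: emit the delta as reasoning content
    whenever the tool-call begin/end counts are balanced."""
    starts = current_text.count("<|tool_call_begin|>")
    ends = current_text.count("<|tool_call_end|>")
    return starts == ends

# Between the section marker and the first tool call, both counts are
# still 0, so the balanced check routes " spurious text " to content.
assert old_check_leaks("Reasoning... <|tool_calls_section_begin|> spurious text ")
```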

Prior Behavior

The original condition at line 156-162 checked if tool call counts were balanced:

if cur_tool_start_count == cur_tool_end_count:  # Both are 0!
    return DeltaMessage(content=delta_text)  # ← LEAKS TEXT

This incorrectly assumed "balanced counts = reasoning mode", failing to account for being inside the tool section but before any tool call begins.


Solution

Core Changes

1. Added Section-Level State Machine

  • in_tool_section: bool flag to track if we're between <|tool_calls_section_begin|> and <|tool_calls_section_end|>
  • Explicit state transitions: REASONING ↔ TOOL_SECTION ↔ TOOL_CALL_ACTIVE

2. Implemented Rolling Buffer for Split Markers

  • token_buffer: str accumulates deltas to detect markers split across chunks
  • Example: "<|tool_calls_sec" + "tion_begin|>" → correctly detected
  • Buffer size capped at 1024 bytes with overflow protection (an empirical worst case: roughly twice the longest marker at ~30 chars, plus a safety margin for multi-byte unicode and partial marker overlap)
  • Added _buffer_overflow_logged flag to log overflow warning only once
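A sketch of the rolling-buffer idea (function name, return shape, and the exact stripping strategy are ours): accumulate deltas, scan the accumulated text for complete markers, and keep only a bounded tail so a marker split across chunks is still detected:

```python
MARKERS = ("<|tool_calls_section_begin|>", "<|tool_call_begin|>")
BUFFER_MAX = 1024  # cap from the PR description

def feed(buffer: str, delta: str):
    """Append a delta; return (markers found, remaining buffer)."""
    buffer += delta
    found = [m for m in MARKERS if m in buffer]
    for m in found:
        buffer = buffer.replace(m, "")
    if len(buffer) > BUFFER_MAX:  # overflow protection
        buffer = buffer[-BUFFER_MAX:]
    return found, buffer

# A marker split across two chunks is only detected once both arrive.
found, buf = feed("", "<|tool_calls_sec")
assert found == []
found, buf = feed(buf, "tion_begin|>")
assert found == ["<|tool_calls_section_begin|>"]
```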

3. Content Suppression Logic

if self.in_tool_section and cur_tool_start_count == 0:
    logger.debug("In tool section but no tool calls started yet. Suppressing: %s", delta_text)
    return DeltaMessage(content="")  # Suppresses leak while preserving return type

4. Marker Variant Support

  • Supports both <|tool_calls_section_begin|> (plural) and <|tool_call_section_begin|> (singular)
  • Handles potential format variations from model output
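One way to accept both variants is a single regex with an optional `s`; the pattern name is illustrative:

```python
import re

# Matches both <|tool_calls_section_begin|> and <|tool_call_section_begin|>.
SECTION_BEGIN = re.compile(r"<\|tool_calls?_section_begin\|>")

assert SECTION_BEGIN.search("<|tool_calls_section_begin|>")
assert SECTION_BEGIN.search("<|tool_call_section_begin|>")
assert not SECTION_BEGIN.search("<|tool_call_begin|>")
```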

5. Error Recovery

  • Tracks section_char_count to detect malformed tool sections
  • Force-exits tool section if it exceeds 8192 chars without proper structure
  • Prevents indefinite content suppression from malformed output
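The recovery check reduces to a running counter against a cap; a sketch using the 8192-char limit stated above (names are ours):

```python
MAX_SECTION_CHARS = 8192  # limit from the PR description

def track_section(section_char_count: int, delta_text: str):
    """Accumulate characters seen inside the tool section and signal a
    forced exit once the cap is exceeded, so suppression cannot run
    indefinitely on malformed output."""
    section_char_count += len(delta_text)
    return section_char_count, section_char_count > MAX_SECTION_CHARS

count, force_exit = track_section(0, "x" * 100)
assert not force_exit
count, force_exit = track_section(count, "y" * 8192)
assert force_exit  # malformed section exceeded the cap
```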

6. State Reset Mechanism

  • Added reset_streaming_state() public method
  • Clears all state between requests to prevent leakage when parser is reused

7. Function Contract Preservation

  • Changed suppression logic to return DeltaMessage(content="") instead of None
  • Maintains consistent return type for downstream iterator patterns
  • Prevents potential breaking changes for consumers expecting DeltaMessage

Files Changed

Modified

vllm/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py

  • Added section-level state variables (in_tool_section, token_buffer, section_char_count, buffer_max_size, max_section_chars, _buffer_overflow_logged)
  • Implemented _check_and_strip_markers() helper for buffer processing
  • Added _reset_section_state() and reset_streaming_state() methods
  • Updated extract_tool_calls_streaming() with:
    • Buffer management (lines 187-266)
    • Section state transitions (lines 222-236)
    • Content suppression checks (lines 254-260, 343-346)
    • Error recovery (lines 255-266)

tests/tool_use/test_kimi_k2_tool_parser.py

  • Added 9 new test cases covering:
    • test_token_leak_between_section_and_tool_begin() - Main bug: leak prevention
    • test_split_markers_across_deltas() - Buffer functionality
    • test_marker_variants() - Singular/plural support
    • test_reentry_to_reasoning_after_tool_section() - State transitions
    • test_empty_tool_section() - Edge case
    • test_malformed_tool_section_recovery() - Error recovery
    • test_state_reset() - State management
    • test_section_begin_noise_tool_begin_same_chunk() - Same-chunk suppression
    • test_stream_ends_without_section_end_marker() - EOF handling

Testing

  • Added 9 new unit tests in tests/tool_use/test_kimi_k2_tool_parser.py (pytest-based)
  • Ran targeted tests locally: pytest -s -v tests/tool_use/test_kimi_k2_tool_parser.py – all pass
  • Ran broader suite: pytest -s -v tests/tool_use/ – no regressions
  • Verified on CPU; awaiting CI for GPU validation (as not all tests pass locally on CPU per guidelines)
  • Tests cover main leak bug, split markers, variants, state transitions, edge cases (empty/malformed), same-chunk suppression, EOF handling, and error recovery

Scope

  • Affects: Only KimiK2ToolParser in streaming mode
  • No impact: Other tool parsers, non-streaming mode, non-K2 models


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the token leakage bug by introducing a state machine and a buffer to manage streaming output. The changes are well-structured and accompanied by a comprehensive new test suite. However, I've identified a critical issue in the new state transition logic. It fails to handle cases where a tool section both begins and ends within the same data chunk, which can leave the parser in a corrupted state. I've provided a comment with a suggested fix for this logic and recommended adding a new test case to cover this scenario. Addressing this is crucial to ensure the fix is fully robust.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@jscaldwell55 jscaldwell55 force-pushed the fix/kimi-k2-tool-parser-token-leak branch from d193e17 to 0427a6c on November 12, 2025 10:28

mergify bot commented Nov 12, 2025

Documentation preview: https://vllm--28543.org.readthedocs.build/en/28543/

@mergify mergify bot added the documentation Improvements or additions to documentation label Nov 12, 2025
@jscaldwell55 jscaldwell55 force-pushed the fix/kimi-k2-tool-parser-token-leak branch from 0427a6c to e188c9b on November 12, 2025 10:30
- Add section-level state machine (in_tool_section flag)
- Implement rolling buffer for split marker detection (1KB cap)
- Suppress content between section_begin and tool_call_begin
- Support marker variants (plural/singular)
- Add error recovery for malformed sections (8KB limit)
- Preserve function contract (always return DeltaMessage)
- Fix critical bug #1: Handle both begin/end markers in same chunk
  (Changed elif to if on line 237 to prevent state corruption)
- Fix critical bug #2: Defer section exit when tool_call_end present
  (Prevents dropping final tool arguments and token leakage)
- Include 12 comprehensive tests (3 new tests for edge cases)

Fixes bug where text between <|tool_calls_section_begin|> and
<|tool_call_begin|> leaks into reasoning_delta during streaming mode.

Also fixes two critical edge cases:
1. Section begin and end markers appearing in same chunk would leave
   parser stuck in in_tool_section=True, causing subsequent content
   to be incorrectly suppressed.
2. Tool_call_end and section_end in same chunk would cause early
   return before tool parsing, dropping final tool arguments and
   leaking special tokens into reasoning channel.

Signed-off-by: Jscaldwell55 <jay.s.caldwell@gmail.com>
@jscaldwell55 jscaldwell55 force-pushed the fix/kimi-k2-tool-parser-token-leak branch from e188c9b to c3a801a on November 12, 2025 10:36
@chaunceyjiang
Collaborator

@jscaldwell55 Thanks~. LGTM.

Could you help test #24847 again?

@chaunceyjiang chaunceyjiang self-assigned this Nov 13, 2025

jscaldwell55 commented Nov 13, 2025

@chaunceyjiang

Tests were successful.

Run on Python 3.12.7:

pytest tests/tool_use/test_kimi_k2_tool_parser.py -v

Result: All 12 tests passed ✅
- All existing tests continue to pass
- 4 new tests for concatenated tool calls work correctly
- No regressions detected

✅ Compatibility Testing with PR #28543

Created a temporary branch merging both PRs to test for conflicts/interactions:
- PR #24847 Fixes concatenated tool calls regex
- PR #28543 Prevents token leakage in streaming mode

Result: All 24 combined tests passed ✅

As far as I can tell, both PRs can be safely merged.

@chaunceyjiang
Collaborator

@jscaldwell55 Thanks~

@chaunceyjiang chaunceyjiang enabled auto-merge (squash) November 17, 2025 05:54

@chaunceyjiang chaunceyjiang left a comment


Thanks~

@chaunceyjiang chaunceyjiang merged commit 6f37419 into vllm-project:main Nov 17, 2025
47 checks passed
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…eaming mode (vllm-project#28543)

Signed-off-by: Jscaldwell55 <jay.s.caldwell@gmail.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
@regismesquita

Seems like it is also happening on Kimi K2.5

@jscaldwell55
Contributor Author

@regismesquita Interesting, thanks for the heads-up. Will take a look.

@jscaldwell55
Contributor Author

Looks like they fixed it on the Kimi end; it might be caused by an issue somewhere else in the pipeline.

jscaldwell55 added a commit to jscaldwell55/vllm that referenced this pull request Mar 18, 2026
…y streaming state

The original streaming fix (PR vllm-project#28543) introduced a hard-coded 8KB
section limit that truncates large tool call arguments, breaking coding
use cases with Kimi-K2 and K2.5 models. This rewrite addresses the
regression while preserving all existing behavior.

Changes:
- Replace hard-coded 8KB limit with configurable 512KB default via
  VLLM_KIMI_TOOL_PARSER_MAX_SECTION_CHARS environment variable
- Consolidate 6 scattered instance variables into _StreamState dataclass
- Replace 7 copy-pasted deferred section exit checks with single
  try/finally cleanup
- Reduce rolling buffer from 1KB to 256 bytes (longest marker is 28
  chars)
- Add regression tests for large arguments, configurable limits,
  multi-turn reentry, and thinking+tools interleaving

Signed-off-by: Jay Caldwell <jay.s.caldwell@gmail.com>
erkintelnyx pushed a commit to erkintelnyx/vllm that referenced this pull request Mar 18, 2026
erkintelnyx pushed a commit to erkintelnyx/vllm that referenced this pull request Mar 19, 2026

Labels

documentation (Improvements or additions to documentation) · frontend · ready (ONLY add when PR is ready to merge/full CI is needed) · tool-calling

Projects

Status: Done


3 participants