[Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode#28543
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request effectively addresses the token leakage bug by introducing a state machine and a buffer to manage streaming output. The changes are well-structured and accompanied by a comprehensive new test suite. However, I've identified a critical issue in the new state transition logic. It fails to handle cases where a tool section both begins and ends within the same data chunk, which can leave the parser in a corrupted state. I've provided a comment with a suggested fix for this logic and recommended adding a new test case to cover this scenario. Addressing this is crucial to ensure the fix is fully robust.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run a small and essential subset of CI tests. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Force-pushed d193e17 to 0427a6c
Documentation preview: https://vllm--28543.org.readthedocs.build/en/28543/
Force-pushed 0427a6c to e188c9b
- Add section-level state machine (`in_tool_section` flag)
- Implement rolling buffer for split marker detection (1KB cap)
- Suppress content between section_begin and tool_call_begin
- Support marker variants (plural/singular)
- Add error recovery for malformed sections (8KB limit)
- Preserve function contract (always return DeltaMessage)
- Fix critical bug vllm-project#1: Handle both begin/end markers in same chunk (changed `elif` to `if` on line 237 to prevent state corruption)
- Fix critical bug vllm-project#2: Defer section exit when tool_call_end present (prevents dropping final tool arguments and token leakage)
- Include 12 comprehensive tests (3 new tests for edge cases)

Fixes bug where text between <|tool_calls_section_begin|> and <|tool_call_begin|> leaks into reasoning_delta during streaming mode. Also fixes two critical edge cases:

1. Section begin and end markers appearing in the same chunk would leave the parser stuck in in_tool_section=True, causing subsequent content to be incorrectly suppressed.
2. tool_call_end and section_end in the same chunk would cause an early return before tool parsing, dropping final tool arguments and leaking special tokens into the reasoning channel.

Signed-off-by: Jscaldwell55 <jay.s.caldwell@gmail.com>
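The `elif` → `if` fix for critical bug #1 can be illustrated with a stripped-down sketch (hypothetical names, not the actual parser code): with `elif`, a chunk carrying both markers would enter the tool section and never leave it.

```python
# Hypothetical sketch (not the actual parser code) of why `elif` -> `if`
# matters: with `elif`, a chunk carrying both the section-begin and
# section-end markers would enter the tool section and never leave it.
SECTION_BEGIN = "<|tool_calls_section_begin|>"
SECTION_END = "<|tool_calls_section_end|>"

def update_state(in_tool_section: bool, chunk: str) -> bool:
    """Return the new in_tool_section flag after processing one chunk."""
    if SECTION_BEGIN in chunk:
        in_tool_section = True
    # Must be `if`, not `elif`: both markers can arrive in the same chunk.
    if SECTION_END in chunk:
        in_tool_section = False
    return in_tool_section
```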
Force-pushed e188c9b to c3a801a
@jscaldwell55 Thanks~. LGTM. Could you help test #24847 again?
Tests were successful. Run on Python 3.12.7: `pytest tests/tool_use/test_kimi_k2_tool_parser.py -v`
Result: All 12 tests passed ✅
- All existing tests continue to pass
- 4 new tests for concatenated tool calls work correctly
- No regressions detected
✅ Compatibility Testing with PR #28543
Created a temporary branch merging both PRs to test for conflicts/interactions:
- PR #24847 Fixes concatenated tool calls regex
- PR #28543 Prevents token leakage in streaming mode
Result: All 24 combined tests passed ✅
As far as I can tell, both PRs can be safely merged.
@jscaldwell55 Thanks~
…eaming mode (vllm-project#28543) Signed-off-by: Jscaldwell55 <jay.s.caldwell@gmail.com>
Seems like it is also happening on Kimi K2.5 |
@regismesquita Interesting, thanks for the heads-up. Will take a look
Looks like they fixed it on the Kimi end; it might be caused by an issue somewhere else in the pipeline
…y streaming state

The original streaming fix (PR vllm-project#28543) introduced a hard-coded 8KB section limit that truncates large tool call arguments, breaking coding use cases with Kimi-K2 and K2.5 models. This rewrite addresses the regression while preserving all existing behavior.

Changes:
- Replace hard-coded 8KB limit with configurable 512KB default via VLLM_KIMI_TOOL_PARSER_MAX_SECTION_CHARS environment variable
- Consolidate 6 scattered instance variables into _StreamState dataclass
- Replace 7 copy-pasted deferred section exit checks with single try/finally cleanup
- Reduce rolling buffer from 1KB to 256 bytes (longest marker is 28 chars)
- Add regression tests for large arguments, configurable limits, multi-turn reentry, and thinking+tools interleaving

Signed-off-by: Jay Caldwell <jay.s.caldwell@gmail.com>
Summary
This PR fixes a bug in the streaming output routing for MoonshotAI/Kimi-K2#89, where special tokens and intermediate text can leak into the `reasoning_delta` field during streaming mode (`stream=True`) before the tool section is fully detected.

As identified in the issue and comments, agent frameworks like LangChain and AutoGPT fail when internal markers (e.g., `<|tool_calls_section_begin|>`) appear in `reasoning_delta`. This fix ensures `reasoning_delta` contains only natural-language reasoning text, aligning with the expected format from the Kimi K2 paper (Appendix B) and downstream SDKs.

The Problem
The parser's `extract_tool_calls_streaming` method lacked section-level state management, causing text between `<|tool_calls_section_begin|>` and `<|tool_call_begin|>` to be emitted as reasoning content when it should be suppressed.

Specific Leak Scenario
When the model streams output like `Reasoning... <|tool_calls_section_begin|> spurious text <|tool_call_begin|>...`, the deltas are processed as:

- `"Reasoning... "` → `DeltaMessage(content="Reasoning... ")` (correct)
- `"<|tool_calls_section_begin|>"` → stripped, no output (correct)
- `" spurious text "` → `DeltaMessage(content=" spurious text ")` (LEAKED!)
- `"<|tool_call_begin|>..."` → starts tool call parsing (correct)

Prior Behavior
The original condition at lines 156-162 checked whether the tool call begin/end counts were balanced. This incorrectly assumed "balanced counts = reasoning mode", failing to account for being inside the tool section but before any tool call begins.
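A hypothetical reconstruction of the flawed check (illustrative names; the real condition lives in `kimi_k2_tool_parser.py`):

```python
# Hypothetical reconstruction of the flawed pre-fix check and the shape of
# the fix: balanced tool-call markers alone do not mean "reasoning mode".
SECTION_BEGIN = "<|tool_calls_section_begin|>"
TOOL_CALL_BEGIN = "<|tool_call_begin|>"
TOOL_CALL_END = "<|tool_call_end|>"

def is_reasoning_buggy(text: str) -> bool:
    # Treats balanced begin/end counts as "reasoning mode", ignoring that we
    # may already be inside the tool section before any tool call begins.
    return text.count(TOOL_CALL_BEGIN) == text.count(TOOL_CALL_END)

def is_reasoning_fixed(text: str, in_tool_section: bool) -> bool:
    # The fix layers section-level state on top of the balanced-count check.
    return not in_tool_section and is_reasoning_buggy(text)
```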
Solution
Core Changes
1. Added Section-Level State Machine
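A minimal sketch of the three states named below, assuming a simple marker-driven transition table (illustrative, not the literal vLLM code):

```python
# Minimal sketch of the three-state machine; the enum and transition table
# are illustrative, not the literal vLLM implementation.
from enum import Enum, auto

class ParserState(Enum):
    REASONING = auto()
    TOOL_SECTION = auto()
    TOOL_CALL_ACTIVE = auto()

TRANSITIONS = {
    (ParserState.REASONING, "<|tool_calls_section_begin|>"): ParserState.TOOL_SECTION,
    (ParserState.TOOL_SECTION, "<|tool_call_begin|>"): ParserState.TOOL_CALL_ACTIVE,
    (ParserState.TOOL_CALL_ACTIVE, "<|tool_call_end|>"): ParserState.TOOL_SECTION,
    (ParserState.TOOL_SECTION, "<|tool_calls_section_end|>"): ParserState.REASONING,
}

def step(state: ParserState, marker: str) -> ParserState:
    """Advance the state machine; unknown markers leave the state unchanged."""
    return TRANSITIONS.get((state, marker), state)
```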
- `in_tool_section: bool` flag to track if we're between `<|tool_calls_section_begin|>` and `<|tool_calls_section_end|>`
- State transitions: REASONING ↔ TOOL_SECTION ↔ TOOL_CALL_ACTIVE

2. Implemented Rolling Buffer for Split Markers
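The split-marker detection can be sketched as a rolling buffer that keeps only a bounded tail of recent text (helper names are illustrative; the 1KB cap mirrors the figure in this PR):

```python
# Sketch of the rolling-buffer idea: keep a bounded tail of recent text so a
# marker split across two deltas is still detected.
MARKERS = ["<|tool_calls_section_begin|>", "<|tool_calls_section_end|>"]
BUFFER_MAX = 1024  # 1KB cap

def feed(buffer: str, delta: str) -> tuple[str, bool]:
    """Append one delta and report whether any marker is now visible."""
    buffer += delta
    found = any(marker in buffer for marker in MARKERS)
    if len(buffer) > BUFFER_MAX:
        buffer = buffer[-BUFFER_MAX:]  # drop the oldest text, keep the tail
    return buffer, found
```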
- `token_buffer: str` accumulates deltas to detect markers split across chunks
- Example: `"<|tool_calls_sec"` + `"tion_begin|>"` → correctly detected
- `_buffer_overflow_logged` flag to log the overflow warning only once

3. Content Suppression Logic
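A simplified, hypothetical sketch of the suppression rule: text inside the tool section but before the first `<|tool_call_begin|>` is dropped rather than emitted as content:

```python
# Simplified, hypothetical sketch of the suppression rule: yield only chunks
# that belong in the content/reasoning channel. Markers and in-section noise
# are dropped; tool-call parsing would take over at <|tool_call_begin|>.
SECTION_BEGIN = "<|tool_calls_section_begin|>"
TOOL_CALL_BEGIN = "<|tool_call_begin|>"

def route_deltas(chunks):
    in_tool_section = False
    for chunk in chunks:
        if SECTION_BEGIN in chunk:
            in_tool_section = True
            continue  # the marker itself is never emitted
        if TOOL_CALL_BEGIN in chunk:
            break  # hand off to tool-call parsing
        if in_tool_section:
            continue  # the fix: suppress text between the two markers
        yield chunk
```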
4. Marker Variant Support
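One way to match both spellings is a single regex with an optional `s` (an illustrative pattern, not necessarily the exact one used in the parser):

```python
# One regex covering both marker spellings: the optional "s" accepts the
# plural "tool_calls" and singular "tool_call" section markers.
import re

SECTION_MARKER_RE = re.compile(r"<\|tool_calls?_section_(begin|end)\|>")
```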
- Supports both `<|tool_calls_section_begin|>` (plural) and `<|tool_call_section_begin|>` (singular)

5. Error Recovery
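The recovery check can be sketched as a running character count that resets the section state once the limit is exceeded (hypothetical helper; the 8KB figure matches this PR's description):

```python
# Illustrative error-recovery check: if a tool section grows past the limit
# without closing, assume it is malformed and reset the section state.
MAX_SECTION_CHARS_DEFAULT = 8 * 1024  # 8KB limit from this PR

def recover_if_malformed(section_char_count: int, delta: str,
                         limit: int = MAX_SECTION_CHARS_DEFAULT):
    """Return (new_count, recovered); recovered=True means state was reset."""
    new_count = section_char_count + len(delta)
    if new_count > limit:
        return 0, True  # reset the counter and stop suppressing output
    return new_count, False
```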
- `section_char_count` to detect malformed tool sections

6. State Reset Mechanism
- `reset_streaming_state()` public method

7. Function Contract Preservation
- Returns `DeltaMessage(content="")` instead of `None`

Files Changed
Modified
- `vllm/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py`
  - New state variables (`in_tool_section`, `token_buffer`, `section_char_count`, `buffer_max_size`, `max_section_chars`, `_buffer_overflow_logged`)
  - `_check_and_strip_markers()` helper for buffer processing
  - `_reset_section_state()` and `reset_streaming_state()` methods
  - Updated `extract_tool_calls_streaming()`
- `tests/tool_use/test_kimi_k2_tool_parser.py`
  - `test_token_leak_between_section_and_tool_begin()` - Main bug: leak prevention
  - `test_split_markers_across_deltas()` - Buffer functionality
  - `test_marker_variants()` - Singular/plural support
  - `test_reentry_to_reasoning_after_tool_section()` - State transitions
  - `test_empty_tool_section()` - Edge case
  - `test_malformed_tool_section_recovery()` - Error recovery
  - `test_state_reset()` - State management
  - `test_section_begin_noise_tool_begin_same_chunk()` - Same-chunk suppression
  - `test_stream_ends_without_section_end_marker()` - EOF handling

Testing
- New test suite: `tests/tool_use/test_kimi_k2_tool_parser.py` (pytest-based)
- `pytest -s -v tests/tool_use/test_kimi_k2_tool_parser.py` – all pass
- `pytest -s -v tests/tool_use/` – no regressions

Scope
- `KimiK2ToolParser` in streaming mode