fix: suppress tool call XML from streaming text content (#129) #232
sjswerdloff wants to merge 7 commits into waybarrios:main
Tool call XML (e.g. <minimax:tool_call>, <tool_call>) was leaking into streaming text deltas via the /v1/messages endpoint. The raw markup appeared in the client's conversation context alongside the structured tool_use block, doubling token consumption for every tool call.

Add StreamingToolCallFilter that buffers streaming text and suppresses content inside tool call blocks. Handles tags split across multiple deltas, multiple tool calls per response, and preserves <think> blocks.

Supports MiniMax (<minimax:tool_call>) and Qwen (<tool_call>) formats.

14 unit tests included.

Fixes waybarrios#129
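The buffering behaviour described above can be illustrated with a minimal sketch. This is not the PR's implementation: the class name `ToolCallFilterSketch`, its `feed`/`flush` methods, and the hold-back strategy are all assumptions made for illustration, and only the two tag formats named in this commit are covered.

```python
class ToolCallFilterSketch:
    """Hypothetical stand-in for StreamingToolCallFilter (illustration only)."""

    # (open tag, close tag) pairs for the two formats named in the commit.
    TAGS = (
        ("<minimax:tool_call>", "</minimax:tool_call>"),
        ("<tool_call>", "</tool_call>"),
    )

    def __init__(self) -> None:
        self._buf = ""      # text received but not yet emitted or discarded
        self._close = None  # close tag being waited for while inside a block

    def _partial_suffix_len(self) -> int:
        # Length of the longest buffer suffix that could begin an open tag.
        best = 0
        for open_tag, _ in self.TAGS:
            for n in range(min(len(open_tag) - 1, len(self._buf)), 0, -1):
                if open_tag.startswith(self._buf[-n:]):
                    best = max(best, n)
                    break
        return best

    def feed(self, delta: str) -> str:
        """Accept one streaming delta and return the text that is safe to emit."""
        self._buf += delta
        out = []
        while self._buf:
            if self._close is not None:
                end = self._buf.find(self._close)
                if end == -1:
                    # Discard block content, keeping only a tail that might
                    # be the first part of a split close tag.
                    self._buf = self._buf[-(len(self._close) - 1):]
                    break
                self._buf = self._buf[end + len(self._close):]
                self._close = None
                continue
            hits = [(self._buf.find(o), o, c) for o, c in self.TAGS]
            hits = [h for h in hits if h[0] != -1]
            if hits:
                start, open_tag, close_tag = min(hits)  # earliest open tag
                out.append(self._buf[:start])
                self._buf = self._buf[start + len(open_tag):]
                self._close = close_tag
                continue
            # No complete open tag: emit everything except a possible tag prefix.
            keep = self._partial_suffix_len()
            cut = len(self._buf) - keep
            out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            break
        return "".join(out)

    def flush(self) -> str:
        """At end of stream, release held-back text unless inside a block."""
        tail = "" if self._close is not None else self._buf
        self._buf = ""
        return tail
```

Text outside a tool call block passes through unchanged, so `<think>` blocks are preserved. A partial tag such as `<tool` at the end of a delta is held back until the next delta shows whether it completes an open tag.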
Add [Calling tool: ...] to the streaming filter tag list. MiniMax-M2.5 uses this format for some tool calls alongside its native XML format.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
MiniMax generates multiple tool call formats:
- <minimax:tool_call> XML (native)
- <tool_call> (Qwen)
- [Calling tool: ...] and [Calling tool=...] (bracket variants)
- [TOOL_CALL]...[/TOOL_CALL] (block format)

Consolidate bracket variants under a single [Calling tool prefix with newline as delimiter. Add [TOOL_CALL] block format.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
Add <function=name>...</function> (Llama-style) to filtered tags.

Now covers all formats supported by parse_tool_calls():
- MiniMax XML, Qwen XML, Qwen3 bracket, Llama function, Nemotron (via <tool_call>), and [TOOL_CALL] block.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
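One way to picture the consolidated tag list is as a table of opener and closer pairs, where the bracket variants share the `[Calling tool` prefix and end at the next newline. The sketch below works on complete text rather than streaming deltas to keep it short. The names `TOOL_CALL_TAGS` and `strip_tool_markup` are hypothetical, and the real filter's table may differ.

```python
# Hypothetical (opener, closer) table covering the formats listed above.
TOOL_CALL_TAGS = [
    ("<minimax:tool_call>", "</minimax:tool_call>"),  # MiniMax native XML
    ("<tool_call>", "</tool_call>"),                  # Qwen, Nemotron
    ("[TOOL_CALL]", "[/TOOL_CALL]"),                  # block format
    ("<function=", "</function>"),                    # Llama-style
    ("[Calling tool", "\n"),                          # bracket variants, newline-delimited
]


def strip_tool_markup(text: str) -> str:
    """Return text with every tool call block from TOOL_CALL_TAGS removed."""
    out = []
    pos = 0
    while pos < len(text):
        hits = [(text.find(o, pos), o, c) for o, c in TOOL_CALL_TAGS]
        hits = [h for h in hits if h[0] != -1]
        if not hits:
            out.append(text[pos:])
            break
        start, opener, closer = min(hits)  # earliest opener wins
        out.append(text[pos:start])
        end = text.find(closer, start + len(opener))
        if end == -1:
            break  # unterminated block: drop the remainder
        pos = end + len(closer)
    return "".join(out)
```

Matching on the shared prefix means `[Calling tool: ...]` and `[Calling tool=...]` need only one table entry, at the cost of also consuming the newline that ends the line.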
Note: There is an existing PR that addresses the same core issue (#129): #132. That approach uses a state machine ( Key differences in this PR:
Happy to contribute the additional format coverage to #132 if consolidating is preferred, or keep this as a standalone focused fix. Either way, the MiniMax and bracket-style formats need coverage.
Add StreamingThinkRouter that separates thinking from response text. Models that inject <think> in the generation prompt (MiniMax, Qwen3, DeepSeek-R1) are auto-detected from the chat template.

Stream pipeline: raw text → tool call filter → think router → emit

Thinking content emits as Anthropic thinking content blocks (thinking_delta) so clients render them distinctly from responses.
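The routing stage can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `ThinkRouterSketch`, the `starts_in_thinking` flag (standing in for the chat-template detection result), and `to_anthropic_delta` are hypothetical names. The `thinking_delta` and `text_delta` shapes follow Anthropic's streaming format.

```python
from typing import Dict, List, Tuple


class ThinkRouterSketch:
    """Hypothetical stand-in for StreamingThinkRouter (illustration only)."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self, starts_in_thinking: bool = False) -> None:
        # If the prompt already contains <think>, output begins inside thinking.
        self._thinking = starts_in_thinking
        self._buf = ""

    def feed(self, delta: str) -> List[Tuple[str, str]]:
        """Return (kind, text) pairs, where kind is 'thinking' or 'text'."""
        self._buf += delta
        events: List[Tuple[str, str]] = []
        while self._buf:
            tag = self.CLOSE if self._thinking else self.OPEN
            kind = "thinking" if self._thinking else "text"
            idx = self._buf.find(tag)
            if idx != -1:
                if idx:
                    events.append((kind, self._buf[:idx]))
                self._buf = self._buf[idx + len(tag):]
                self._thinking = not self._thinking
                continue
            # Hold back a suffix that might be the start of a split tag.
            keep = 0
            for n in range(min(len(tag) - 1, len(self._buf)), 0, -1):
                if tag.startswith(self._buf[-n:]):
                    keep = n
                    break
            cut = len(self._buf) - keep
            if cut:
                events.append((kind, self._buf[:cut]))
            self._buf = self._buf[cut:]
            break
        return events


def to_anthropic_delta(kind: str, text: str) -> Dict[str, str]:
    """Map a routed chunk to the delta payload of a content_block_delta event."""
    if kind == "thinking":
        return {"type": "thinking_delta", "thinking": text}
    return {"type": "text_delta", "text": text}
```

In the pipeline above, the tool call filter's output would be passed to `feed`, and each resulting pair converted to a delta before being emitted.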
Good fix for a real problem -- tool call XML leaking into streaming text deltas is something we've hit in production. A few notes:
uv.lock is not tracked upstream - accidentally included.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
Thanks for the review @Thump604!

1. uv.lock removed: pushed a commit removing it. Accidentally included.
2. Ordering with tool parsers: The filter prevents markup from appearing in
3. Interaction with #177: Our filter operates on the raw delta before reasoning parsing. #177 fixes tool calls inside reasoning blocks after reasoning extraction. These are different stages of the pipeline: our filter strips markup at the delta level, while #177 ensures the reasoning→tool_call handoff works correctly. Should be compatible.
4. MiniMax format: Yes,
_stream_anthropic_messages() never read prompt_tokens from the engine, always reporting 0 input_tokens. Now tracks prompt_tokens alongside completion_tokens and includes input_tokens in message_delta usage.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
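As a rough sketch of the usage fix, assuming engine outputs expose `prompt_tokens` and `completion_tokens` attributes (that attribute layout and the helper names `track_usage` and `message_delta_event` are hypothetical, not taken from the PR), the final event could be built like this:

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def track_usage(outputs: Iterable[Any]) -> Tuple[int, int]:
    """Keep the latest non-zero token counts seen across streamed outputs."""
    prompt_tokens = 0
    completion_tokens = 0
    for out in outputs:
        # Assumed attribute names; missing or zero values keep the previous count.
        prompt_tokens = getattr(out, "prompt_tokens", 0) or prompt_tokens
        completion_tokens = getattr(out, "completion_tokens", 0) or completion_tokens
    return prompt_tokens, completion_tokens


def message_delta_event(
    stop_reason: Optional[str], prompt_tokens: int, completion_tokens: int
) -> Dict[str, Any]:
    """Build a message_delta payload that reports both token counts."""
    return {
        "type": "message_delta",
        "delta": {"stop_reason": stop_reason, "stop_sequence": None},
        "usage": {
            "input_tokens": prompt_tokens,
            "output_tokens": completion_tokens,
        },
    }
```

Before the fix, the equivalent of `prompt_tokens` was never read, so `input_tokens` was always reported as 0.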
Thanks for the detailed response. The upstream-of-parsers flow makes sense and the #177 compatibility analysis is correct: different pipeline stages, no conflict. LGTM.