fix: suppress tool call XML from streaming text content (#129) #232

Open
sjswerdloff wants to merge 7 commits into waybarrios:main from sjswerdloff:fix/streaming-tool-call-content-leak

@sjswerdloff

Tool call XML (e.g. <minimax:tool_call>, <tool_call>) was leaking into streaming text deltas via the /v1/messages endpoint. The raw markup appeared in the client's conversation context alongside the structured tool_use block, doubling token consumption for every tool call.

Add StreamingToolCallFilter that buffers streaming text and suppresses content inside tool call blocks. Handles tags split across multiple deltas, multiple tool calls per response, and preserves <think> blocks.

Supports MiniMax (<minimax:tool_call>) and Qwen (<tool_call>) formats.
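
For readers who want the idea without opening the diff, here is a minimal sketch of a buffering tag-pair filter. It is not the code in this PR: the class name `ToolCallFilterSketch`, its `feed`/`flush` methods, and the hold-back strategy are assumptions made for illustration. It only covers the two XML formats named above.

```python
class ToolCallFilterSketch:
    """Hypothetical stand-in for StreamingToolCallFilter (not the PR's code)."""

    # (open, close) pairs for the MiniMax and Qwen formats.
    TAG_PAIRS = (
        ("<minimax:tool_call>", "</minimax:tool_call>"),
        ("<tool_call>", "</tool_call>"),
    )

    def __init__(self) -> None:
        self._buf = ""       # text not yet emitted or discarded
        self._close = None   # closing tag we are waiting for, or None

    @staticmethod
    def _partial_suffix_len(text, tags):
        """Longest suffix of text that is a proper prefix of any tag."""
        best = 0
        for tag in tags:
            for n in range(min(len(tag) - 1, len(text)), 0, -1):
                if text.endswith(tag[:n]):
                    best = max(best, n)
                    break
        return best

    def feed(self, delta: str) -> str:
        """Consume one streaming delta and return the text safe to show."""
        self._buf += delta
        out = []
        while True:
            if self._close is None:
                # Outside a block: look for the earliest opening tag.
                hits = []
                for open_tag, close_tag in self.TAG_PAIRS:
                    pos = self._buf.find(open_tag)
                    if pos != -1:
                        hits.append((pos, open_tag, close_tag))
                if hits:
                    pos, open_tag, close_tag = min(hits)
                    out.append(self._buf[:pos])
                    self._buf = self._buf[pos + len(open_tag):]
                    self._close = close_tag
                    continue
                # Hold back a tail that might be the start of a split tag.
                opens = [o for o, _ in self.TAG_PAIRS]
                keep = self._partial_suffix_len(self._buf, opens)
                cut = len(self._buf) - keep
                out.append(self._buf[:cut])
                self._buf = self._buf[cut:]
                break
            # Inside a block: discard everything up to the closing tag.
            pos = self._buf.find(self._close)
            if pos != -1:
                self._buf = self._buf[pos + len(self._close):]
                self._close = None
                continue
            keep = self._partial_suffix_len(self._buf, [self._close])
            self._buf = self._buf[len(self._buf) - keep:]
            break
        return "".join(out)

    def flush(self) -> str:
        """End of stream: release held-back text unless a block is unclosed."""
        tail = "" if self._close is not None else self._buf
        self._buf = ""
        self._close = None
        return tail
```

Feeding the deltas `"Let me check. <tool"`, `"_call>{...}</tool"`, `"_call> Done."` through `feed` yields only the surrounding prose; `<think>` text passes through untouched because it matches no pair in the table.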

14 unit tests included.

Fixes #129

sjswerdloff and others added 2 commits March 29, 2026 22:41
Tool call XML (e.g. <minimax:tool_call>, <tool_call>) was leaking into
streaming text deltas via the /v1/messages endpoint. The raw markup
appeared in the client's conversation context alongside the structured
tool_use block, doubling token consumption for every tool call.

Add StreamingToolCallFilter that buffers streaming text and suppresses
content inside tool call blocks. Handles tags split across multiple
deltas, multiple tool calls per response, and preserves <think> blocks.

Supports MiniMax (<minimax:tool_call>) and Qwen (<tool_call>) formats.

14 unit tests included.

Fixes waybarrios#129
Add [Calling tool: ...)] to the streaming filter tag list.
MiniMax-M2.5 uses this format for some tool calls alongside its
native XML format.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
@sjswerdloff sjswerdloff marked this pull request as draft March 29, 2026 09:52
sjswerdloff and others added 2 commits March 29, 2026 22:56
MiniMax generates multiple tool call formats:
- <minimax:tool_call> XML (native)
- <tool_call> (Qwen)
- [Calling tool: ...] and [Calling tool=...] (bracket variants)
- [TOOL_CALL]...[/TOOL_CALL] (block format)

Consolidate bracket variants under single [Calling tool prefix with
newline as delimiter. Add [TOOL_CALL] block format.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
Add <function=name>...</function> (Llama-style) to filtered tags.
Now covers all formats supported by parse_tool_calls():
- MiniMax XML, Qwen XML, Qwen3 bracket, Llama function,
  Nemotron (via <tool_call>), and [TOOL_CALL] block.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
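
As a rough illustration of the format coverage described in the commits above, the tag list can be pictured as a table of (open, close) pairs, with the bracket variants sharing one `[Calling tool` prefix that ends at the next newline. The table and the helper `strip_tool_markup` below are assumptions for illustration, applied to a complete string rather than a stream; the actual list in the PR may differ.

```python
# Hypothetical tag table built from the formats named in the commit messages.
TOOL_CALL_TAGS = [
    ("<minimax:tool_call>", "</minimax:tool_call>"),  # MiniMax XML
    ("<tool_call>", "</tool_call>"),                  # Qwen XML (also Nemotron)
    ("[TOOL_CALL]", "[/TOOL_CALL]"),                  # block format
    ("<function=", "</function>"),                    # Llama-style function
    ("[Calling tool", "\n"),  # "[Calling tool: ...]" and "[Calling tool=...]"
]


def strip_tool_markup(text: str) -> str:
    """Remove every tagged span from a complete (non-streamed) string."""
    out = []
    pos = 0
    while pos < len(text):
        # Find the earliest opening tag at or after pos.
        hits = []
        for open_tag, close_tag in TOOL_CALL_TAGS:
            start = text.find(open_tag, pos)
            if start != -1:
                hits.append((start, open_tag, close_tag))
        if not hits:
            out.append(text[pos:])
            break
        start, open_tag, close_tag = min(hits)
        out.append(text[pos:start])
        end = text.find(close_tag, start + len(open_tag))
        if end == -1:
            break  # unterminated block: drop the rest of the text
        pos = end + len(close_tag)
    return "".join(out)
```

Using a newline as the closing delimiter means a bracket-style call with no trailing newline is treated as unterminated, so in this sketch everything after it is dropped.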
@sjswerdloff sjswerdloff marked this pull request as ready for review March 29, 2026 10:03
@sjswerdloff
Author

Note: There is an existing PR that addresses the same core issue (#129): #132.

That approach uses a state machine (_AnthropicStreamScrubber) that also handles <think> tag routing. This PR takes a simpler tag-pair filter approach focused specifically on tool call markup suppression.

Key differences in this PR:

Happy to contribute the additional format coverage to #132 if consolidating is preferred, or keep this as a standalone focused fix. Either way, the MiniMax and bracket-style formats need coverage.

@sjswerdloff sjswerdloff marked this pull request as draft March 29, 2026 10:28
Add StreamingThinkRouter that separates thinking from response text.
Models that inject <think> in the generation prompt (MiniMax, Qwen3,
DeepSeek-R1) are auto-detected from the chat template.

Stream pipeline: raw text → tool call filter → think router → emit

Thinking content emits as Anthropic thinking content blocks
(thinking_delta) so clients render them distinctly from responses.
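
A minimal sketch of the routing step described above, under stated assumptions: `ThinkRouterSketch`, its `start_in_thinking` flag, and `to_anthropic_delta` are hypothetical names, not the PR's StreamingThinkRouter. The `start_in_thinking` flag models the case where the chat template has already injected `<think>` into the prompt, so the stream begins inside a thinking block. The delta shapes follow the Anthropic streaming format (`thinking_delta` and `text_delta`).

```python
class ThinkRouterSketch:
    """Hypothetical stand-in for StreamingThinkRouter (not the PR's code)."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self, start_in_thinking: bool = False) -> None:
        self._thinking = start_in_thinking
        self._buf = ""

    @staticmethod
    def _held_back(text: str, tag: str) -> int:
        """Length of the tail of text that could be the start of tag."""
        for n in range(min(len(tag) - 1, len(text)), 0, -1):
            if text.endswith(tag[:n]):
                return n
        return 0

    def feed(self, delta: str):
        """Return a list of (kind, text) pieces; kind is 'thinking' or 'text'."""
        self._buf += delta
        pieces = []
        while True:
            tag = self.CLOSE if self._thinking else self.OPEN
            kind = "thinking" if self._thinking else "text"
            pos = self._buf.find(tag)
            if pos != -1:
                if pos:
                    pieces.append((kind, self._buf[:pos]))
                self._buf = self._buf[pos + len(tag):]
                self._thinking = not self._thinking
                continue
            # No full tag yet: emit what is safe, keep a possible partial tag.
            cut = len(self._buf) - self._held_back(self._buf, tag)
            if cut:
                pieces.append((kind, self._buf[:cut]))
            self._buf = self._buf[cut:]
            return pieces


def to_anthropic_delta(kind: str, text: str) -> dict:
    """Map a routed piece to the delta payload of a content_block_delta event."""
    if kind == "thinking":
        return {"type": "thinking_delta", "thinking": text}
    return {"type": "text_delta", "text": text}
```

In the pipeline order given above, the tool call filter would run first, and its output would be fed into a router like this one before anything is emitted.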
@sjswerdloff sjswerdloff marked this pull request as ready for review March 29, 2026 10:34
@Thump604
Contributor

Good fix for a real problem -- tool call XML leaking into streaming text deltas is something we've hit in production. A few notes:

  1. The uv.lock file (+7571 lines) should probably be excluded from this PR -- it's unrelated noise and makes review harder.

  2. How does this interact with the existing tool call parsers (qwen3_xml, hermes)? The parsers already extract tool calls from the response. Does this filter operate upstream of them in the streaming path? It would be good to clarify the ordering: does StreamingToolCallFilter run before or after the parser's extract_tool_calls_streaming()?

  3. Our PR #177, "fix: parse tool calls in streaming reasoning branch" (streaming tool+reasoning coexistence), addresses a related issue -- tool calls being lost when they appear inside reasoning blocks. The two fixes are complementary but touch the same streaming path, so it is worth checking how they interact.

  4. The MiniMax format support (<minimax:tool_call>) is useful -- we have MiniMax tool parsing via qwen3_coder (HermesToolParser) and this would clean up the streaming side.

uv.lock is not tracked upstream - accidentally included.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
@sjswerdloff
Author

Thanks for the review @Thump604!

1. uv.lock removed — pushed a commit removing it. Accidentally included.

2. Ordering with tool parsers: StreamingToolCallFilter runs upstream of the parsers. The flow is:

raw delta text → StreamingToolCallFilter (strips markup from visible text) → emit to client
accumulated unfiltered text → _parse_tool_calls_with_parser() at stream end → structured tool_calls in response

The filter prevents markup from appearing in content deltas. The parser extracts structured tool calls from the full accumulated text after streaming completes. They're complementary — filter handles presentation, parser handles extraction.
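
To make the two paths concrete, here is a small sketch of the ordering. Everything in it is a simplified stand-in: `make_toy_filter` assumes each tag arrives whole in its own delta (the real filter handles split tags), and `parse_tool_calls` is a regex placeholder for `_parse_tool_calls_with_parser()`, not its actual behavior.

```python
import json
import re

OPEN, CLOSE = "<tool_call>", "</tool_call>"


def make_toy_filter():
    """Toy visible-text filter; assumes every tag is its own delta."""
    inside = False

    def feed(delta: str) -> str:
        nonlocal inside
        if delta == OPEN:
            inside = True
            return ""
        if delta == CLOSE:
            inside = False
            return ""
        return "" if inside else delta

    return feed


def parse_tool_calls(raw: str):
    """Placeholder end-of-stream parser: JSON bodies of <tool_call> blocks."""
    bodies = re.findall(r"<tool_call>(.*?)</tool_call>", raw, re.S)
    return [json.loads(body) for body in bodies]


def stream(deltas):
    """Yield (event_type, payload) tuples in the order a client would see them."""
    feed = make_toy_filter()
    raw = []
    for delta in deltas:
        raw.append(delta)      # path 2: accumulate the unfiltered text
        visible = feed(delta)  # path 1: strip markup from the visible text
        if visible:
            yield ("text_delta", visible)
    # After streaming completes, extract structured calls from the full text.
    for call in parse_tool_calls("".join(raw)):
        yield ("tool_use", call)
```

The point of the sketch is that the parser never sees filtered text: the filter only affects what is emitted as content deltas, while extraction works on the raw accumulation.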

3. Interaction with #177: Our filter operates on the raw delta before reasoning parsing. #177 fixes tool calls inside reasoning blocks after reasoning extraction. These are different stages of the pipeline — our filter strips markup at the delta level, #177 ensures the reasoning→tool_call handoff works correctly. Should be compatible.

4. MiniMax format: Yes, <minimax:tool_call> is the format MiniMax-M2.5 uses natively. Without the filter, the full XML block leaks into streaming text deltas.

_stream_anthropic_messages() never read prompt_tokens from the engine,
always reporting 0 input_tokens. Now tracks prompt_tokens alongside
completion_tokens and includes input_tokens in message_delta usage.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
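
A minimal sketch of the usage fix described in this commit, assuming engine output chunks are dicts with `prompt_tokens` and `completion_tokens` keys (hypothetical field names; the real engine objects may differ). `track_usage` and `message_delta_event` are illustrative helpers, not functions from the PR. The event layout follows the Anthropic streaming `message_delta` shape, with `input_tokens` added to `usage` as the commit describes.

```python
import json


def track_usage(chunks):
    """Keep the latest non-zero token counts seen across engine chunks."""
    prompt_tokens = 0
    completion_tokens = 0
    for chunk in chunks:
        prompt_tokens = chunk.get("prompt_tokens") or prompt_tokens
        completion_tokens = chunk.get("completion_tokens") or completion_tokens
    return prompt_tokens, completion_tokens


def message_delta_event(stop_reason, prompt_tokens, completion_tokens):
    """Format the final message_delta SSE event with both token counts."""
    payload = {
        "type": "message_delta",
        "delta": {"stop_reason": stop_reason, "stop_sequence": None},
        "usage": {
            "input_tokens": prompt_tokens,
            "output_tokens": completion_tokens,
        },
    }
    return f"event: message_delta\ndata: {json.dumps(payload)}\n\n"
```

Before the fix, the equivalent of `prompt_tokens` was never read, so clients always saw 0 input tokens.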
@Thump604
Contributor

Thanks for the detailed response. The upstream-of-parsers flow makes sense and the #177 compatibility analysis is correct — different pipeline stages, no conflict. LGTM.

Development

Successfully merging this pull request may close these issues.

Anthropic Streaming /v1/messages Leaks <think> and <tool_call> Markup Before Structured tool_use

2 participants