
fix(anthropic_adapter): strip streaming XML artifacts from text blocks to prevent infinite reasoning loops#158

Closed
PollyBot13 wants to merge 2 commits into waybarrios:main from PollyBot13:fix/streaming-tool-call-xml-leak

Conversation

@PollyBot13

@PollyBot13 PollyBot13 commented Mar 13, 2026

Fix: Streaming path leaks raw XML into text content blocks

Problem

When using Qwen3.5 (or similar models that emit <tool_call> and <think> XML), the Anthropic Messages API streaming path has two bugs:

  1. Multi-turn history: _convert_message() passes raw XML from previous assistant text blocks back into the chat template, causing duplicate tool calls and infinite reasoning loops on turn 2+.

  2. First-turn streaming: _stream_anthropic_messages() sends every model token immediately as content_block_delta events, including <tool_call> and <think> XML. The tool call parser runs after the stream completes, so raw XML is already sent to the client as visible text.

Fix

  1. Strip <think> and <tool_call> XML from text blocks in _convert_message() before they enter the chat template.
  2. Buffer streaming text, parse and clean it after completion, then emit clean text + proper tool_use blocks.
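
As a minimal sketch of fix 1, the stripping step could look like the following. The helper name and exact regexes are illustrative assumptions, not the PR's actual code:

```python
import re

# Illustrative patterns: remove complete <think>...</think> and
# <tool_call>...</tool_call> spans, non-greedy and across newlines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>\s*", re.DOTALL)

def strip_streaming_artifacts(text: str) -> str:
    """Remove reasoning/tool-call XML leaked into a text block,
    keeping genuine response text (e.g. 'I will search for X')."""
    text = THINK_RE.sub("", text)
    text = TOOL_CALL_RE.sub("", text)
    return text.strip()
```

Applied to a leaked block such as `'<think>plan</think>\nI will search for X\n<tool_call>{"name": "search"}</tool_call>'`, only `'I will search for X'` survives, so the chat template never sees the markup.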

Testing

  • All existing tests pass
  • Tested with Qwen3.5-35B-A3B-4bit: multi-turn tool calling works, no XML leaks in first or subsequent turns

In streaming mode, _stream_anthropic_messages() emits all model output
(including <think> reasoning and <tool_call> XML) as text
content_block_delta events before extracting tool_use blocks from the
accumulated text. As a result, the assistant message history contains both:
  - content[0] = text block with raw '<think>...</think>\n<tool_call>...</tool_call>'
  - content[1] = tool_use block with the parsed tool call

When _convert_message() rebuilds the conversation for the next turn,
the chat template (particularly Qwen3.5's Jinja template) receives an
assistant message where the text content still contains <tool_call> XML.
The template renders it verbatim AND appends the tool_calls field
separately, resulting in duplicate tool calls in the prompt.

The model then receives a prompt with the same tool call twice:
  1. Once from the leaked XML in the text content
  2. Once from the tool_calls field (correct path)

This causes Qwen3.5 to enter an infinite reasoning loop trying to
reconcile which tool call the tool_result corresponds to. All output is
inside <think> tags, which SPECIAL_TOKENS_PATTERN strips to empty
strings — the server logs show '2 chunks total, elapsed=900s' because
nothing visible is produced until max_tokens is exhausted.

Fix: strip <think> and <tool_call> XML from text blocks in
_convert_message() before building the OpenAI-format message. The
tool_use blocks remain intact (correct source of truth for tool calls),
so tool calling continues to work correctly. Text content that is
genuinely part of the response (e.g. 'I will search for X') is
preserved.

Observed symptom: first tool call works, model goes silent for ~15
minutes after receiving tool results (900-second stalls observed).

Affects: Qwen3.5 and any other model using chat templates that
render tool calls from both the text content and the tool_calls field.

The streaming path in _stream_anthropic_messages() sent every model token
immediately as content_block_delta events, including raw <tool_call> and
<think> XML. The tool call parser only ran after the stream completed,
so by then the raw XML was already sent to the client as text content.

Fix: buffer all text during streaming, then parse and clean it before
emitting. The client receives clean text + proper tool_use blocks.

Trade-off: text responses are emitted as one chunk instead of
token-by-token. For most API consumers this has no practical impact
since they wait for message_stop anyway.
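
The buffering approach can be sketched roughly as follows. Function and pattern names are placeholders for the adapter's actual internals, not the real implementation:

```python
import re
from typing import Iterable

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def finalize_stream(token_stream: Iterable[str]) -> tuple[str, list[str]]:
    """Buffer every token, then split the accumulated text into
    clean visible text and raw tool-call payloads."""
    buffered = "".join(token_stream)                # 1. buffer instead of forwarding
    tool_payloads = TOOL_CALL_RE.findall(buffered)  # 2. parse tool calls from the full text
    cleaned = TOOL_CALL_RE.sub("", buffered)        # 3. strip the XML markup
    cleaned = THINK_RE.sub("", cleaned).strip()
    return cleaned, tool_payloads  # emit as one text block + tool_use blocks
```

The key ordering change is that parsing now happens before any content_block_delta is emitted, so the client only ever sees the cleaned text.
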
@janhilgard
Collaborator

Hey @PollyBot13 — thanks for identifying these two bugs (multi-turn XML leak + first-turn streaming leak).

Both are now resolved in main through a different architecture:

Problem and current solution in main:

  • XML in multi-turn history: Reasoning parser (--reasoning-parser qwen3) extracts <think> blocks; the tool parser consumes <tool_call>/<function=...> markup, so only clean text enters the response and subsequent turns.
  • XML leak in streaming: tool_markup_possible flag, tool_accumulated_text buffering, the streaming reasoning parser, tool parser partial-marker buffering (PR #281), and safety nets for leaked markup.

These landed via PR #256 (reasoning parsers, Gemma 4 patches) and PR #278 (production backport with full streaming pipeline).

Closing as superseded — the underlying issues are fully addressed. Thanks for the contribution!

