fix(anthropic_adapter): strip streaming XML artifacts from text blocks to prevent infinite reasoning loops #158
Closed
PollyBot13 wants to merge 2 commits into waybarrios:main from
Conversation
In streaming mode, _stream_anthropic_messages() emits all model output (including <think> reasoning and <tool_call> XML) as text_block deltas before extracting tool_use blocks from the accumulated text. The assistant message history therefore contains both:

- content[0] = text block with raw '<think>...</think>\n<tool_call>...</tool_call>'
- content[1] = tool_use block with the parsed tool call

When _convert_message() rebuilds the conversation for the next turn, the chat template (particularly Qwen3.5's Jinja template) receives an assistant message whose text content still contains <tool_call> XML. The template renders it verbatim AND appends the tool_calls field separately, so the model receives a prompt with the same tool call twice:

1. Once from the leaked XML in the text content
2. Once from the tool_calls field (the correct path)

This causes Qwen3.5 to enter an infinite reasoning loop trying to reconcile which tool call the tool_result corresponds to. All output is inside <think> tags, which SPECIAL_TOKENS_PATTERN strips to empty strings; the server logs show '2 chunks total, elapsed=900s' because nothing visible is produced until max_tokens is exhausted.

Fix: strip <think> and <tool_call> XML from text blocks in _convert_message() before building the OpenAI-format message. The tool_use blocks remain intact (the correct source of truth for tool calls), so tool calling continues to work. Text content that is genuinely part of the response (e.g. 'I will search for X') is preserved.

Observed symptom: the first tool call works, then the model goes silent for ~15 minutes after receiving tool results (900-second stalls observed).

Affects: Qwen3.5 and any other model whose chat template renders tool calls from both the text content and the tool_calls field.
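The cleanup step described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the helper name strip_reasoning_xml and the exact regex patterns are assumptions, chosen to match the behavior the commit message describes (drop leaked <think>/<tool_call> XML, keep genuine prose).

```python
import re

# Hypothetical patterns for the leaked XML; the real adapter's tag set
# and regex flags may differ. DOTALL lets the match span newlines.
THINK_PATTERN = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
TOOL_CALL_PATTERN = re.compile(r"<tool_call>.*?</tool_call>\s*", re.DOTALL)


def strip_reasoning_xml(text: str) -> str:
    """Remove <think> and <tool_call> XML from an assistant text block.

    tool_use blocks remain the source of truth for tool calls, so only
    the leaked XML copies are dropped; genuine prose is preserved.
    """
    text = THINK_PATTERN.sub("", text)
    text = TOOL_CALL_PATTERN.sub("", text)
    return text.strip()


print(strip_reasoning_xml(
    '<think>reasoning...</think>\n'
    '<tool_call>{"name": "search", "arguments": {}}</tool_call>\n'
    'I will search for X.'
))
# → I will search for X.
```

With this applied in _convert_message(), the chat template only sees the tool_calls field, so the duplicate-tool-call prompt never reaches the model.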
The streaming path in _stream_anthropic_messages() sent every model token immediately as content_block_delta events, including raw <tool_call> and <think> XML. The tool call parser only ran after the stream completed, so by then the raw XML had already been sent to the client as text content.

Fix: buffer all text during streaming, then parse and clean it before emitting. The client receives clean text plus proper tool_use blocks.

Trade-off: text responses are emitted as one chunk instead of token-by-token. For most API consumers this has no practical impact, since they wait for message_stop anyway.
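The buffer-then-parse fix can be sketched like this. Event type names follow the Anthropic Messages streaming format, but the generator, the parsing regexes, and the event dict shapes here are illustrative assumptions, not the adapter's real implementation.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def stream_cleaned(token_iter):
    # Buffer every token instead of forwarding deltas immediately,
    # so raw XML never reaches the client mid-stream.
    raw = "".join(token_iter)

    # Parse tool calls out of the accumulated text.
    tool_calls = TOOL_CALL_RE.findall(raw)

    # Strip <think> reasoning and <tool_call> XML from the visible text.
    text = THINK_RE.sub("", raw)
    text = TOOL_CALL_RE.sub("", text).strip()

    events = []
    if text:
        # Trade-off noted above: one text chunk, not token-by-token.
        events.append({"type": "content_block_delta",
                       "delta": {"type": "text_delta", "text": text}})
    for call in tool_calls:
        events.append({"type": "content_block_start",
                       "content_block": {"type": "tool_use", "input": call}})
    return events
```

Because the whole stream is buffered before any text event is emitted, the client can never observe the raw XML, regardless of how the model interleaves reasoning and tool calls.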
Collaborator
Hey @PollyBot13 — thanks for identifying these two bugs (multi-turn XML leak + first-turn streaming leak). Both are now resolved in
These landed via PR #256 (reasoning parsers, Gemma 4 patches) and PR #278 (production backport with full streaming pipeline). Closing as superseded; the underlying issues are fully addressed. Thanks for the contribution!
Fix: Streaming path leaks raw XML into text content blocks
Problem

When using Qwen3.5 (or similar models that emit <tool_call> and <think> XML), the Anthropic Messages API streaming path has two bugs:

1. Multi-turn history: _convert_message() passes raw XML from previous assistant text blocks back into the chat template, causing duplicate tool calls and infinite reasoning loops on turn 2+.
2. First-turn streaming: _stream_anthropic_messages() sends every model token immediately as content_block_delta events, including <tool_call> and <think> XML. The tool call parser runs after the stream completes, so raw XML is already sent to the client as visible text.

Fix

- Strip <think> and <tool_call> XML from text blocks in _convert_message() before they enter the chat template.
- Tool calls continue to come from the tool_use blocks.

Testing