
fix(anthropic_adapter): strip streaming XML artifacts from text blocks to prevent infinite reasoning loops#158

Closed
PollyBot13 wants to merge 2 commits into waybarrios:main from PollyBot13:fix/streaming-tool-call-xml-leak

Conversation

@PollyBot13

@PollyBot13 PollyBot13 commented Mar 13, 2026

Fix: Streaming path leaks raw XML into text content blocks

Problem

When using Qwen3.5 (or similar models that emit <tool_call> and <think> XML), the Anthropic Messages API streaming path has two bugs:

  1. Multi-turn history: _convert_message() passes raw XML from previous assistant text blocks back into the chat template, causing duplicate tool calls and infinite reasoning loops on turn 2+.

  2. First-turn streaming: _stream_anthropic_messages() sends every model token immediately as content_block_delta events, including <tool_call> and <think> XML. The tool call parser runs after the stream completes, so raw XML is already sent to the client as visible text.

Fix

  1. Strip <think> and <tool_call> XML from text blocks in _convert_message() before they enter the chat template.
  2. Buffer streaming text, parse and clean it after completion, then emit clean text + proper tool_use blocks.
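
As a minimal sketch of fix 1, the stripping step could look like the following. The helper name and exact regexes are illustrative assumptions, not the PR's actual code:

```python
import re

# Illustrative patterns: remove complete <think>...</think> and
# <tool_call>...</tool_call> spans, non-greedy and across newlines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>\s*", re.DOTALL)

def strip_streaming_artifacts(text: str) -> str:
    """Remove reasoning/tool-call XML leaked into a text block,
    keeping genuine response text (e.g. 'I will search for X')."""
    text = THINK_RE.sub("", text)
    text = TOOL_CALL_RE.sub("", text)
    return text.strip()
```

Applied to a leaked block such as `'<think>plan</think>\nI will search for X\n<tool_call>{"name": "search"}</tool_call>'`, only `'I will search for X'` survives, so the chat template never sees the markup.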

Testing

  • All existing tests pass
  • Tested with Qwen3.5-35B-A3B-4bit: multi-turn tool calling works, no XML leaks in first or subsequent turns

In streaming mode, _stream_anthropic_messages() emits all model output
(including <think> reasoning and <tool_call> XML) as text
content_block_delta events before extracting tool_use blocks from the
accumulated text. As a result, the assistant message history contains both:
  - content[0] = text block with raw '<think>...</think>\n<tool_call>...</tool_call>'
  - content[1] = tool_use block with the parsed tool call

When _convert_message() rebuilds the conversation for the next turn,
the chat template (particularly Qwen3.5's Jinja template) receives an
assistant message where the text content still contains <tool_call> XML.
The template renders it verbatim AND appends the tool_calls field
separately, resulting in duplicate tool calls in the prompt.

The model then receives a prompt with the same tool call twice:
  1. Once from the leaked XML in the text content
  2. Once from the tool_calls field (correct path)

This causes Qwen3.5 to enter an infinite reasoning loop trying to
reconcile which tool call the tool_result corresponds to. All output is
inside <think> tags, which SPECIAL_TOKENS_PATTERN strips to empty
strings — the server logs show '2 chunks total, elapsed=900s' because
nothing visible is produced until max_tokens is exhausted.

Fix: strip <think> and <tool_call> XML from text blocks in
_convert_message() before building the OpenAI-format message. The
tool_use blocks remain intact (correct source of truth for tool calls),
so tool calling continues to work correctly. Text content that is
genuinely part of the response (e.g. 'I will search for X') is
preserved.

Observed symptom: first tool call works, model goes silent for ~15
minutes after receiving tool results (900-second stalls observed).

Affects: Qwen3.5 and any other model using chat templates that
render tool calls from both the text content and the tool_calls field.

The streaming path in _stream_anthropic_messages() sent every model token
immediately as content_block_delta events, including raw <tool_call> and
<think> XML. The tool call parser only ran after the stream completed,
so by then the raw XML was already sent to the client as text content.

Fix: buffer all text during streaming, then parse and clean it before
emitting. The client receives clean text + proper tool_use blocks.

Trade-off: text responses are emitted as one chunk instead of
token-by-token. For most API consumers this has no practical impact
since they wait for message_stop anyway.
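
The buffering approach can be sketched roughly as follows. Function and pattern names are placeholders for the adapter's actual internals, not the real implementation:

```python
import re
from typing import Iterable

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def finalize_stream(token_stream: Iterable[str]) -> tuple[str, list[str]]:
    """Buffer every token, then split the accumulated text into
    clean visible text and raw tool-call payloads."""
    buffered = "".join(token_stream)                # 1. buffer instead of forwarding
    tool_payloads = TOOL_CALL_RE.findall(buffered)  # 2. parse tool calls from the full text
    cleaned = TOOL_CALL_RE.sub("", buffered)        # 3. strip the XML markup
    cleaned = THINK_RE.sub("", cleaned).strip()
    return cleaned, tool_payloads  # emit as one text block + tool_use blocks
```

The key ordering change is that parsing now happens before any content_block_delta is emitted, so the client only ever sees the cleaned text.
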
@janhilgard
Collaborator

Hey @PollyBot13 — thanks for identifying these two bugs (multi-turn XML leak + first-turn streaming leak).

Both are now resolved in main through a different architecture:

Problem and current solution in main:

  • XML in multi-turn history: Reasoning parser (--reasoning-parser qwen3) extracts <think> blocks; the tool parser consumes <tool_call>/<function=...> markup, so only clean text enters the response and subsequent turns.
  • XML leak in streaming: tool_markup_possible flag, tool_accumulated_text buffering, the streaming reasoning parser, tool parser partial-marker buffering (PR #281), and safety nets for leaked markup.

These landed via PR #256 (reasoning parsers, Gemma 4 patches) and PR #278 (production backport with full streaming pipeline).

Closing as superseded — the underlying issues are fully addressed. Thanks for the contribution!

