Fix Anthropic streaming leaking <think> and <tool_call> markup + add tests#132
0age wants to merge 9 commits into waybarrios:main from
Conversation
…rrios#129) Add _AnthropicStreamScrubber to filter model-internal markup from streamed text_delta events on the /v1/messages endpoint. When tools are present in the request, the scrubber strips:
- <think>...</think> reasoning blocks
- <tool_call>...</tool_call> Qwen/Hermes-style tool calls
- <function=NAME>...</function> Llama-style tool calls
- <parameter=NAME>...</parameter> Llama-style parameters
- Stray closing tags outside their expected context

Uses a state machine with a carry buffer to handle tags split across token boundaries. Raw text is still accumulated for tool-call parsing, so structured tool_use blocks emit correctly.
Tests cover all tag suppression patterns (<think>, <tool_call>, <function=NAME>, <parameter=NAME>), stray closing tags, carry buffer behavior, state machine transitions, flush semantics, edge cases, and realistic token-by-token streaming scenarios.
Two improvements to _AnthropicStreamScrubber:
1. Always enable the scrubber for Anthropic streaming, not only when tools are present. Reasoning models emit <think> tags regardless of whether tools are in the request.
2. Make the carry buffer conditional: only retain a suffix when there is a '<' near the tail that could be the start of a split tag. Plain text with no angle brackets now streams with zero latency instead of being held back by CARRY_N chars.

Tests updated accordingly (91 tests, all passing).
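The conditional carry described in point 2 can be sketched as a small helper. This is an illustrative reconstruction, not the code from the PR; `split_carry` and the `CARRY_N` value are assumptions:

```python
CARRY_N = 16  # assumption: illustrative hold-back window


def split_carry(s: str, carry_n: int = CARRY_N) -> tuple[str, str]:
    """Sketch of the conditional carry: hold back a suffix only when a
    '<' near the tail could be the start of a tag split across deltas."""
    tail = s[-carry_n:] if len(s) > carry_n else s
    i = tail.rfind("<")
    if i == -1:
        # No possible split tag: emit everything with zero latency.
        return s, ""
    cut = len(s) - len(tail) + i
    # Emit the safe prefix, carry the potential partial tag.
    return s[:cut], s[cut:]
```

Plain text passes straight through, while a trailing `<fun` is held until the next delta resolves it.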
When clients set thinking: {type: 'enabled'} in /v1/messages requests,
the server now emits proper Anthropic thinking content blocks instead
of suppressing <think> content:
- content_block_start with type 'thinking'
- content_block_delta with thinking_delta events
- content_block_stop when thinking ends
- Then normal text_delta on a separate content block
This enables clients like Cline to render thinking in grey text
while keeping user-facing text on its own channel.
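As a concrete illustration, the block sequence above surfaces to the client roughly as the following event payloads (standard Anthropic Messages streaming shapes; the delta strings here are invented):

```python
# Illustrative Anthropic streaming events for an enabled-thinking turn.
# Index 0 carries thinking deltas, index 1 carries user-facing text.
events = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "thinking", "thinking": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "thinking_delta", "thinking": "Let me check..."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "text_delta", "text": "Here's the answer."}},
]
```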
Implementation:
- AnthropicRequest now accepts a 'thinking' field
- _AnthropicStreamRouter routes <think> content to thinking_delta
(vs _AnthropicStreamScrubber which suppresses it)
- _is_thinking_enabled() detects client preference
- Dynamic content block indexing (thinking=0, text=1 when enabled)
- When thinking is NOT requested, existing scrubber behavior preserved
114 tests passing (91 scrubber + 23 router/thinking).
When thinking is enabled, emit content_block_start for BOTH the thinking block (index 0) and text block (index 1) upfront in message_start, rather than lazily. This matches the Anthropic streaming protocol that clients like Cline expect. Deltas are then routed to the correct index:
- thinking content → index 0, thinking_delta
- text content → index 1, text_delta

The thinking_start signal from the router is now a no-op since the block is already open.
The model's chat template injects <think> into the prompt, so the model's first output IS thinking content — no <think> tag appears in the output, only </think> to end the thinking phase. The router now starts in IN_THINK mode (start_in_thinking=True) when thinking is enabled, so initial output immediately routes to thinking_delta events on index 0.
…<tool_call> markup

Add AnthropicStreamScrubber with state machine to strip model-internal markup from /v1/messages streaming. Add AnthropicStreamRouter for proper thinking block support when the client requests it. 91 scrubber tests + 23 router/thinking tests included.

Cherry-picked from: waybarrios#132
Original author: 0age
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete an unused Python import: 1a27261
Went through the scrubber and router implementation pretty carefully. Overall this is solid work, the state machine approach is clean and the test coverage is impressive (83 tests). Here's what I found though:

- Carry buffer can grow unbounded when a prefix tag never receives its closing '>' (details inline).
- The regexes in flush() miss incomplete prefix tags at end of stream (details inline).
- The router is always constructed with start_in_thinking=True (details inline).
- A standalone <parameter=...> is mishandled by the state machine (details inline).
- Router code duplicates scrubber logic.
- Non-streaming path still leaks.

I also pushed a couple of commits to fix the lint failures (unused pytest import and black formatting).
waybarrios
left a comment
Solid work overall, the state machine is clean and test coverage is great. Left some inline comments on specific things I noticed.
vllm_mlx/server.py
Outdated
# Prefix tag found but closing '>' missing – truncated.
if pos > i:
    out.append(s[i:pos])
self.carry = s[pos:]
If the closing > never arrives (model outputs something like You can use <function= then keeps generating text), this carry grows unbounded with every new delta. Each feed() call appends the new delta to the carry, and since > is still missing, the same path runs again storing everything from pos onward.
Might want to cap this at something like MAX_TAG * 2 and just emit it as literal text if > hasn't shown up by then.
vllm_mlx/server.py
Outdated
# Strip any residual prefix tags (e.g. ``<function=foo>``).
import re

result = re.sub(r"<function=[^>]*>", "", result)
These regexes require a closing > to match. If the stream ends with carry holding something like <function=myFunc (no >), neither regex catches it and the raw markup leaks to the client.
Same issue exists in the router's flush() at line 2030.
Would need an extra pattern like <function=[^>]*$ and <parameter=[^>]*$ to catch partial prefixes at end of stream.
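A sketch of the suggested fix, with both the complete-tag and end-of-stream patterns (the helper name is hypothetical; in the PR this logic lives inside flush()):

```python
import re


def strip_residual_tags(result: str) -> str:
    """Strip prefix-style tags, including ones truncated at end of stream."""
    # Complete prefix tags: <function=foo> / <parameter=bar>
    result = re.sub(r"<function=[^>]*>", "", result)
    result = re.sub(r"<parameter=[^>]*>", "", result)
    # Partial prefixes with no closing '>' at end of stream, e.g. <function=myFunc
    result = re.sub(r"<function=[^>]*$", "", result)
    result = re.sub(r"<parameter=[^>]*$", "", result)
    return result
```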
vllm_mlx/server.py
Outdated
# Use the stream router which yields typed (kind, text) pieces
# that separate thinking content from user-facing text.
router: _AnthropicStreamRouter | None = _AnthropicStreamRouter(
    start_in_thinking=True
This hardcodes start_in_thinking=True assuming every thinking-capable model has its template pre-inject <think>. If a model answers directly without a <think> block, all output gets routed as thinking_delta and the text block stays empty. If it emits <think> explicitly as its first token, the literal string leaks into thinking content since the router is already in IN_THINK mode.
Might be safer to default to False and let the router detect <think> in-stream, unless you're sure all supported models inject it via template.
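In-stream detection is straightforward to sketch. This is a minimal illustration (not the PR's router) that assumes tags arrive unsplit within a single delta; the real implementation also needs the carry buffer for tags split across token boundaries:

```python
def route_pieces(delta: str, mode: str):
    """Route a delta into (kind, text) pieces, detecting <think> in-stream.

    mode is "TEXT" or "IN_THINK"; returns (pieces, new_mode). With this
    approach start_in_thinking can safely default to False.
    """
    pieces = []
    s = delta
    while s:
        if mode == "TEXT":
            i = s.find("<think>")
            if i == -1:
                pieces.append(("text", s))
                s = ""
            else:
                if i:
                    pieces.append(("text", s[:i]))
                s = s[i + len("<think>"):]
                mode = "IN_THINK"
        else:  # IN_THINK
            i = s.find("</think>")
            if i == -1:
                pieces.append(("thinking", s))
                s = ""
            else:
                if i:
                    pieces.append(("thinking", s[:i]))
                s = s[i + len("</think>"):]
                mode = "TEXT"
    return pieces, mode
```

A model that answers directly never enters IN_THINK, and an explicit leading `<think>` token is consumed rather than leaked into the thinking channel.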
vllm_mlx/server.py
Outdated
THINK_OPEN: "IN_THINK",
TOOL_OPEN: "IN_TOOLCALL",
FUNC_PREFIX: "IN_FUNCTION",
PARAM_PREFIX: "IN_FUNCTION",  # parameters inside function blocks
When <parameter=name> shows up standalone (outside a <function> block), it transitions to IN_FUNCTION which only closes on </function>, not </parameter>. So everything after a standalone <parameter= gets eaten until </function> appears or the stream ends.
Might want either a separate IN_PARAMETER state or add </parameter> as an alternative closer for IN_FUNCTION.
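The separate-state option can be encoded by giving each mode its own closer (constant names here are hypothetical, mirroring the table quoted above):

```python
# The fix: a standalone <parameter=...> gets its own state with its own
# closing tag instead of being folded into IN_FUNCTION.
OPEN_TRANSITIONS = {
    "<think>": "IN_THINK",
    "<tool_call>": "IN_TOOLCALL",
    "<function=": "IN_FUNCTION",
    "<parameter=": "IN_PARAMETER",  # was IN_FUNCTION
}
CLOSE_TAG = {
    "IN_THINK": "</think>",
    "IN_TOOLCALL": "</tool_call>",
    "IN_FUNCTION": "</function>",
    "IN_PARAMETER": "</parameter>",
}
```

With this table, a standalone `<parameter=name>` block closes as soon as `</parameter>` arrives, instead of swallowing everything until a `</function>` that may never come.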
vllm_mlx/server.py
Outdated
# <think> into the prompt, so the first output IS thinking content).
self.mode: str = "IN_THINK" if start_in_thinking else "TEXT"
self.carry: str = ""
self._implicit_think = start_in_thinking
This attribute is set here but never read anywhere in the class. Dead code that can be removed.
vllm_mlx/server.py
Outdated
for tag in self._EXACT_TAGS:
    result = result.replace(tag, "")
# Strip any residual prefix tags (e.g. ``<function=foo>``).
import re
re is already used elsewhere in server.py, this should go at the top of the file with the other imports instead of inline inside the method. Same thing at line 2028.
@0age nice work on this one! The scrubber approach is really well thought out and the test coverage is solid. I left a few inline comments on things I noticed while going through it, nothing blocking but worth a look when you get a chance. Also pushed a couple small commits to fix the lint failures (unused import + black formatting). Let me know if you have any questions about the comments!
janhilgard
left a comment
Solid work — the state machine approach is the right call for streaming, and the test coverage is impressive.
A few things beyond what @waybarrios already flagged:
SPECIAL_TOKENS_PATTERN already strips <tool_call> tags
In vllm_mlx/api/utils.py, SPECIAL_TOKENS_PATTERN already matches </?tool_call>|</?tool_call_reasoning>. Since _stream_anthropic_messages applies this pattern before feeding content to the scrubber (via clean_output_text()), the scrubber will never actually see <tool_call> or </tool_call> tags — they're already gone.
This means the scrubber's IN_TOOLCALL mode is dead code in practice. Either:
1. Remove <tool_call> from SPECIAL_TOKENS_PATTERN and let the scrubber handle it (cleaner separation of concerns), or
2. Remove IN_TOOLCALL from the scrubber and document that SPECIAL_TOKENS_PATTERN handles it.
I'd lean toward (1) — the scrubber is the right place for semantic tag stripping, and SPECIAL_TOKENS_PATTERN should only handle tokenizer-level special tokens.
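To make the overlap concrete, here is a reproduction of the pattern's behavior (the regex is copied from the description above; the real definition lives in vllm_mlx/api/utils.py):

```python
import re

# Assumed reproduction of the pattern described above.
SPECIAL_TOKENS_PATTERN = re.compile(r"</?tool_call>|</?tool_call_reasoning>")

# The tags are stripped before the scrubber ever sees the text, so the
# scrubber's IN_TOOLCALL state never fires on them.
cleaned = SPECIAL_TOKENS_PATTERN.sub("", 'x<tool_call>{"name": "f"}</tool_call>y')
```

Note that only the tags are removed; the JSON payload between them survives, which is exactly why the semantic stripping arguably belongs in the scrubber.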
Scrubber/Router duplication
_AnthropicStreamRouter.feed() duplicates ~80% of _AnthropicStreamScrubber.feed(). The only difference is that the router yields ("thinking", text) pieces instead of suppressing. Consider:
- A single class with an emit_thinking: bool parameter, or
- Having the Router subclass the Scrubber, overriding only the IN_THINK handling.
This would halve the code and ensure bug fixes apply to both paths.
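The subclass option can be sketched in a few lines (class and method names here are hypothetical, showing only the shape of the override):

```python
class ScrubberSketch:
    """Shared state machine classifies pieces; subclass decides emission."""

    def _on_thinking(self, text: str) -> list[tuple[str, str]]:
        # Scrubber behavior: thinking content is suppressed entirely.
        return []

    def handle(self, kind: str, text: str) -> list[tuple[str, str]]:
        if kind == "thinking":
            return self._on_thinking(text)
        return [("text", text)]


class RouterSketch(ScrubberSketch):
    def _on_thinking(self, text: str) -> list[tuple[str, str]]:
        # Router behavior: surface thinking instead of suppressing it.
        return [("thinking", text)]
```

Any fix to the shared state machine then applies to both paths automatically.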
Carry growth cap for prefix tags
To expand on @waybarrios's unbounded carry comment: when <function= is found but > never arrives, each feed() call does s = self.carry + delta, finds the prefix again at position 0, and stores everything from there as carry. A simple fix:
if len(self.carry) > self.CARRY_N + 256:
    # Prefix tag never closed — treat as plain text
    out.append(self.carry)
    self.carry = ""

Overall this is a welcome fix for a real usability problem. The core logic is sound — the issues above are about reducing surface area and hardening edge cases.
- Refactor _AnthropicStreamRouter as a subclass of _AnthropicStreamScrubber, eliminating ~120 lines of duplicated state-machine logic via shared _feed_pieces()/_flush_pieces() methods.
- Add a dedicated IN_PARAMETER state so <parameter=x>...</parameter> closes on </parameter> instead of incorrectly requiring </function>.
- Add a MAX_CARRY cap to prevent unbounded carry buffer growth when a prefix tag like <function= never receives its closing >.
- Fix flush() to strip incomplete prefix tags using regex.
- Move 'import re' to module level instead of inside flush().
- Change the start_in_thinking default from True to False.
- Add comprehensive tests for the carry cap, flush with incomplete prefix tags, the IN_PARAMETER state, and Router-inherits-Scrubber verification.
@waybarrios & @janhilgard thanks for the reviews! Let me know if there's anything else you think needs to be addressed.
This PR implements #129 and includes test coverage.
Fix (6805baf): Adds _AnthropicStreamScrubber, a stateful stream scrubber that strips <think>…</think>, <tool_call>…</tool_call>, <function=NAME>…, and <parameter=NAME>… markup from Anthropic /v1/messages streaming text deltas. When tools are present in the request, the scrubber is activated so clients only see clean text and structured tool_use blocks. Handles tags split across token boundaries via a carry buffer.
Tests (c1f7cac): Adds 83 unit tests for _AnthropicStreamScrubber covering:
- Suppression of all tag types (single delta, split across boundaries, nested)
- Stray closing tag removal
- Carry buffer correctness when tags are split character-by-character
- State machine transitions (TEXT ↔ IN_THINK / IN_TOOLCALL / IN_FUNCTION)
- flush() semantics (emit vs discard based on mode, residual tag stripping)
- Edge cases (empty blocks, unicode, very long content, angle brackets in plain text)
- Realistic token-by-token streaming simulations
Pure logic tests with no model loading required.