
Fix Anthropic streaming leaking <think> and <tool_call> markup + add tests#132

Open
0age wants to merge 9 commits into waybarrios:main from 0age:main

Conversation


@0age 0age commented Mar 2, 2026

This PR implements #129 and includes test coverage.

Fix (6805baf): Adds _AnthropicStreamScrubber, a stateful stream scrubber that strips <think>…</think>, <tool_call>…</tool_call>, <function=NAME>…, and <parameter=NAME>… markup from Anthropic /v1/messages streaming text deltas. When tools are present in the request, the scrubber is activated so clients only see clean text and structured tool_use blocks. Tags split across token boundaries are handled via a carry buffer.

Tests (c1f7cac): Adds 83 unit tests for _AnthropicStreamScrubber covering:

- Suppression of all tag types (single delta, split across boundaries, nested)
- Stray closing tag removal
- Carry buffer correctness when tags are split character-by-character
- State machine transitions (TEXT ↔ IN_THINK / IN_TOOLCALL / IN_FUNCTION)
- flush() semantics (emit vs discard based on mode, residual tag stripping)
- Edge cases (empty blocks, unicode, very long content, angle brackets in plain text)
- Realistic token-by-token streaming simulations

All are pure logic tests with no model loading required.
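The mechanism under test (a state machine plus a carry buffer for tags split across deltas) can be sketched as a minimal single-tag version. `ThinkScrubber`, its constants, and its method names are illustrative only, not the PR's actual `_AnthropicStreamScrubber`:

```python
class ThinkScrubber:
    """Minimal sketch: suppress <think>...</think> in a streamed text feed."""

    OPEN, CLOSE = "<think>", "</think>"
    CARRY_N = len("</think>") - 1  # longest prefix a split tag can leave behind

    def __init__(self):
        self.in_think = False
        self.carry = ""  # possible split-tag prefix held between deltas

    def feed(self, delta: str) -> str:
        s, out = self.carry + delta, []
        self.carry = ""
        i = 0
        while i < len(s):
            tag = self.CLOSE if self.in_think else self.OPEN
            pos = s.find(tag, i)
            if pos != -1:
                if not self.in_think:
                    out.append(s[i:pos])  # emit visible text before the tag
                self.in_think = not self.in_think
                i = pos + len(tag)
                continue
            # No complete tag: hold back a tail that could start a split tag.
            tail = s.rfind("<", max(i, len(s) - self.CARRY_N))
            if tail != -1:
                if not self.in_think:
                    out.append(s[i:tail])
                self.carry = s[tail:]
            elif not self.in_think:
                out.append(s[i:])
            break
        return "".join(out)

    def flush(self) -> str:
        # End of stream: emit held text only when outside a think block.
        return "" if self.in_think else self.carry
```

For example, `feed("Hello <thi")` emits `"Hello "` and holds `"<thi"` in the carry until the next delta resolves it into a tag or plain text.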

0age added 6 commits March 1, 2026 15:57
…rrios#129)

Add _AnthropicStreamScrubber to filter model-internal markup from
streamed text_delta events on the /v1/messages endpoint.

When tools are present in the request, the scrubber strips:
- <think>...</think> reasoning blocks
- <tool_call>...</tool_call> Qwen/Hermes-style tool calls
- <function=NAME>...</function> Llama-style tool calls
- <parameter=NAME>...</parameter> Llama-style parameters
- Stray closing tags outside their expected context

Uses a state machine with a carry buffer to handle tags split
across token boundaries. Raw text is still accumulated for
tool-call parsing, so structured tool_use blocks emit correctly.
Tests cover all tag suppression patterns (<think>, <tool_call>,
<function=NAME>, <parameter=NAME>), stray closing tags, carry buffer
behavior, state machine transitions, flush semantics, edge cases,
and realistic token-by-token streaming scenarios.
Two improvements to _AnthropicStreamScrubber:

1. Always enable the scrubber for Anthropic streaming, not only when
   tools are present. Reasoning models emit <think> tags regardless
   of whether tools are in the request.

2. Make carry buffer conditional: only retain a suffix when there is
   a '<' near the tail that could be the start of a split tag.
   Plain text with no angle brackets now streams with zero latency
   instead of being held back by CARRY_N chars.

Tests updated accordingly (91 tests, all passing).
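The conditional carry in point 2 can be sketched as a standalone helper (illustrative names and an assumed `CARRY_N`; not the PR's exact code):

```python
CARRY_N = 16  # assumed window: longest tag length minus one

def split_for_carry(s: str) -> tuple[str, str]:
    """Return (emit_now, carry). Plain text with no '<' near the tail
    streams immediately; only a potential split-tag prefix is held."""
    window_start = max(0, len(s) - CARRY_N)
    pos = s.rfind("<", window_start)
    if pos == -1:
        return s, ""          # nothing tag-like near the tail: zero latency
    return s[:pos], s[pos:]   # hold the possible split-tag prefix
```

So a delta like `"Hello world"` is emitted whole, while `"Hello <fun"` emits `"Hello "` and carries `"<fun"` into the next feed.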
When clients set thinking: {type: 'enabled'} in /v1/messages requests,
the server now emits proper Anthropic thinking content blocks instead
of suppressing <think> content:

- content_block_start with type 'thinking'
- content_block_delta with thinking_delta events
- content_block_stop when thinking ends
- Then normal text_delta on a separate content block

This enables clients like Cline to render thinking in grey text
while keeping user-facing text on its own channel.

Implementation:
- AnthropicRequest now accepts a 'thinking' field
- _AnthropicStreamRouter routes <think> content to thinking_delta
  (vs _AnthropicStreamScrubber which suppresses it)
- _is_thinking_enabled() detects client preference
- Dynamic content block indexing (thinking=0, text=1 when enabled)
- When thinking is NOT requested, existing scrubber behavior preserved

114 tests passing (91 scrubber + 23 router/thinking).
When thinking is enabled, emit content_block_start for BOTH the
thinking block (index 0) and text block (index 1) upfront in
message_start, rather than lazily. This matches the Anthropic
streaming protocol that clients like Cline expect.

Deltas are then routed to the correct index:
  - thinking content → index 0, thinking_delta
  - text content → index 1, text_delta

The thinking_start signal from the router is now a no-op since
the block is already open.
The model's chat template injects <think> into the prompt, so the
model's first output IS thinking content — no <think> tag appears
in the output, only </think> to end the thinking phase.

The router now starts in IN_THINK mode (start_in_thinking=True)
when thinking is enabled, so initial output immediately routes to
thinking_delta events on index 0.
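Under the template-pre-injects-<think> assumption, the routing can be sketched as follows. This is a simplified whole-string version with illustrative names; the real `_AnthropicStreamRouter` works delta-by-delta and handles tags split across deltas:

```python
def route_deltas(deltas, start_in_thinking=True):
    """Yield (kind, text) pieces, starting in thinking mode so initial
    output routes to thinking_delta until </think> flips to text."""
    mode = "thinking" if start_in_thinking else "text"
    buf = "".join(deltas)  # simplified: real code processes each delta
    while buf:
        if mode == "thinking":
            pos = buf.find("</think>")
            if pos == -1:
                yield ("thinking", buf)
                return
            if pos:
                yield ("thinking", buf[:pos])
            buf = buf[pos + len("</think>"):]
            mode = "text"
        else:
            yield ("text", buf)
            return
```

The caller would map `("thinking", ...)` pieces to thinking_delta events on index 0 and `("text", ...)` pieces to text_delta events on index 1.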
jackzampolin added a commit to jackzampolin/vllm-mlx that referenced this pull request Mar 11, 2026
…<tool_call> markup

Add AnthropicStreamScrubber with state machine to strip model-internal
markup from /v1/messages streaming. Add AnthropicStreamRouter for proper
thinking block support when client requests it.

91 scrubber tests + 23 router/thinking tests included.

Cherry-picked from: waybarrios#132
Original author: 0age

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waybarrios waybarrios self-requested a review March 12, 2026 17:15
@waybarrios waybarrios self-assigned this Mar 12, 2026
@waybarrios
Owner

Deleted an unused Python import: 1a27261

@waybarrios
Owner

Went through the scrubber and router implementation pretty carefully. Overall this is solid work, the state machine approach is clean and the test coverage is impressive (83 tests). Here's what I found though:

Carry buffer can grow unbounded

When _find_earliest_marker finds a prefix tag like <function= but the closing > never shows up, the carry keeps accumulating every new delta indefinitely. A model that outputs something like You can use <function= followed by a bunch of text without ever closing the > will just keep growing the buffer. Should probably cap it at something like MAX_TAG * 2 and emit it as literal text if the > doesn't arrive.

flush() leaks incomplete prefix tags

The regexes in flush() require a closing > to match (<function=[^>]*>). If the stream ends with the carry holding something like <function= or <function=myFunc without the >, neither regex matches and the raw markup goes straight to the client. Same issue in _AnthropicStreamRouter.flush(). Would need an extra pattern like <function=[^>]*$ and <parameter=[^>]*$ to catch those.
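The suggested fix can be sketched like this; `strip_residual_prefix_tags` is a hypothetical helper, not code from the PR:

```python
import re

def strip_residual_prefix_tags(text: str) -> str:
    """Strip complete prefix tags plus incomplete ones at end of stream
    (a sketch of the fix suggested above, not the PR's final code)."""
    text = re.sub(r"<function=[^>]*>", "", text)
    text = re.sub(r"<parameter=[^>]*>", "", text)
    # The $-anchored patterns catch a prefix the stream cut off mid-tag.
    text = re.sub(r"<function=[^>]*$", "", text)
    text = re.sub(r"<parameter=[^>]*$", "", text)
    return text
```

With only the first two patterns, `"done <function=myFunc"` would leak unchanged; the anchored patterns reduce it to `"done "`.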

start_in_thinking=True is always hardcoded

The router is always constructed with start_in_thinking=True when thinking is enabled. This assumes every thinking-capable model has its template pre-inject <think>. If a model answers directly without thinking, everything gets routed as thinking_delta and the text block stays empty. If it emits <think> explicitly, that literal string leaks into the thinking content since the router is already in IN_THINK mode.

<parameter=name> standalone suppresses all subsequent text

When <parameter=name> appears in TEXT mode it transitions to IN_FUNCTION, which only closes on </function> not </parameter>. So everything after a standalone <parameter= gets eaten until </function> shows up or the stream ends. The test acknowledges this but doesn't assert the suppressed content. Might want a separate IN_PARAMETER state or at least add </parameter> as an alternative closer.
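The alternative-closer option could look roughly like this (hypothetical table and helper, not the PR's code):

```python
# Hypothetical closer table: IN_FUNCTION accepts either closing tag,
# so a standalone <parameter=...> no longer swallows text until
# </function> or end of stream.
CLOSERS = {
    "IN_THINK": ("</think>",),
    "IN_TOOLCALL": ("</tool_call>",),
    "IN_FUNCTION": ("</function>", "</parameter>"),
}

def find_closer(mode: str, s: str, start: int = 0):
    """Return (pos, tag) of the earliest applicable closing tag, or None."""
    hits = [(s.find(t, start), t) for t in CLOSERS[mode]]
    hits = [(p, t) for p, t in hits if p != -1]
    return min(hits) if hits else None
```

A dedicated IN_PARAMETER state would be the cleaner variant, but the alternative-closer table is the smaller change.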

_implicit_think attribute is set but never read

self._implicit_think = start_in_thinking in the router constructor is never referenced anywhere in the class.

import re inside method bodies

Both flush() methods import re inline. The module is already used elsewhere in server.py so it should just go at the top level.

Router code duplicates scrubber logic

The router's feed() is basically a copy of the scrubber's feed() with the think branch swapped for thinking_start/thinking/thinking_stop emission. The only shared piece is _find_earliest_marker delegated through a scrubber instance. Any future tag constant or carry buffer change has to be applied in two places. Not blocking but worth noting for maintainability.
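The subclassing option might be sketched like this (hypothetical names, heavily simplified to show only the override point):

```python
class StreamScrubber:
    """Base: owns the shared state machine; suppresses think content."""

    def _on_think(self, text: str) -> list:
        return []  # base behavior: drop reasoning content

    def feed(self, delta: str) -> list:
        # (the shared state-machine and carry logic would live here;
        # reduced to a single complete-tag case for illustration)
        if delta.startswith("<think>") and delta.endswith("</think>"):
            return self._on_think(delta[len("<think>"):-len("</think>")])
        return [("text", delta)]

class StreamRouter(StreamScrubber):
    """Override only the think branch: emit typed pieces instead."""

    def _on_think(self, text: str) -> list:
        return [("thinking", text)]
```

With this shape, tag constants and carry-buffer fixes land once in the base class and apply to both paths.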

Non-streaming path still leaks <think> tags

The non-streaming /v1/messages path uses clean_output_text which explicitly keeps <think> blocks intact. So non-streaming Anthropic clients with reasoning models will still see the raw markup. This predates the PR but worth being aware of.

I also pushed a couple of commits to fix the lint failures (unused pytest import and black formatting).

Owner

@waybarrios waybarrios left a comment


Solid work overall, the state machine is clean and test coverage is great. Left some inline comments on specific things I noticed.

# Prefix tag found but closing '>' missing – truncated.
if pos > i:
    out.append(s[i:pos])
self.carry = s[pos:]

If the closing > never arrives (model outputs something like You can use <function= then keeps generating text), this carry grows unbounded with every new delta. Each feed() call appends the new delta to the carry, and since > is still missing, the same path runs again storing everything from pos onward.

Might want to cap this at something like MAX_TAG * 2 and just emit it as literal text if > hasn't shown up by then.

# Strip any residual prefix tags (e.g. ``<function=foo>``).
import re

result = re.sub(r"<function=[^>]*>", "", result)

These regexes require a closing > to match. If the stream ends with carry holding something like <function=myFunc (no >), neither regex catches it and the raw markup leaks to the client.

Same issue exists in the router's flush() at line 2030.

Would need an extra pattern like <function=[^>]*$ and <parameter=[^>]*$ to catch partial prefixes at end of stream.

# Use the stream router which yields typed (kind, text) pieces
# that separate thinking content from user-facing text.
router: _AnthropicStreamRouter | None = _AnthropicStreamRouter(
    start_in_thinking=True

This hardcodes start_in_thinking=True assuming every thinking-capable model has its template pre-inject <think>. If a model answers directly without a <think> block, all output gets routed as thinking_delta and the text block stays empty. If it emits <think> explicitly as its first token, the literal string leaks into thinking content since the router is already in IN_THINK mode.

Might be safer to default to False and let the router detect <think> in-stream, unless you're sure all supported models inject it via template.

THINK_OPEN: "IN_THINK",
TOOL_OPEN: "IN_TOOLCALL",
FUNC_PREFIX: "IN_FUNCTION",
PARAM_PREFIX: "IN_FUNCTION", # parameters inside function blocks

When <parameter=name> shows up standalone (outside a <function> block), it transitions to IN_FUNCTION which only closes on </function>, not </parameter>. So everything after a standalone <parameter= gets eaten until </function> appears or the stream ends.

Might want either a separate IN_PARAMETER state or add </parameter> as an alternative closer for IN_FUNCTION.

# <think> into the prompt, so the first output IS thinking content).
self.mode: str = "IN_THINK" if start_in_thinking else "TEXT"
self.carry: str = ""
self._implicit_think = start_in_thinking

This attribute is set here but never read anywhere in the class. Dead code that can be removed.

for tag in self._EXACT_TAGS:
    result = result.replace(tag, "")
# Strip any residual prefix tags (e.g. ``<function=foo>``).
import re

re is already used elsewhere in server.py, this should go at the top of the file with the other imports instead of inline inside the method. Same thing at line 2028.

@waybarrios
Owner

@0age nice work on this one! The scrubber approach is really well thought out and the test coverage is solid. I left a few inline comments on things I noticed while going through it, nothing blocking but worth a look when you get a chance. Also pushed a couple small commits to fix the lint failures (unused import + black formatting). Let me know if you have any questions about the comments!

Collaborator

@janhilgard janhilgard left a comment


Solid work — the state machine approach is the right call for streaming, and the test coverage is impressive.

A few things beyond what @waybarrios already flagged:

SPECIAL_TOKENS_PATTERN already strips <tool_call> tags

In vllm_mlx/api/utils.py, SPECIAL_TOKENS_PATTERN already matches </?tool_call>|</?tool_call_reasoning>. Since _stream_anthropic_messages applies this pattern before feeding content to the scrubber (via clean_output_text()), the scrubber will never actually see <tool_call> or </tool_call> tags — they're already gone.

This means the scrubber's IN_TOOLCALL mode is dead code in practice. Either:

  1. Remove <tool_call> from SPECIAL_TOKENS_PATTERN and let the scrubber handle it (cleaner separation of concerns), or
  2. Remove IN_TOOLCALL from the scrubber and document that SPECIAL_TOKENS_PATTERN handles it

I'd lean toward (1) — the scrubber is the right place for semantic tag stripping, and SPECIAL_TOKENS_PATTERN should only handle tokenizer-level special tokens.
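A quick demonstration of the ordering problem, using the pattern as quoted above (this may differ from the full pattern in vllm_mlx/api/utils.py):

```python
import re

# Pattern as quoted in the comment above (possibly a subset of the
# file's real SPECIAL_TOKENS_PATTERN).
SPECIAL_TOKENS_PATTERN = re.compile(r"</?tool_call>|</?tool_call_reasoning>")

# If this runs before the scrubber, the tags are already stripped, so
# the scrubber's IN_TOOLCALL mode never triggers on this input.
cleaned = SPECIAL_TOKENS_PATTERN.sub(
    "", 'model text <tool_call>{"name": "f"}</tool_call>'
)
```

After this substitution the scrubber sees only `model text {"name": "f"}` with no `<tool_call>` markers left to match.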

Scrubber/Router duplication

_AnthropicStreamRouter.feed() duplicates ~80% of _AnthropicStreamScrubber.feed(). The only difference is that the router yields ("thinking", text) pieces instead of suppressing. Consider:

  • A single class with an emit_thinking: bool parameter
  • Or have the Router subclass the Scrubber, overriding only the IN_THINK handling

This would halve the code and ensure bug fixes apply to both paths.

Carry growth cap for prefix tags

To expand on @waybarrios's unbounded carry comment: when <function= is found but > never arrives, each feed() call does s = self.carry + delta, finds the prefix again at position 0, and stores everything from there as carry. A simple fix:

if len(self.carry) > self.CARRY_N + 256:
    # Prefix tag never closed — treat as plain text
    out.append(self.carry)
    self.carry = ""

Overall this is a welcome fix for a real usability problem. The core logic is sound — the issues above are about reducing surface area and hardening edge cases.

- Refactor _AnthropicStreamRouter as subclass of _AnthropicStreamScrubber,
  eliminating ~120 lines of duplicated state-machine logic via shared
  _feed_pieces()/_flush_pieces() methods.

- Add dedicated IN_PARAMETER state so <parameter=x>...</parameter> closes
  on </parameter> instead of incorrectly requiring </function>.

- Add MAX_CARRY cap to prevent unbounded carry buffer growth when a prefix
  tag like <function= never receives its closing >.

- Fix flush() to strip incomplete prefix tags using regex.

- Move 'import re' to module-level instead of inside flush().

- Change start_in_thinking default from True to False.

- Add comprehensive tests for carry cap, flush with incomplete prefix tags,
  IN_PARAMETER state, and Router-inherits-Scrubber verification.
@0age
Author

0age commented Mar 25, 2026

@waybarrios & @janhilgard thanks for the reviews! let me know if there's anything else you think needs to be addressed

