Fix Anthropic streaming leaking <think> and <tool_call> markup + add tests#132
0age wants to merge 9 commits into waybarrios:main from
Conversation
…rrios#129) Add _AnthropicStreamScrubber to filter model-internal markup from streamed text_delta events on the /v1/messages endpoint. When tools are present in the request, the scrubber strips:
- <think>...</think> reasoning blocks
- <tool_call>...</tool_call> Qwen/Hermes-style tool calls
- <function=NAME>...</function> Llama-style tool calls
- <parameter=NAME>...</parameter> Llama-style parameters
- Stray closing tags outside their expected context

Uses a state machine with a carry buffer to handle tags split across token boundaries. Raw text is still accumulated for tool-call parsing, so structured tool_use blocks emit correctly.
Tests cover all tag suppression patterns (<think>, <tool_call>, <function=NAME>, <parameter=NAME>), stray closing tags, carry buffer behavior, state machine transitions, flush semantics, edge cases, and realistic token-by-token streaming scenarios.
Two improvements to _AnthropicStreamScrubber:
1. Always enable the scrubber for Anthropic streaming, not only when tools are present. Reasoning models emit <think> tags regardless of whether tools are in the request.
2. Make the carry buffer conditional: only retain a suffix when there is a '<' near the tail that could be the start of a split tag. Plain text with no angle brackets now streams with zero latency instead of being held back by CARRY_N chars.

Tests updated accordingly (91 tests, all passing).
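The conditional carry described in point 2 can be sketched as a small helper. This is an illustrative reconstruction, not the code from the PR; `split_carry` and the `CARRY_N` value are assumptions:

```python
CARRY_N = 16  # assumption: illustrative hold-back window


def split_carry(s: str, carry_n: int = CARRY_N) -> tuple[str, str]:
    """Sketch of the conditional carry: hold back a suffix only when a
    '<' near the tail could be the start of a tag split across deltas."""
    tail = s[-carry_n:] if len(s) > carry_n else s
    i = tail.rfind("<")
    if i == -1:
        # No possible split tag: emit everything with zero latency.
        return s, ""
    cut = len(s) - len(tail) + i
    # Emit the safe prefix, carry the potential partial tag.
    return s[:cut], s[cut:]
```

Plain text passes straight through, while a trailing `<fun` is held until the next delta resolves it.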
When clients set thinking: {type: 'enabled'} in /v1/messages requests,
the server now emits proper Anthropic thinking content blocks instead
of suppressing <think> content:
- content_block_start with type 'thinking'
- content_block_delta with thinking_delta events
- content_block_stop when thinking ends
- Then normal text_delta on a separate content block
This enables clients like Cline to render thinking in grey text
while keeping user-facing text on its own channel.
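As a concrete illustration, the block sequence above surfaces to the client roughly as the following event payloads (standard Anthropic Messages streaming shapes; the delta strings here are invented):

```python
# Illustrative Anthropic streaming events for an enabled-thinking turn.
# Index 0 carries thinking deltas, index 1 carries user-facing text.
events = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "thinking", "thinking": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "thinking_delta", "thinking": "Let me check..."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "text_delta", "text": "Here's the answer."}},
]
```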
Implementation:
- AnthropicRequest now accepts a 'thinking' field
- _AnthropicStreamRouter routes <think> content to thinking_delta
(vs _AnthropicStreamScrubber which suppresses it)
- _is_thinking_enabled() detects client preference
- Dynamic content block indexing (thinking=0, text=1 when enabled)
- When thinking is NOT requested, existing scrubber behavior preserved
114 tests passing (91 scrubber + 23 router/thinking).
When thinking is enabled, emit content_block_start for BOTH the thinking block (index 0) and text block (index 1) upfront in message_start, rather than lazily. This matches the Anthropic streaming protocol that clients like Cline expect. Deltas are then routed to the correct index:
- thinking content → index 0, thinking_delta
- text content → index 1, text_delta

The thinking_start signal from the router is now a no-op since the block is already open.
The model's chat template injects <think> into the prompt, so the model's first output IS thinking content — no <think> tag appears in the output, only </think> to end the thinking phase. The router now starts in IN_THINK mode (start_in_thinking=True) when thinking is enabled, so initial output immediately routes to thinking_delta events on index 0.
…<tool_call> markup

Add AnthropicStreamScrubber with state machine to strip model-internal markup from /v1/messages streaming. Add AnthropicStreamRouter for proper thinking block support when the client requests it. 91 scrubber tests + 23 router/thinking tests included.

Cherry-picked from: waybarrios#132
Original author: 0age
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete an unused Python import: 1a27261
Went through the scrubber and router implementation pretty carefully. Overall this is solid work, the state machine approach is clean and the test coverage is impressive (83 tests). Here's what I found though:

- Carry buffer can grow unbounded when a prefix tag never receives its closing '>' (details inline).
- The regexes in flush() miss incomplete prefix tags at end of stream (details inline).
- The router is always constructed with start_in_thinking=True (details inline).
- A standalone <parameter=...> is mishandled by the state machine (details inline).
- Router code duplicates scrubber logic.
- Non-streaming path still leaks.

I also pushed a couple of commits to fix the lint failures (unused pytest import and black formatting).
waybarrios
left a comment
Solid work overall, the state machine is clean and test coverage is great. Left some inline comments on specific things I noticed.
vllm_mlx/server.py
Outdated
# Prefix tag found but closing '>' missing – truncated.
if pos > i:
    out.append(s[i:pos])
self.carry = s[pos:]
If the closing > never arrives (model outputs something like You can use <function= then keeps generating text), this carry grows unbounded with every new delta. Each feed() call appends the new delta to the carry, and since > is still missing, the same path runs again storing everything from pos onward.
Might want to cap this at something like MAX_TAG * 2 and just emit it as literal text if > hasn't shown up by then.
vllm_mlx/server.py
Outdated
# Strip any residual prefix tags (e.g. ``<function=foo>``).
import re

result = re.sub(r"<function=[^>]*>", "", result)
These regexes require a closing > to match. If the stream ends with carry holding something like <function=myFunc (no >), neither regex catches it and the raw markup leaks to the client.
Same issue exists in the router's flush() at line 2030.
Would need an extra pattern like <function=[^>]*$ and <parameter=[^>]*$ to catch partial prefixes at end of stream.
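A sketch of the suggested fix, with both the complete-tag and end-of-stream patterns (the helper name is hypothetical; in the PR this logic lives inside flush()):

```python
import re


def strip_residual_tags(result: str) -> str:
    """Strip prefix-style tags, including ones truncated at end of stream."""
    # Complete prefix tags: <function=foo> / <parameter=bar>
    result = re.sub(r"<function=[^>]*>", "", result)
    result = re.sub(r"<parameter=[^>]*>", "", result)
    # Partial prefixes with no closing '>' at end of stream, e.g. <function=myFunc
    result = re.sub(r"<function=[^>]*$", "", result)
    result = re.sub(r"<parameter=[^>]*$", "", result)
    return result
```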
vllm_mlx/server.py
Outdated
# Use the stream router which yields typed (kind, text) pieces
# that separate thinking content from user-facing text.
router: _AnthropicStreamRouter | None = _AnthropicStreamRouter(
    start_in_thinking=True
This hardcodes start_in_thinking=True assuming every thinking-capable model has its template pre-inject <think>. If a model answers directly without a <think> block, all output gets routed as thinking_delta and the text block stays empty. If it emits <think> explicitly as its first token, the literal string leaks into thinking content since the router is already in IN_THINK mode.
Might be safer to default to False and let the router detect <think> in-stream, unless you're sure all supported models inject it via template.
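In-stream detection is straightforward to sketch. This is a minimal illustration (not the PR's router) that assumes tags arrive unsplit within a single delta; the real implementation also needs the carry buffer for tags split across token boundaries:

```python
def route_pieces(delta: str, mode: str):
    """Route a delta into (kind, text) pieces, detecting <think> in-stream.

    mode is "TEXT" or "IN_THINK"; returns (pieces, new_mode). With this
    approach start_in_thinking can safely default to False.
    """
    pieces = []
    s = delta
    while s:
        if mode == "TEXT":
            i = s.find("<think>")
            if i == -1:
                pieces.append(("text", s))
                s = ""
            else:
                if i:
                    pieces.append(("text", s[:i]))
                s = s[i + len("<think>"):]
                mode = "IN_THINK"
        else:  # IN_THINK
            i = s.find("</think>")
            if i == -1:
                pieces.append(("thinking", s))
                s = ""
            else:
                if i:
                    pieces.append(("thinking", s[:i]))
                s = s[i + len("</think>"):]
                mode = "TEXT"
    return pieces, mode
```

A model that answers directly never enters IN_THINK, and an explicit leading `<think>` token is consumed rather than leaked into the thinking channel.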
vllm_mlx/server.py
Outdated
THINK_OPEN: "IN_THINK",
TOOL_OPEN: "IN_TOOLCALL",
FUNC_PREFIX: "IN_FUNCTION",
PARAM_PREFIX: "IN_FUNCTION",  # parameters inside function blocks
When <parameter=name> shows up standalone (outside a <function> block), it transitions to IN_FUNCTION which only closes on </function>, not </parameter>. So everything after a standalone <parameter= gets eaten until </function> appears or the stream ends.
Might want either a separate IN_PARAMETER state or add </parameter> as an alternative closer for IN_FUNCTION.
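The separate-state option can be encoded by giving each mode its own closer (constant names here are hypothetical, mirroring the table quoted above):

```python
# The fix: a standalone <parameter=...> gets its own state with its own
# closing tag instead of being folded into IN_FUNCTION.
OPEN_TRANSITIONS = {
    "<think>": "IN_THINK",
    "<tool_call>": "IN_TOOLCALL",
    "<function=": "IN_FUNCTION",
    "<parameter=": "IN_PARAMETER",  # was IN_FUNCTION
}
CLOSE_TAG = {
    "IN_THINK": "</think>",
    "IN_TOOLCALL": "</tool_call>",
    "IN_FUNCTION": "</function>",
    "IN_PARAMETER": "</parameter>",
}
```

With this table, a standalone `<parameter=name>` block closes as soon as `</parameter>` arrives, instead of swallowing everything until a `</function>` that may never come.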
vllm_mlx/server.py
Outdated
# <think> into the prompt, so the first output IS thinking content).
self.mode: str = "IN_THINK" if start_in_thinking else "TEXT"
self.carry: str = ""
self._implicit_think = start_in_thinking
This attribute is set here but never read anywhere in the class. Dead code that can be removed.
vllm_mlx/server.py
Outdated
for tag in self._EXACT_TAGS:
    result = result.replace(tag, "")
# Strip any residual prefix tags (e.g. ``<function=foo>``).
import re
re is already used elsewhere in server.py, this should go at the top of the file with the other imports instead of inline inside the method. Same thing at line 2028.
@0age nice work on this one! The scrubber approach is really well thought out and the test coverage is solid. I left a few inline comments on things I noticed while going through it, nothing blocking but worth a look when you get a chance. Also pushed a couple small commits to fix the lint failures (unused import + black formatting). Let me know if you have any questions about the comments!
janhilgard
left a comment
Solid work — the state machine approach is the right call for streaming, and the test coverage is impressive.
A few things beyond what @waybarrios already flagged:
SPECIAL_TOKENS_PATTERN already strips <tool_call> tags
In vllm_mlx/api/utils.py, SPECIAL_TOKENS_PATTERN already matches </?tool_call>|</?tool_call_reasoning>. Since _stream_anthropic_messages applies this pattern before feeding content to the scrubber (via clean_output_text()), the scrubber will never actually see <tool_call> or </tool_call> tags — they're already gone.
This means the scrubber's IN_TOOLCALL mode is dead code in practice. Either:
1. Remove <tool_call> from SPECIAL_TOKENS_PATTERN and let the scrubber handle it (cleaner separation of concerns), or
2. Remove IN_TOOLCALL from the scrubber and document that SPECIAL_TOKENS_PATTERN handles it.
I'd lean toward (1) — the scrubber is the right place for semantic tag stripping, and SPECIAL_TOKENS_PATTERN should only handle tokenizer-level special tokens.
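To make the overlap concrete, here is a reproduction of the pattern's behavior (the regex is copied from the description above; the real definition lives in vllm_mlx/api/utils.py):

```python
import re

# Assumed reproduction of the pattern described above.
SPECIAL_TOKENS_PATTERN = re.compile(r"</?tool_call>|</?tool_call_reasoning>")

# The tags are stripped before the scrubber ever sees the text, so the
# scrubber's IN_TOOLCALL state never fires on them.
cleaned = SPECIAL_TOKENS_PATTERN.sub("", 'x<tool_call>{"name": "f"}</tool_call>y')
```

Note that only the tags are removed; the JSON payload between them survives, which is exactly why the semantic stripping arguably belongs in the scrubber.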
Scrubber/Router duplication
_AnthropicStreamRouter.feed() duplicates ~80% of _AnthropicStreamScrubber.feed(). The only difference is that the router yields ("thinking", text) pieces instead of suppressing. Consider:
- A single class with an emit_thinking: bool parameter, or
- Having the Router subclass the Scrubber, overriding only the IN_THINK handling.
This would halve the code and ensure bug fixes apply to both paths.
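The subclass option can be sketched in a few lines (class and method names here are hypothetical, showing only the shape of the override):

```python
class ScrubberSketch:
    """Shared state machine classifies pieces; subclass decides emission."""

    def _on_thinking(self, text: str) -> list[tuple[str, str]]:
        # Scrubber behavior: thinking content is suppressed entirely.
        return []

    def handle(self, kind: str, text: str) -> list[tuple[str, str]]:
        if kind == "thinking":
            return self._on_thinking(text)
        return [("text", text)]


class RouterSketch(ScrubberSketch):
    def _on_thinking(self, text: str) -> list[tuple[str, str]]:
        # Router behavior: surface thinking instead of suppressing it.
        return [("thinking", text)]
```

Any fix to the shared state machine then applies to both paths automatically.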
Carry growth cap for prefix tags
To expand on @waybarrios's unbounded carry comment: when <function= is found but > never arrives, each feed() call does s = self.carry + delta, finds the prefix again at position 0, and stores everything from there as carry. A simple fix:
if len(self.carry) > self.CARRY_N + 256:
    # Prefix tag never closed — treat as plain text
    out.append(self.carry)
    self.carry = ""

Overall this is a welcome fix for a real usability problem. The core logic is sound — the issues above are about reducing surface area and hardening edge cases.
- Refactor _AnthropicStreamRouter as a subclass of _AnthropicStreamScrubber, eliminating ~120 lines of duplicated state-machine logic via shared _feed_pieces()/_flush_pieces() methods.
- Add a dedicated IN_PARAMETER state so <parameter=x>...</parameter> closes on </parameter> instead of incorrectly requiring </function>.
- Add a MAX_CARRY cap to prevent unbounded carry buffer growth when a prefix tag like <function= never receives its closing >.
- Fix flush() to strip incomplete prefix tags using regex.
- Move 'import re' to module level instead of inside flush().
- Change the start_in_thinking default from True to False.
- Add comprehensive tests for the carry cap, flush with incomplete prefix tags, the IN_PARAMETER state, and Router-inherits-Scrubber verification.
@waybarrios & @janhilgard thanks for the reviews! Let me know if there's anything else you think needs to be addressed.
This PR implements #129 and includes test coverage.
Fix (6805baf): Adds _AnthropicStreamScrubber, a stateful stream scrubber that strips <think>…</think>, <tool_call>…</tool_call>, <function=NAME>…, and <parameter=NAME>… markup from Anthropic /v1/messages streaming text deltas. When tools are present in the request, the scrubber is activated so clients only see clean text and structured tool_use blocks. Handles tags split across token boundaries via a carry buffer.
Tests (c1f7cac): Adds 83 unit tests for _AnthropicStreamScrubber covering:
- Suppression of all tag types (single delta, split across boundaries, nested)
- Stray closing tag removal
- Carry buffer correctness when tags are split character-by-character
- State machine transitions (TEXT ↔ IN_THINK / IN_TOOLCALL / IN_FUNCTION)
- flush() semantics (emit vs discard based on mode, residual tag stripping)
- Edge cases (empty blocks, unicode, very long content, angle brackets in plain text)
- Realistic token-by-token streaming simulations
Pure logic tests with no model loading required.