perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens) #234
Conversation
The streaming reasoning parser (`BaseThinkingReasoningParser`) scans the full accumulated output text for `<think>`/`</think>` on every token via `in` checks on `previous_text` and `current_text`. This is O(N) per token and O(N²) over a full generation, becoming measurable at longer outputs (5ms+ at 2k tokens, 141ms at 10k tokens).
Replace with a three-phase state machine (`pre_think` → `thinking` → `content`) that tracks transitions using only the delta text. Each token is now O(1) regardless of output length.
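A minimal standalone sketch of the approach as described (not the PR's actual code; the real implementation lives in `vllm_mlx/reasoning/think_parser.py` and keeps the same three-argument signature, and the class and message names here are simplified):

```python
from dataclasses import dataclass


@dataclass
class DeltaMessage:
    reasoning: str | None = None
    content: str | None = None


class StreamingThinkParser:
    """Three-phase state machine: pre_think -> thinking -> content."""

    start_token = "<think>"
    end_token = "</think>"

    def __init__(self) -> None:
        self.reset_state()

    def reset_state(self) -> None:
        # Called once per request so no phase leaks across generations.
        self._phase = "pre_think"

    def extract_reasoning_streaming(
        self, previous_text: str, current_text: str, delta_text: str
    ) -> DeltaMessage | None:
        # previous_text/current_text are accepted only for signature
        # compatibility; the state machine reads just the delta.
        if not delta_text:
            return None
        if self._phase == "pre_think":
            if self.start_token in delta_text:
                after = delta_text.split(self.start_token, 1)[1]
                if self.end_token in after:
                    # Both tags arrived inside a single delta.
                    self._phase = "content"
                    before, rest = after.split(self.end_token, 1)
                    return DeltaMessage(reasoning=before or None,
                                        content=rest or None)
                self._phase = "thinking"
                return DeltaMessage(reasoning=after) if after else None
            if self.end_token in delta_text:
                # Implicit mode: <think> was in the prompt, only
                # </think> appears in the output.
                self._phase = "content"
                before, after = delta_text.split(self.end_token, 1)
                return DeltaMessage(reasoning=before or None,
                                    content=after or None)
            return DeltaMessage(reasoning=delta_text)
        if self._phase == "thinking":
            if self.end_token in delta_text:
                self._phase = "content"
                before, after = delta_text.split(self.end_token, 1)
                return DeltaMessage(reasoning=before or None,
                                    content=after or None)
            return DeltaMessage(reasoning=delta_text)
        return DeltaMessage(content=delta_text)
```

Note that this sketch only ever looks at `delta_text`; that property is exactly what the review discussion below examines.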
Benchmark results (streaming parser overhead, simulated server loop):

| Tokens | Old (scan) | New (state) | Speedup |
|-------:|-----------:|------------:|--------:|
| 500    | 0.37ms     | 0.04ms      | 8.6x    |
| 1000   | 1.38ms     | 0.10ms      | 13.5x   |
| 2000   | 5.28ms     | 0.28ms      | 19.1x   |
| 5000   | 34.03ms    | 2.05ms      | 16.6x   |
| 10000  | 141.26ms   | 10.16ms     | 13.9x   |
At 50 tok/s decode on Apple Silicon, each token has a 20ms budget. The old parser's overhead compounds quadratically (5.3ms cumulative at 2k tokens, 141ms at 10k), so its per-token cost keeps growing with output length. The new parser stays under 0.01ms/tok at any length.
Changes:
- `think_parser.py`: Rewrote `extract_reasoning_streaming()` as a state machine with `_phase` tracking. `reset_state()` initializes the phase. All three scenarios preserved (explicit tags, implicit mode, no tags). Method signature unchanged for backward compatibility.
- `benchmarks/bench_reasoning_parser.py`: Added streaming parser benchmark (harness sketched below).
No changes to `extract_reasoning()` (non-streaming path) — it only runs once per request and is not on the hot path.
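The benchmark file itself isn't reproduced in this thread; a harness of roughly this shape matches the "simulated server loop" measurement described above (hypothetical helper, written against the sketch shown earlier):

```python
import time


def bench_streaming(parser, n_tokens: int) -> float:
    """Total parser overhead (ms) for one simulated stream.

    Feeds one delta per token and times only the parser calls;
    n_tokens must be >= 2 (opening-tag and closing-tag deltas).
    """
    parser.reset_state()
    deltas = ["<think>"] + [" step"] * (n_tokens - 2) + ["</think> done"]
    previous = ""
    t0 = time.perf_counter()
    for delta in deltas:
        current = previous + delta  # the server accumulates text anyway
        parser.extract_reasoning_streaming(previous, current, delta)
        previous = current
    return (time.perf_counter() - t0) * 1e3


# e.g. bench_streaming(StreamingThinkParser(), 2_000)
```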
+1, this is a clean improvement. The O(N²) rescanning in the current parser is unnecessary overhead that compounds on long thinking outputs. We run Qwen3.5-122B in production with thinking enabled -- outputs regularly hit 1-2K reasoning tokens per response. The state machine approach is the right call for streaming parsers. One question: does the state machine handle the case where a tag is split across two deltas? Would be good to see this benchmarked on a real model endpoint, not just synthetic data -- the actual bottleneck may shift when network I/O dominates.
**Thump604** left a comment:
### Critical Issue: Partial Tag Boundaries Across Token Chunks
The state machine fails when a tag is split across delta boundaries. Example:

```
Token N:   delta_text = "...model thinks <thi"
Token N+1: delta_text = "nk> the answer is..."
```

Neither delta contains the complete `<think>` tag, so both substring checks fail:

```python
"<think>" in "<thi"            # False
"<think>" in "nk> the answer"  # False
```

Result: the phase never transitions from `pre_think` to `thinking`, and reasoning gets misclassified as content.
### Why This Happens

vllm streams tokens as they are generated by the model. A token boundary can split XML tags arbitrarily. The old code avoided this by checking `current_text` (accumulated), which contains the complete tag once all of its pieces have arrived. The PR removes that check to achieve O(1), introducing the regression.
### Test Case

I created a test showing the failure mode:

```
delta 1: " <thi"        → tag not found → phase stays "pre_think"
delta 2: "nk> about..." → tag not found → phase stays "pre_think" ❌
```

Compare to complete tags:

```
delta 1: " <think>" → tag found → phase = "thinking" ✓
```
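The substring behavior is easy to reproduce standalone (hypothetical test, not the PR's actual test file):

```python
def test_split_tag_is_invisible_to_per_delta_checks():
    deltas = [" <thi", "nk> about..."]
    # Neither delta alone contains the tag, so delta-only checks miss it,
    assert all("<think>" not in d for d in deltas)
    # while the accumulated text contains it once both deltas arrive.
    assert "<think>" in "".join(deltas)
```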
### Fix Options

1. **Keep using `current_text` for tag detection** (revert the optimization where it matters):
   - Check `start_token in current_text` instead of `start_token in delta_text`
   - Still maintains the state machine for phase tracking (not O(N²) rescanning)
   - Minimal overhead, solves the split-tag bug
   - Cost: loses the pure O(1) claim, but O(delta_len) per token is still fast
2. **Accumulate partial tags across deltas** (buffer incomplete tags):
   - Keep a `_partial_token` buffer
   - Combine `_partial_token + delta_text` before tag matching
   - More complex, requires careful state management
3. **Whitelist known tokenizers and assume tags never split**:
   - Document the assumption
   - Risk: fails silently if the tokenizer changes or the assumption is wrong
I recommend option 1 — it preserves the architectural improvement (state machine) while keeping correctness. The overhead of one substring scan per token is negligible compared to inference latency.
### Secondary Issue: O(1) Claim

Minor: the PR claims O(1), but `str.find()` is O(k) in the scanned length. The improvement is really from O(accumulated_len) to O(delta_len) per token, not to O(1). For typical deltas (~3-5 tokens) this is near-constant, but the terminology should match reality.
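Worth noting: the accumulated-text scan in option 1 can itself be bounded. A tag that was already complete on an earlier call would have been handled then, so a newly completed tag must end inside the latest delta; scanning the delta plus `len(tag) - 1` characters of lookback is enough. A sketch (hypothetical helper, not code from the PR):

```python
def find_tag(current_text: str, prev_len: int, tag: str) -> int:
    """Index of `tag` in current_text, or -1 if absent.

    Scans only the region the newest delta could have completed a
    tag in, so each call costs O(len(delta) + len(tag)) rather than
    O(len(current_text)).
    """
    return current_text.find(tag, max(0, prev_len - len(tag) + 1))
```

This keeps option 1's split-tag handling while bringing the per-token cost down to O(delta_len), which is what the terminology should reflect.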
### Bonus: Benchmark Correctness
The benchmark avoids the split-tag issue by constructing deltas manually. Real streaming from vllm may hit it depending on tokenizer boundaries. Worth testing against actual model output.
**Thump604** left a comment:
Clean refactor. The O(N) per-token text scanning was a real issue for long reasoning outputs (thousands of tokens of thinking before content). The state machine approach is correct: three phases, transitions on tag detection in delta only.
A few notes:
- `reset_state()` needs to be called at the start of each request. Verified this is handled by the parser lifecycle in the server.
- The benchmark is a useful addition for regression testing.
- The method signature keeping `previous_text`/`current_text` for backward compatibility is the right call.
Tested with Qwen3.5-122B in production with thinking enabled. No regressions observed.
@waybarrios, @penumbraforge: brief endorsement. The perf claim is structurally sound: the old code path performs four `in` checks per token over the accumulated text, while the new one inspects only the delta. I also have two minor suggestions for the test side (not blocking).
Mergeable on current `main` per the PR JSON.
@penumbraforge follow-up to my earlier endorsement: one real concern from a closer pass through the state machine code. The new `_phase` attribute is mutable per-instance state on a parser the server holds as a singleton. If `reset_state()` isn't called before every streaming request, the phase left over from one generation leaks into the next. Two ways to fix this:

1. Ensure the server calls `reset_state()` at the start of every streaming request.
2. Construct a fresh parser instance per request so no state can carry over.

Option 1 is simpler and matches the PR's existing `reset_state()` hook. The old code had no state at all (text-based detection on every call), so this is a genuinely new failure mode the state-machine refactor introduces. Worth catching before merge rather than after.

The original perf endorsement still stands: the algorithmic improvement from O(N²) to O(1) per token is real and valuable. Just want to make sure correctness across requests is preserved.
**waybarrios** left a comment:
Good architectural improvement — the state machine is the right approach. Two things before merging:
**1. Split-tag bug (raised by @Thump604)**

The state machine only checks `delta_text` for tags, which breaks when a tag is split across token boundaries:

```python
# Token N:   delta = "...some text <thi"  → "<think>" not in delta → no transition
# Token N+1: delta = "nk> reasoning..."   → "<think>" not in delta → no transition
# Result: _phase stays "pre_think", reasoning is never detected
```

In practice `<think>` is usually a single token in Qwen3/DeepSeek tokenizers, but it's not guaranteed — and the old code handled this correctly by checking `current_text` (accumulated).
**2. Suggested fix: hybrid approach (Thump604's option 1)**

Use `current_text` for tag detection, keep the state machine for phase tracking. With the tag search bounded to the newest delta plus a small lookback window (the `find_tag` helper sketched in the review above), this costs O(delta_len) per token instead of O(1), but it still eliminates the O(N) rescanning that causes the quadratic blowup:
```python
def extract_reasoning_streaming(
    self,
    previous_text: str,
    current_text: str,
    delta_text: str,
) -> DeltaMessage | None:
    if not delta_text:
        return None

    start_tok = self.start_token
    end_tok = self.end_token
    prev_len = len(previous_text)

    def find_tag(tag: str) -> int:
        # A tag complete on an earlier call already triggered its
        # transition, so an unseen tag can only *end* inside this
        # delta: scan the delta plus len(tag)-1 chars of lookback.
        # O(delta_len + tag_len) per call, and handles split tags.
        return current_text.find(tag, max(0, prev_len - len(tag) + 1))

    # ── Phase: pre_think ──────────────────────────────────
    if self._phase == "pre_think":
        pos = find_tag(start_tok)
        if pos >= 0:
            self._phase = "thinking"
            # Slice via current_text offsets so no tag fragment leaks
            # when the tag straddles a chunk boundary.
            after = current_text[pos + len(start_tok):]
            eidx = after.find(end_tok)
            if eidx >= 0:  # both tags arrived within one delta
                self._phase = "content"
                return DeltaMessage(
                    reasoning=after[:eidx] or None,
                    content=after[eidx + len(end_tok):] or None,
                )
            return DeltaMessage(reasoning=after) if after else None

        pos = find_tag(end_tok)
        if pos >= 0:
            # Implicit mode: <think> was in the prompt, only </think>
            # appears in the output.
            self._phase = "content"
            return DeltaMessage(
                reasoning=current_text[prev_len:pos] or None,
                content=current_text[pos + len(end_tok):] or None,
            )

        # Caveat: a partial tag prefix at the end of this delta still
        # streams out before the tag completes; suppressing it would
        # need buffering (Thump604's option 2).
        return DeltaMessage(reasoning=delta_text)

    # ── Phase: thinking ───────────────────────────────────
    if self._phase == "thinking":
        pos = find_tag(end_tok)
        if pos >= 0:
            self._phase = "content"
            return DeltaMessage(
                reasoning=current_text[prev_len:pos] or None,
                content=current_text[pos + len(end_tok):] or None,
            )
        return DeltaMessage(reasoning=delta_text)

    # ── Phase: content ────────────────────────────────────
    return DeltaMessage(content=delta_text)
```

This keeps the state machine (no more `_handle_explicit_think` / `_handle_implicit_think` spaghetti), preserves the perf win (no O(N) rescanning per token), and handles split tags correctly.
**Note: `reset_state()` is already wired**

Verified that server.py calls `_reasoning_parser.reset_state()` at lines 1905 and 2143 before each streaming request, so the singleton concern is covered.
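For context, the per-request lifecycle then looks roughly like this (hypothetical sketch; server.py internals aren't reproduced in this thread):

```python
from collections.abc import AsyncIterator


async def stream_reasoning(parser, token_stream: AsyncIterator[str]):
    # Reset once per request so the singleton parser's phase never
    # leaks from one generation into the next.
    parser.reset_state()
    previous = ""
    async for delta in token_stream:
        current = previous + delta
        msg = parser.extract_reasoning_streaming(previous, current, delta)
        previous = current
        if msg is not None:
            yield msg
```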
I pushed the split-tag fix. The state machine stays in place, but tag completion is now detected against the accumulated text, so tags split across delta boundaries are handled correctly.
Nice optimization @penumbraforge — the O(N²) → O(1) streaming parser is a clean architectural win. The state machine approach is correct and the benchmark data is convincing (19x at 2k tokens, relevant at 50+ tok/s on Apple Silicon). Good to see it's backward-compatible with existing subclasses (Qwen3, DeepSeek, Harmony) and no merge conflicts with current `main`.
Incorporates 53 upstream commits including:
- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features (frequency_penalty conversion, tool markup safety nets, openai_to_anthropic import) while adopting upstream improvements (Gemma4 parser rewrite, cleaner logging, _model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Replaces the accumulated text scanning in `BaseThinkingReasoningParser.extract_reasoning_streaming()` with a state machine that only inspects the delta text per token.

**Before:** Four `in` checks against the full accumulated text every token (`start_token in previous_text`, `start_token in current_text`, `end_token in previous_text`, `end_token in delta_text`). This is O(N) per token and O(N²) over a generation.

**After:** Three-phase state machine (`pre_think` → `thinking` → `content`) that checks only the delta for tag transitions. O(1) per token regardless of output length.

## Motivation
The current streaming parser is stateless — it rescans the full accumulated output on every token to determine the reasoning/content phase. The overhead is modest at typical output lengths (a few milliseconds total at 2k tokens), but it scales quadratically (see the benchmark table above).
At 50+ tok/s on Apple Silicon, the parser overhead at 10k+ tokens (141ms) starts to become noticeable. The state machine keeps overhead constant at any length.
More importantly, this is an architectural improvement — the parser no longer depends on accumulated text at all, which simplifies reasoning about correctness and opens the door to removing the `accumulated_text` concatenation in the server's streaming loop in a future change.

## Changes
- `vllm_mlx/reasoning/think_parser.py`: Rewrote `extract_reasoning_streaming()` as a state machine with `_phase` tracking. `reset_state()` initializes the phase. All three input scenarios preserved:
  - Explicit tags (`<think>...</think>` in output)
  - Implicit mode (`<think>` in prompt, only `</think>` in output)
  - No tags in the output

  Method signature unchanged — `previous_text` and `current_text` are accepted but the state machine doesn't need them, maintaining backward compatibility with callers and subclasses.
- `benchmarks/bench_reasoning_parser.py`: Streaming parser benchmark measuring per-token overhead at various output lengths.

## Compatibility
- `extract_reasoning()` (non-streaming, complete output) is unchanged
- `reset_state()` is already called by server.py before each request

## Testing
Verified correct phase transitions across all scenarios:
- Explicit: `<think>` → reasoning tokens → `</think>` → content tokens
- Implicit: no `<think>` in the output, only `</think>` (tag opened in the prompt)
- No tags in the output
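The PR's test file isn't reproduced here; scenario checks of this shape would exercise the transitions above (written against the simplified `StreamingThinkParser` sketched near the top of this thread):

```python
def run(parser, deltas):
    parser.reset_state()
    previous, out = "", []
    for delta in deltas:
        current = previous + delta
        msg = parser.extract_reasoning_streaming(previous, current, delta)
        if msg is not None:
            out.append((msg.reasoning, msg.content))
        previous = current
    return out


def test_explicit_tags():
    out = run(StreamingThinkParser(), ["<think>", "plan", "</think> answer"])
    assert out == [("plan", None), (None, " answer")]


def test_implicit_mode():
    out = run(StreamingThinkParser(), ["reason", "ing</think>", "answer"])
    assert out == [("reason", None), ("ing", None), (None, "answer")]
```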