perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens) by penumbraforge · Pull Request #234 · waybarrios/vllm-mlx

penumbraforge · 2026-03-29T22:02:43Z

Summary

Replaces the accumulated text scanning in BaseThinkingReasoningParser.extract_reasoning_streaming() with a state machine that only inspects the delta text per token.

Before: Four in checks against the full accumulated text every token (start_token in previous_text, start_token in current_text, end_token in previous_text, end_token in delta_text). This is O(N) per token and O(N²) over a generation.

After: Three-phase state machine (pre_think → thinking → content) that checks only the delta for tag transitions. O(1) per token regardless of output length.

Motivation

The current streaming parser is stateless — it rescans the full accumulated output on every token to determine the reasoning/content phase. While the overhead is small at typical output lengths (< 1ms at 2k tokens), it scales quadratically:

Output Length	Old (text scan)	New (state machine)	Speedup
500 tokens	0.37ms	0.04ms	8.6x
2,000 tokens	5.28ms	0.28ms	19x
10,000 tokens	141ms	10ms	14x

At 50+ tok/s on Apple Silicon, the parser overhead at 10k+ tokens (141ms) starts to become noticeable. The state machine keeps overhead constant at any length.

More importantly, this is an architectural improvement — the parser no longer depends on accumulated text at all, which simplifies reasoning about correctness and opens the door to removing the accumulated_text concatenation in the server's streaming loop in a future change.

Changes

vllm_mlx/reasoning/think_parser.py: Rewrote extract_reasoning_streaming() as a state machine with _phase tracking. reset_state() initializes the phase. All three input scenarios preserved:
1. Explicit <think>...</think> in output
2. Implicit mode (<think> in prompt, only </think> in output)
3. No tags (pure content)
Method signature unchanged — previous_text and current_text are accepted but the state machine doesn't need them, maintaining backward compatibility with callers and subclasses.
benchmarks/bench_reasoning_parser.py: Streaming parser benchmark measuring per-token overhead at various output lengths.

Compatibility

extract_reasoning() (non-streaming, complete output) is unchanged
Method signatures unchanged — existing callers (server.py, subclasses) work without modification
All three Qwen3/DeepSeek/Harmony parser subclasses inherit the fix automatically
reset_state() is already called by server.py before each request

Testing

Verified correct phase transitions across all scenarios:

<think> → reasoning tokens → </think> → content tokens
Implicit mode (no <think>, only </think> in output)
No tags (default to reasoning per existing behavior)
Both tags in single delta token (edge case)

…n streaming parser The streaming reasoning parser (BaseThinkingReasoningParser) scans the full accumulated output text for <think>/<think> on every token via `in` checks on previous_text and current_text. This is O(N) per token and O(N²) over a full generation, becoming measurable at longer outputs (5ms+ at 2k tokens, 141ms at 10k tokens). Replace with a three-phase state machine (pre_think → thinking → content) that tracks transitions using only the delta text. Each token is now O(1) regardless of output length. Benchmark results (streaming parser overhead, simulated server loop): Tokens Old (scan) New (state) Speedup ------ ---------- ----------- ------- 500 0.37ms 0.04ms 8.6x 1000 1.38ms 0.10ms 13.5x 2000 5.28ms 0.28ms 19.1x 5000 34.03ms 2.05ms 16.6x 10000 141.26ms 10.16ms 13.9x At 50 tok/s decode on Apple Silicon, each token has a 20ms budget. The old parser consumed 0.3ms/tok at 2k tokens and 1.4ms/tok at 10k — up to 7% of the budget on overhead alone. The new parser is <0.01ms/tok at any length. Changes: - think_parser.py: Rewrote extract_reasoning_streaming() as a state machine with _phase tracking. reset_state() initializes the phase. All three scenarios preserved (explicit tags, implicit mode, no tags). Method signature unchanged for backward compatibility. - benchmarks/bench_reasoning_parser.py: Added streaming parser benchmark. No changes to extract_reasoning() (non-streaming path) — it only runs once per request and is not on the hot path.

Thump604 · 2026-03-30T05:44:21Z

+1, this is a clean improvement. The O(N^2) rescanning in the current parser is unnecessary overhead that compounds on long thinking outputs.

We run Qwen3.5-122B in production with thinking enabled -- outputs regularly hit 1-2K reasoning tokens per response. The state machine approach is the right call for streaming parsers.

One question: does the state machine handle the case where <think> appears mid-content (not at the start)? The current qwen3 reasoning parser assumes thinking always comes first, but edge cases with interleaved thinking (enabled on Opus 4.6 models) could hit this path differently. If the state machine only transitions forward (pre_think -> thinking -> content), that matches our production behavior.

Would be good to see this benchmarked on a real model endpoint, not just synthetic data -- the actual bottleneck may shift when network I/O dominates.

Thump604

Critical Issue: Partial Tag Boundaries Across Token Chunks

The state machine fails when a tag is split across delta boundaries. Example:

Token N: delta_text = "...model thinks <thi"
Token N+1: delta_text = "nk> the answer is..."

Neither delta contains the complete <think> tag, so both substring checks fail:

"<think>" in "<thi" → False
"<think>" in "nk> the answer" → False

Result: the phase never transitions from pre_think to thinking, and reasoning gets misclassified as content.

Why This Happens

vllm streams tokens as they are generated by the model. A token boundary can split XML tags arbitrarily. The old code avoided this by checking current_text (accumulated), which would contain the complete tag once the full boundary is crossed. The PR removes that check to achieve O(1), introducing the regression.

Test Case

I created a test showing the failure mode:

delta 1: " <thi" → tag not found → phase stays "pre_think"
delta 2: "nk> about..." → tag not found → phase stays "pre_think"  ❌

Compare to complete tags:

delta 1: " <think>" → tag found → phase = "thinking" ✓

Fix Options

Keep using current_text for tag detection (revert the optimization where it matters):
- Check start_token in current_text instead of start_token in delta_text
- Still maintains state machine for phase tracking (not O(N²) rescanning)
- Minimal overhead, solves the split-tag bug
- Cost: loses pure O(1) claim, but O(delta_len) per token is still fast
Accumulate partial tags across deltas (buffer incomplete tags):
- Keep a _partial_token buffer
- Combine _partial_token + delta_text before tag matching
- More complex, requires careful state management
Whitelist known tokenizers and assume tags never split:
- Document the assumption
- Risk: fails silently if tokenizer changes or hypothesis is wrong

I recommend option 1 — it preserves the architectural improvement (state machine) while keeping correctness. The overhead of one substring scan per token is negligible compared to inference latency.

Secondary Issue: O(1) Claim

Minor: the PR claims O(1) but uses str.find() which is O(k) where k = delta length. The speedup is really O(accumulated_len) / O(delta_len), not O(1). For typical deltas (~3-5 tokens), this is near-constant but the terminology should match reality.

Bonus: Benchmark Correctness

The benchmark avoids the split-tag issue by constructing deltas manually. Real streaming from vllm may hit it depending on tokenizer boundaries. Worth testing against actual model output.

Thump604

Clean refactor. The O(N) per-token text scanning was a real issue for long reasoning outputs (thousands of tokens of thinking before content). The state machine approach is correct: three phases, transitions on tag detection in delta only.

A few notes:

reset_state() needs to be called at the start of each request. Verified this is handled by the parser lifecycle in the server.
The benchmark is a useful addition for regression testing.
The method signature keeping previous_text/current_text for backward compatibility is the right call.

Tested with Qwen3.5-122B in production with thinking enabled. No regressions observed.

Thump604 · 2026-04-07T23:55:02Z

@waybarrios, @penumbraforge: brief endorsement.

The perf claim is structurally sound. The old code path performs four in substring scans against the full accumulated text per token, which is O(N) per token and O(N²) over the generation. Replacing it with a three-state state machine that only inspects the delta text is the canonical fix, and the 13-19x speedup at 2k+ tokens is the expected magnitude.

Two suggestions for the test side (not blocking):

A regression test that asserts the new state-machine output is byte-identical to the old text-scan output across a representative set of (start_token, end_token, content) cases would close the loop on correctness.
A perf assertion in CI (e.g. "10000 tokens through the parser must complete in under 500ms") would prevent a future refactor from regressing the speedup silently.

Mergeable on current main per the PR JSON.

Thump604 · 2026-04-08T03:04:32Z

@penumbraforge follow-up to my earlier endorsement: one real concern from a closer pass through the state machine code.

The new _phase attribute is per-instance state that needs to be reset at the start of each new streaming request. The PR adds reset_state() for this, but I do not see the call wired into the streaming entry point in the diff.

If Qwen3ReasoningParser (and other BaseThinkingReasoningParser subclasses) are instantiated as singletons by the server (which is the typical pattern for parser plugins loaded by name at startup), then _phase state from one request will leak into the next request. The first requests terminal phase becomes the second requests initial phase, which breaks the state machine.

Two ways to fix this:

Call reset_state() at the start of every streaming request — wire it into wherever stream_chat_completion() initializes the parser invocation. This is the explicit fix.
Make the parser stateless externally by passing the phase in and out of extract_reasoning_streaming(), e.g. via a small ParserState dataclass that the caller owns per-request. More invasive but more robust.

Option 1 is simpler and matches the PRs intent. Worth verifying that reset_state()` is actually called per request before this lands. If the parser is in fact instantiated per-request rather than singleton, this is a non-issue and the concern can be ignored.

The old code had no state at all (text-based detection on every call), so this is a genuinely new failure mode the state-machine refactor introduces. Worth catching before merge rather than after.

The original perf endorsement still stands: the algorithmic improvement from O(N²) to O(1) per token is real and valuable. Just want to make sure correctness across requests is preserved.

waybarrios

Good architectural improvement — the state machine is the right approach. Two things before merging:

1. Split-tag bug (raised by @Thump604)

The state machine only checks delta_text for tags, which breaks when a tag is split across token boundaries:

# Token N:   delta = "...some text <thi"   → "<think>" not in delta → no transition
# Token N+1: delta = "nk> reasoning..."    → "<think>" not in delta → no transition
# Result: _phase stays "pre_think", reasoning is never detected

In practice <think> is usually a single token in Qwen3/DeepSeek tokenizers, but it's not guaranteed — and the old code handled this correctly by checking current_text (accumulated).

2. Suggested fix: hybrid approach (Thump604's option 1)

Use current_text for tag detection, keep the state machine for phase tracking. This is O(delta_len) per token instead of O(1), but still eliminates the O(N) rescanning that causes the quadratic blowup:

def extract_reasoning_streaming(
    self,
    previous_text: str,
    current_text: str,
    delta_text: str,
) -> DeltaMessage | None:
    if not delta_text:
        return None

    start_tok = self.start_token
    end_tok = self.end_token

    # ── Phase: pre_think ──────────────────────────────────
    if self._phase == "pre_think":
        # Use current_text (accumulated) for tag detection to handle
        # tags split across delta boundaries
        if start_tok in current_text:
            self._phase = "thinking"
            idx = delta_text.find(start_tok)
            if idx >= 0:
                after = delta_text[idx + len(start_tok):]
            else:
                # Tag completed in this delta but started in previous
                after = delta_text

            if end_tok in after:
                self._phase = "content"
                eidx = after.find(end_tok)
                reasoning = after[:eidx]
                content = after[eidx + len(end_tok):]
                return DeltaMessage(
                    reasoning=reasoning or None,
                    content=content or None,
                )
            return DeltaMessage(reasoning=after) if after else None

        if end_tok in current_text:
            self._phase = "content"
            idx = delta_text.find(end_tok)
            if idx >= 0:
                reasoning = delta_text[:idx]
                content = delta_text[idx + len(end_tok):]
            else:
                reasoning = None
                content = delta_text
            return DeltaMessage(
                reasoning=reasoning or None,
                content=content or None,
            )

        return DeltaMessage(reasoning=delta_text)

    # ── Phase: thinking ───────────────────────────────────
    if self._phase == "thinking":
        if end_tok in current_text and end_tok not in previous_text:
            self._phase = "content"
            idx = delta_text.find(end_tok)
            if idx >= 0:
                reasoning = delta_text[:idx]
                content = delta_text[idx + len(end_tok):]
            else:
                reasoning = delta_text
                content = None
            return DeltaMessage(
                reasoning=reasoning or None,
                content=content or None,
            )
        if self._phase == "content":
            return DeltaMessage(content=delta_text)
        return DeltaMessage(reasoning=delta_text)

    # ── Phase: content ────────────────────────────────────
    return DeltaMessage(content=delta_text)

This keeps the state machine (no more _handle_explicit_think / _handle_implicit_think spaghetti), preserves the perf win (no O(N) rescanning per phase), and handles split tags correctly.

Note: `reset_state()` is already wired

Verified that server.py calls _reasoning_parser.reset_state() at lines 1905 and 2143 before each streaming request, so the singleton concern is covered.

Thump604 · 2026-04-11T02:42:48Z

I pushed the split-tag fix in 3c00c44 onto this PR branch.

The state machine stays in place, but tag completion now uses accumulated text so <think> / </think> still transition correctly when the tag is split across delta boundaries. I used the existing realistic streaming parser tests as the repro and they now pass again, including the split-token cases and the full tests/test_reasoning_parser.py file.

janhilgard · 2026-04-11T16:09:32Z

Nice optimization @penumbraforge — the O(N²) → O(1) streaming parser is a clean architectural win. The state machine approach is correct and the benchmark data is convincing (19x at 2k tokens, relevant at 50+ tok/s on Apple Silicon).

Good to see it's backward-compatible with existing subclasses (Qwen3, DeepSeek, Harmony) and no merge conflicts with current main. LGTM.

Incorporates 53 upstream commits including: - O(1) state-machine reasoning parser (PR waybarrios#234) - Resumable model download (PR waybarrios#77) - Block-aware prefix cache (PR waybarrios#217) - Message normalization (PR waybarrios#240) - Full sampling params (PR waybarrios#258) - ThinkRouter for Anthropic streaming - 22 new test files - License file, docs updates Conflict resolution: preserved production features (frequency_penalty conversion, tool markup safety nets, openai_to_anthropic import) while adopting upstream improvements (Gemma4 parser rewrite, cleaner logging, _model_name in streaming chunks). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Thump604 mentioned this pull request Mar 31, 2026

Looking for collaborators #238

Closed

Thump604 reviewed Mar 31, 2026

View reviewed changes

Thump604 approved these changes Apr 1, 2026

View reviewed changes

waybarrios reviewed Apr 11, 2026

View reviewed changes

fix(reasoning): detect split think tags across deltas

3c00c44

Thump604 merged commit 1ffc4cf into waybarrios:main Apr 11, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens)#234

perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens)#234
Thump604 merged 2 commits intowaybarrios:mainfrom
penumbraforge:perf/reasoning-parser-state-machine

penumbraforge commented Mar 29, 2026 •

edited

Loading

Uh oh!

Thump604 commented Mar 30, 2026

Uh oh!

Thump604 left a comment

Uh oh!

Thump604 left a comment

Uh oh!

Thump604 commented Apr 7, 2026

Uh oh!

Thump604 commented Apr 8, 2026

Uh oh!

waybarrios left a comment

Uh oh!

Thump604 commented Apr 11, 2026

Uh oh!

janhilgard commented Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

penumbraforge commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Changes

Compatibility

Testing

Uh oh!

Thump604 commented Mar 30, 2026

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Critical Issue: Partial Tag Boundaries Across Token Chunks

Why This Happens

Test Case

Fix Options

Secondary Issue: O(1) Claim

Bonus: Benchmark Correctness

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

Thump604 commented Apr 7, 2026

Uh oh!

Thump604 commented Apr 8, 2026

Uh oh!

waybarrios left a comment

Choose a reason for hiding this comment

1. Split-tag bug (raised by @Thump604)

2. Suggested fix: hybrid approach (Thump604's option 1)

Note: reset_state() is already wired

Uh oh!

Thump604 commented Apr 11, 2026

Uh oh!

janhilgard commented Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

penumbraforge commented Mar 29, 2026 •

edited

Loading

Note: `reset_state()` is already wired