[Bugfix] Validate post-reasoning structured output tokens in spec decode by liuyanyi · Pull Request #40962 · vllm-project/vllm

liuyanyi · 2026-04-27T04:55:48Z

Purpose

This PR fixes the interaction between speculative decoding, reasoning parsers, and structured output.

When speculative decoding accepts multiple tokens in one scheduler step, the accepted token delta may cross the reasoning boundary. Tokens before the reasoning-end marker are still reasoning tokens and should not be constrained by the structured-output grammar, but tokens after the boundary are answer tokens and must be grammar-validated before being committed.

This PR adds an opt-in scheduler path that:

detects when a reasoning-end marker completes inside accepted speculative tokens;
validates only the post-boundary answer suffix with the structured-output grammar;
truncates rejected post-boundary speculative tokens before appending them to request output;
keeps scheduler accounting consistent when grammar validation rejects speculative tokens.

I’m not particularly familiar with the full scheduler or the spec path, so I’m not sure if this change will have any negative impact on the scheduler as a whole. I’d appreciate it if reviewer could take a look.

Relation to #36138

This PR is related to #36138, which addresses the same user-visible bug: when speculative decoding accepts a batch that contains the reasoning-end marker, grammar constraints may be skipped for the answer tokens that appear after that marker.

The two PRs differ mainly in where they solve the problem and how much of the speculative-decoding pipeline they change.

#36138 treats this as a general “mixed constrained/unconstrained speculative batch” problem. It splits draft/speculative token batches into unconstrained and constrained parts, then reuses that split across multiple grammar interaction points:

update_from_output();
update_draft_token_ids();
update_draft_token_ids_in_output();
grammar_bitmask().

It also makes grammar_bitmask() reasoning-aware at the per-position level: positions before and including the reasoning-end marker are left unconstrained, while positions after the marker are grammar-constrained. To compute those later bitmasks, it may temporarily advance grammar state over scheduled speculative tokens and then roll the grammar back.

This PR intentionally takes a narrower approach.

Instead of changing draft-token validation and bitmask generation, it handles the issue at the accepted-token commit point in Scheduler.update_from_output(). At that point, the target model has already accepted the speculative tokens, and the scheduler is about to append them to the final request output. If the accepted token delta crosses the reasoning boundary, this PR validates only the post-boundary suffix, truncates rejected suffix tokens, and commits only the valid answer-phase suffix to the grammar state.

The tradeoff is intentional: #36138 is a broader pipeline-level fix that attempts to make speculative grammar constraints correct earlier, including bitmask construction. This PR is a smaller, opt-in, commit-time validation fix. It does not try to produce mixed-position grammar masks for the same speculative batch; instead, it guarantees that answer-phase tokens are grammar-validated before they enter the final output and before grammar state is permanently advanced.

Why add new `ReasoningParser` methods?

The scheduler should not hard-code model/parser-specific reasoning-end markers. Different reasoning parsers may use different marker shapes, including single-token markers, multi-token markers, or context-dependent streaming checks.

This PR adds parser-level APIs so the scheduler can ask the active parser:

whether a speculative token delta may contain a reasoning-end marker;
where the reasoning-end marker completes inside that delta.

This keeps boundary detection owned by the reasoning parser, while keeping the scheduler logic generic. The default implementation is conservative, and BaseThinkingReasoningParser provides an efficient override for single-token end markers.

Why guard this with an env var?

The new path is both behavior-sensitive and performance-sensitive.

Behavior-wise, it changes a narrow combination of features: speculative decoding + reasoning parser + structured output + grammar disabled during reasoning. Parser implementations may not all have been validated with this speculative boundary path yet, especially if their reasoning-end markers are multi-token or context-dependent.

Performance-wise, this logic runs in the scheduler output path. Even with a cheap precheck, enabling it globally would add extra branching and parser calls to speculative decoding requests that use structured output. For common parsers with a single-token end marker, the added may_have_reasoning_end_in_delta() check is cheap, but the generic fallback is intentionally conservative and may need to inspect the accepted delta more deeply. Keeping the feature behind an env var avoids adding this overhead to all users by default.

I observed a significant performance drop when using a custom reasoning parser (with multiple tokens as reasoning ends), so I believe it would be better not to enable this behavior by default.

For that reason, the behavior is opt-in behind:

VLLM_SPEC_REASONING_BOUNDARY_VALIDATION=1

The default remains disabled to avoid unexpected regressions and unnecessary scheduler-path overhead while allowing users and CI to validate the behavior incrementally.

Test Plan

Run focused tests for reasoning-boundary detection and scheduler handling of speculative tokens across the reasoning boundary.

Test Result

.venv/bin/python -m pytest tests/reasoning/test_reasoning_boundary_index.py tests/v1/structured_output/test_spec_decode_reasoning_boundary.py -q

Result:

9 passed, 16 warnings

AI assistance was used to help prepare this change and PR description. The human submitter reviewed the change and is responsible for validating the behavior end-to-end.

…ive tokens Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex

Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces reasoning-boundary validation for speculative decoding in the V1 scheduler. It enables the engine to detect the end of a reasoning phase within accepted speculative tokens and ensures that any following tokens are correctly validated against the grammar. Feedback identifies a high-severity calculation error in speculative decoding statistics when grammar truncation occurs and a performance concern regarding O(N) list copying in the fallback reasoning-end detection logic.

gemini-code-assist · 2026-04-27T04:57:53Z

+                spec_decoding_stats = self.make_spec_decoding_stats(
+                    spec_decoding_stats,
+                    num_draft_tokens=num_draft_tokens,
+                    num_accepted_tokens=min(len(new_token_ids), num_accepted),


The calculation of num_accepted_tokens for speculative decoding statistics is incorrect when grammar truncation occurs. len(new_token_ids) includes the 'bonus' token (the one sampled from the target model), while num_accepted represents the number of draft tokens accepted by the target model (calculated as len(generated_token_ids) - 1 at line 1380). If the grammar truncates the accepted tokens, len(new_token_ids) will be 1 (bonus) + num_actually_accepted_drafts. Thus, num_accepted_tokens should be calculated as min(len(new_token_ids) - 1, num_accepted) to correctly reflect the number of draft tokens that survived both target model verification and grammar validation. Using min(len(new_token_ids), num_accepted) will over-count the accepted draft tokens by 1 in most cases where truncation happens after the bonus token.

Suggested change

num_accepted_tokens=min(len(new_token_ids), num_accepted),

num_accepted_tokens=min(max(0, len(new_token_ids) - 1), num_accepted),

gemini-code-assist · 2026-04-27T04:57:53Z

+            The index in ``delta_ids`` where the reasoning-end marker completes,
+            or ``None`` if the marker does not complete inside ``delta_ids``.
+        """
+        current_input_ids = list(prefix_ids)


This line creates a full copy of prefix_ids (the entire request history) on every speculative decoding step for requests using structured output across a reasoning boundary. For long contexts (e.g., 128k tokens), this $O(N)$ operation inside the scheduler's hot loop will cause a significant performance degradation. Since is_reasoning_end_streaming is called in a loop over delta_ids, the overall complexity is $O(N + D)$ where $N$ is context length and $D$ is the number of speculative tokens. While the common BaseThinkingReasoningParser provides an optimized override, other parsers using this fallback will suffer from poor scalability. Consider optimizing this by only passing a suffix of the prefix that is long enough to contain any potential reasoning-end marker, or by using a sequence wrapper that avoids physical concatenation.

mergify · 2026-05-23T09:29:33Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @liuyanyi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

liuyanyi added 3 commits April 27, 2026 04:53

feat(reasoning): implement reasoning boundary validation for speculat…

b0344a9

…ive tokens Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex

[Bugfix] 修复调度器在边界路径中对推理状态的初始化要求

3a2e43f

Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex

fix(envs): 移动推理边界验证环境变量到字典末尾以提高可读性

e8d8ac5

Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex

liuyanyi requested review from ApostaC, WoosukKwon, aarnphm, alexm-redhat, bbrowning, benchislett, chaunceyjiang, heheda12345, mgoin, njhill, orozery, robertgshaw2-redhat, russellb, sfeng33 and ywang96 as code owners April 27, 2026 04:55

claude Bot reviewed Apr 27, 2026

View reviewed changes

mergify Bot added structured-output v1 labels Apr 27, 2026

github-project-automation Bot added this to Structured Output Apr 27, 2026

mergify Bot added the bug Something isn't working label Apr 27, 2026

gemini-code-assist Bot reviewed Apr 27, 2026

View reviewed changes

mergify Bot added the needs-rebase label May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] Validate post-reasoning structured output tokens in spec decode#40962

[Bugfix] Validate post-reasoning structured output tokens in spec decode#40962
liuyanyi wants to merge 3 commits into
vllm-project:mainfrom
liuyanyi:fix_mtp_reason

liuyanyi commented Apr 27, 2026

Uh oh!

claude Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 27, 2026

Uh oh!

gemini-code-assist Bot Apr 27, 2026

Uh oh!

mergify Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	num_accepted_tokens=min(len(new_token_ids), num_accepted),
	num_accepted_tokens=min(max(0, len(new_token_ids) - 1), num_accepted),

Uh oh!

Conversation

liuyanyi commented Apr 27, 2026

Purpose

Relation to #36138

Why add new ReasoningParser methods?

Why guard this with an env var?

Test Plan

Test Result

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why add new `ReasoningParser` methods?