Skip to content

[Bugfix] Validate post-reasoning structured output tokens in spec decode#40962

Open
liuyanyi wants to merge 3 commits into
vllm-project:mainfrom
liuyanyi:fix_mtp_reason
Open

[Bugfix] Validate post-reasoning structured output tokens in spec decode#40962
liuyanyi wants to merge 3 commits into
vllm-project:mainfrom
liuyanyi:fix_mtp_reason

Conversation

@liuyanyi
Copy link
Copy Markdown
Contributor

Purpose

This PR fixes the interaction between speculative decoding, reasoning parsers, and structured output.

When speculative decoding accepts multiple tokens in one scheduler step, the accepted token delta may cross the reasoning boundary. Tokens before the reasoning-end marker are still reasoning tokens and should not be constrained by the structured-output grammar, but tokens after the boundary are answer tokens and must be grammar-validated before being committed.

This PR adds an opt-in scheduler path that:

  • detects when a reasoning-end marker completes inside accepted speculative tokens;
  • validates only the post-boundary answer suffix with the structured-output grammar;
  • truncates rejected post-boundary speculative tokens before appending them to request output;
  • keeps scheduler accounting consistent when grammar validation rejects speculative tokens.

I’m not particularly familiar with the full scheduler or the spec path, so I’m not sure if this change will have any negative impact on the scheduler as a whole. I’d appreciate it if reviewer could take a look.

Relation to #36138

This PR is related to #36138, which addresses the same user-visible bug: when speculative decoding accepts a batch that contains the reasoning-end marker, grammar constraints may be skipped for the answer tokens that appear after that marker.

The two PRs differ mainly in where they solve the problem and how much of the speculative-decoding pipeline they change.

#36138 treats this as a general “mixed constrained/unconstrained speculative batch” problem. It splits draft/speculative token batches into unconstrained and constrained parts, then reuses that split across multiple grammar interaction points:

  • update_from_output();
  • update_draft_token_ids();
  • update_draft_token_ids_in_output();
  • grammar_bitmask().

It also makes grammar_bitmask() reasoning-aware at the per-position level: positions before and including the reasoning-end marker are left unconstrained, while positions after the marker are grammar-constrained. To compute those later bitmasks, it may temporarily advance grammar state over scheduled speculative tokens and then roll the grammar back.

This PR intentionally takes a narrower approach.

Instead of changing draft-token validation and bitmask generation, it handles the issue at the accepted-token commit point in Scheduler.update_from_output(). At that point, the target model has already accepted the speculative tokens, and the scheduler is about to append them to the final request output. If the accepted token delta crosses the reasoning boundary, this PR validates only the post-boundary suffix, truncates rejected suffix tokens, and commits only the valid answer-phase suffix to the grammar state.

The tradeoff is intentional: #36138 is a broader pipeline-level fix that attempts to make speculative grammar constraints correct earlier, including bitmask construction. This PR is a smaller, opt-in, commit-time validation fix. It does not try to produce mixed-position grammar masks for the same speculative batch; instead, it guarantees that answer-phase tokens are grammar-validated before they enter the final output and before grammar state is permanently advanced.

Why add new ReasoningParser methods?

The scheduler should not hard-code model/parser-specific reasoning-end markers. Different reasoning parsers may use different marker shapes, including single-token markers, multi-token markers, or context-dependent streaming checks.

This PR adds parser-level APIs so the scheduler can ask the active parser:

  • whether a speculative token delta may contain a reasoning-end marker;
  • where the reasoning-end marker completes inside that delta.

This keeps boundary detection owned by the reasoning parser, while keeping the scheduler logic generic. The default implementation is conservative, and BaseThinkingReasoningParser provides an efficient override for single-token end markers.

Why guard this with an env var?

The new path is both behavior-sensitive and performance-sensitive.

Behavior-wise, it changes a narrow combination of features: speculative decoding + reasoning parser + structured output + grammar disabled during reasoning. Parser implementations may not all have been validated with this speculative boundary path yet, especially if their reasoning-end markers are multi-token or context-dependent.

Performance-wise, this logic runs in the scheduler output path. Even with a cheap precheck, enabling it globally would add extra branching and parser calls to speculative decoding requests that use structured output. For common parsers with a single-token end marker, the added may_have_reasoning_end_in_delta() check is cheap, but the generic fallback is intentionally conservative and may need to inspect the accepted delta more deeply. Keeping the feature behind an env var avoids adding this overhead to all users by default.

I observed a significant performance drop when using a custom reasoning parser (with multiple tokens as reasoning ends), so I believe it would be better not to enable this behavior by default.

For that reason, the behavior is opt-in behind:

VLLM_SPEC_REASONING_BOUNDARY_VALIDATION=1

The default remains disabled to avoid unexpected regressions and unnecessary scheduler-path overhead while allowing users and CI to validate the behavior incrementally.

Test Plan

Run focused tests for reasoning-boundary detection and scheduler handling of speculative tokens across the reasoning boundary.

Test Result

.venv/bin/python -m pytest tests/reasoning/test_reasoning_boundary_index.py tests/v1/structured_output/test_spec_decode_reasoning_boundary.py -q

Result:

9 passed, 16 warnings

AI assistance was used to help prepare this change and PR description. The human submitter reviewed the change and is responsible for validating the behavior end-to-end.

…ive tokens

Signed-off-by: liuyanyi <wolfsonliu@163.com>
Assisted-by: OpenAI Codex
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Assisted-by: OpenAI Codex
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Assisted-by: OpenAI Codex
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the bug Something isn't working label Apr 27, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces reasoning-boundary validation for speculative decoding in the V1 scheduler. It enables the engine to detect the end of a reasoning phase within accepted speculative tokens and ensures that any following tokens are correctly validated against the grammar. Feedback identifies a high-severity calculation error in speculative decoding statistics when grammar truncation occurs and a performance concern regarding O(N) list copying in the fallback reasoning-end detection logic.

spec_decoding_stats = self.make_spec_decoding_stats(
spec_decoding_stats,
num_draft_tokens=num_draft_tokens,
num_accepted_tokens=min(len(new_token_ids), num_accepted),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The calculation of num_accepted_tokens for speculative decoding statistics is incorrect when grammar truncation occurs. len(new_token_ids) includes the 'bonus' token (the one sampled from the target model), while num_accepted represents the number of draft tokens accepted by the target model (calculated as len(generated_token_ids) - 1 at line 1380). If the grammar truncates the accepted tokens, len(new_token_ids) will be 1 (bonus) + num_actually_accepted_drafts. Thus, num_accepted_tokens should be calculated as min(len(new_token_ids) - 1, num_accepted) to correctly reflect the number of draft tokens that survived both target model verification and grammar validation. Using min(len(new_token_ids), num_accepted) will over-count the accepted draft tokens by 1 in most cases where truncation happens after the bonus token.

Suggested change
num_accepted_tokens=min(len(new_token_ids), num_accepted),
num_accepted_tokens=min(max(0, len(new_token_ids) - 1), num_accepted),

The index in ``delta_ids`` where the reasoning-end marker completes,
or ``None`` if the marker does not complete inside ``delta_ids``.
"""
current_input_ids = list(prefix_ids)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This line creates a full copy of prefix_ids (the entire request history) on every speculative decoding step for requests using structured output across a reasoning boundary. For long contexts (e.g., 128k tokens), this $O(N)$ operation inside the scheduler's hot loop will cause a significant performance degradation. Since is_reasoning_end_streaming is called in a loop over delta_ids, the overall complexity is $O(N + D)$ where $N$ is context length and $D$ is the number of speculative tokens. While the common BaseThinkingReasoningParser provides an optimized override, other parsers using this fallback will suffer from poor scalability. Consider optimizing this by only passing a suffix of the prefix that is long enough to contain any potential reasoning-end marker, or by using a sequence wrapper that avoids physical concatenation.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @liuyanyi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant