[Bugfix] Validate post-reasoning structured output tokens in spec decode#40962
[Bugfix] Validate post-reasoning structured output tokens in spec decode#40962liuyanyi wants to merge 3 commits into
Conversation
…ive tokens Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex
Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex
Signed-off-by: liuyanyi <wolfsonliu@163.com> Assisted-by: OpenAI Codex
There was a problem hiding this comment.
Code Review
This pull request introduces reasoning-boundary validation for speculative decoding in the V1 scheduler. It enables the engine to detect the end of a reasoning phase within accepted speculative tokens and ensures that any following tokens are correctly validated against the grammar. Feedback identifies a high-severity calculation error in speculative decoding statistics when grammar truncation occurs and a performance concern regarding O(N) list copying in the fallback reasoning-end detection logic.
| spec_decoding_stats = self.make_spec_decoding_stats( | ||
| spec_decoding_stats, | ||
| num_draft_tokens=num_draft_tokens, | ||
| num_accepted_tokens=min(len(new_token_ids), num_accepted), |
There was a problem hiding this comment.
The calculation of num_accepted_tokens for speculative decoding statistics is incorrect when grammar truncation occurs. len(new_token_ids) includes the 'bonus' token (the one sampled from the target model), while num_accepted represents the number of draft tokens accepted by the target model (calculated as len(generated_token_ids) - 1 at line 1380). If the grammar truncates the accepted tokens, len(new_token_ids) will be 1 (bonus) + num_actually_accepted_drafts. Thus, num_accepted_tokens should be calculated as min(len(new_token_ids) - 1, num_accepted) to correctly reflect the number of draft tokens that survived both target model verification and grammar validation. Using min(len(new_token_ids), num_accepted) will over-count the accepted draft tokens by 1 in most cases where truncation happens after the bonus token.
| num_accepted_tokens=min(len(new_token_ids), num_accepted), | |
| num_accepted_tokens=min(max(0, len(new_token_ids) - 1), num_accepted), |
| The index in ``delta_ids`` where the reasoning-end marker completes, | ||
| or ``None`` if the marker does not complete inside ``delta_ids``. | ||
| """ | ||
| current_input_ids = list(prefix_ids) |
There was a problem hiding this comment.
This line creates a full copy of prefix_ids (the entire request history) on every speculative decoding step for requests using structured output across a reasoning boundary. For long contexts (e.g., 128k tokens), this is_reasoning_end_streaming is called in a loop over delta_ids, the overall complexity is BaseThinkingReasoningParser provides an optimized override, other parsers using this fallback will suffer from poor scalability. Consider optimizing this by only passing a suffix of the prefix that is long enough to contain any potential reasoning-end marker, or by using a sequence wrapper that avoids physical concatenation.
|
This pull request has merge conflicts that must be resolved before it can be |
Purpose
This PR fixes the interaction between speculative decoding, reasoning parsers, and structured output.
When speculative decoding accepts multiple tokens in one scheduler step, the accepted token delta may cross the reasoning boundary. Tokens before the reasoning-end marker are still reasoning tokens and should not be constrained by the structured-output grammar, but tokens after the boundary are answer tokens and must be grammar-validated before being committed.
This PR adds an opt-in scheduler path that:
I’m not particularly familiar with the full scheduler or the spec path, so I’m not sure if this change will have any negative impact on the scheduler as a whole. I’d appreciate it if reviewer could take a look.
Relation to #36138
This PR is related to #36138, which addresses the same user-visible bug: when speculative decoding accepts a batch that contains the reasoning-end marker, grammar constraints may be skipped for the answer tokens that appear after that marker.
The two PRs differ mainly in where they solve the problem and how much of the speculative-decoding pipeline they change.
#36138 treats this as a general “mixed constrained/unconstrained speculative batch” problem. It splits draft/speculative token batches into unconstrained and constrained parts, then reuses that split across multiple grammar interaction points:
update_from_output();update_draft_token_ids();update_draft_token_ids_in_output();grammar_bitmask().It also makes
grammar_bitmask()reasoning-aware at the per-position level: positions before and including the reasoning-end marker are left unconstrained, while positions after the marker are grammar-constrained. To compute those later bitmasks, it may temporarily advance grammar state over scheduled speculative tokens and then roll the grammar back.This PR intentionally takes a narrower approach.
Instead of changing draft-token validation and bitmask generation, it handles the issue at the accepted-token commit point in
Scheduler.update_from_output(). At that point, the target model has already accepted the speculative tokens, and the scheduler is about to append them to the final request output. If the accepted token delta crosses the reasoning boundary, this PR validates only the post-boundary suffix, truncates rejected suffix tokens, and commits only the valid answer-phase suffix to the grammar state.The tradeoff is intentional: #36138 is a broader pipeline-level fix that attempts to make speculative grammar constraints correct earlier, including bitmask construction. This PR is a smaller, opt-in, commit-time validation fix. It does not try to produce mixed-position grammar masks for the same speculative batch; instead, it guarantees that answer-phase tokens are grammar-validated before they enter the final output and before grammar state is permanently advanced.
Why add new
ReasoningParsermethods?The scheduler should not hard-code model/parser-specific reasoning-end markers. Different reasoning parsers may use different marker shapes, including single-token markers, multi-token markers, or context-dependent streaming checks.
This PR adds parser-level APIs so the scheduler can ask the active parser:
This keeps boundary detection owned by the reasoning parser, while keeping the scheduler logic generic. The default implementation is conservative, and
BaseThinkingReasoningParserprovides an efficient override for single-token end markers.Why guard this with an env var?
The new path is both behavior-sensitive and performance-sensitive.
Behavior-wise, it changes a narrow combination of features: speculative decoding + reasoning parser + structured output + grammar disabled during reasoning. Parser implementations may not all have been validated with this speculative boundary path yet, especially if their reasoning-end markers are multi-token or context-dependent.
Performance-wise, this logic runs in the scheduler output path. Even with a cheap precheck, enabling it globally would add extra branching and parser calls to speculative decoding requests that use structured output. For common parsers with a single-token end marker, the added
may_have_reasoning_end_in_delta()check is cheap, but the generic fallback is intentionally conservative and may need to inspect the accepted delta more deeply. Keeping the feature behind an env var avoids adding this overhead to all users by default.I observed a significant performance drop when using a custom reasoning parser (with multiple tokens as reasoning ends), so I believe it would be better not to enable this behavior by default.
For that reason, the behavior is opt-in behind:
The default remains disabled to avoid unexpected regressions and unnecessary scheduler-path overhead while allowing users and CI to validate the behavior incrementally.
Test Plan
Run focused tests for reasoning-boundary detection and scheduler handling of speculative tokens across the reasoning boundary.
Test Result
Result:
AI assistance was used to help prepare this change and PR description. The human submitter reviewed the change and is responsible for validating the behavior end-to-end.