Skip to content

[Bugfix] Fix reasoning end token missed by should_advance under async scheduling + spec decode#43526

Closed
oneraghavan wants to merge 1 commit into
vllm-project:mainfrom
oneraghavan:fix/structured-output-reasoning-end-spec-decode
Closed

[Bugfix] Fix reasoning end token missed by should_advance under async scheduling + spec decode#43526
oneraghavan wants to merge 1 commit into
vllm-project:mainfrom
oneraghavan:fix/structured-output-reasoning-end-spec-decode

Conversation

@oneraghavan
Copy link
Copy Markdown
Contributor

@oneraghavan oneraghavan commented May 24, 2026

Purpose

Fixes #43388, #34650

When async scheduling and speculative decoding are both enabled, should_advance() misses the reasoning end token (e.g. </think>) because placeholder arithmetic produces an empty delta window.

Root Cause

should_advance() computes a delta slice of recently-generated tokens via:

delta_from = request.num_computed_tokens - request.num_output_placeholders
delta = all_token_ids[delta_from:]

Under async scheduling + spec decode:

  1. num_computed_tokens is pre-incremented by _update_after_schedule (before GPU runs)
  2. Adjusted down for rejected spec tokens in update_from_output
  3. num_output_placeholders follows a parallel increment/decrement cycle
  4. After token append + placeholder decrement, the arithmetic can produce delta_from == len(all_token_ids), yielding an empty delta

The reasoning end token sits in all_token_ids but the delta window walks past it. reasoning_ended never becomes True, and JSON grammar constraints are never applied.

Observed in production (from issue #43388):

new_token_ids = [9, 198, 248069, 271]   # 248069 = </think>
start = 5975                              # one past the end token at index 5974
delta = []                                # empty!

Fix

Add an optional new_token_ids parameter to should_advance(). At the token-output call site in update_from_output (the only path where new_token_ids is naturally available), pass the actual tokens directly. This bypasses the fragile placeholder arithmetic entirely.

The fallback path (draft-token call sites in update_draft_token_ids / update_draft_token_ids_in_output) is unchanged — those rely on reasoning_ended already being set by the prior update_from_output call.

Changes

File Change
vllm/v1/structured_output/__init__.py Add new_token_ids param to should_advance(); use it as delta when provided
vllm/v1/core/sched/scheduler.py Pass new_token_ids at the update_from_output call site
tests/v1/structured_output/test_reasoning_structured_output.py 4 new unit tests covering the bug and fix
tests/entrypoints/llm/test_struct_output_generate.py New E2E matrix entry: Qwen3 + spec decode + async scheduling

Test Plan

Unit tests (4 new, all pass locally):

  • test_should_advance_with_new_token_ids_detects_reasoning_end — end token found via new_token_ids regardless of placeholder state
  • test_should_advance_async_spec_decode_empty_delta_misses_end_tokenreproduces the exact bug (empty delta), then shows new_token_ids fixes it
  • test_should_advance_new_token_ids_structural_tag_spec_decode — structural tag + spec decode same-step True (no regression)
  • test_should_advance_new_token_ids_no_end_token — negative case: no end token → reasoning_ended stays False

E2E test (new matrix entry):

  • Qwen/Qwen3-1.7B + qwen3 reasoning parser + ngram spec decode + async_scheduling=True — the exact combination that triggers the bug

Existing tests (all 13 pass):

tests/v1/structured_output/test_reasoning_structured_output.py — 13 passed

Related Work

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the reliability of reasoning end token detection in structured output by passing new_token_ids directly to the should_advance method. This update avoids potential issues with placeholder arithmetic during asynchronous scheduling and speculative decoding. The changes include new unit tests covering various scenarios and an additional test case for Qwen3 models. I have no further feedback to provide as there were no review comments.

@oneraghavan oneraghavan force-pushed the fix/structured-output-reasoning-end-spec-decode branch from 9683efb to 14c6e4d Compare May 24, 2026 13:14
… scheduling + spec decode

Fixes vllm-project#43388, vllm-project#34650

When async scheduling and speculative decoding are both enabled,
should_advance() reconstructs the new tokens delta via placeholder
arithmetic (num_computed_tokens - num_output_placeholders). After spec
decode rejection adjustments and token appending, this arithmetic can
produce start == len(all_token_ids), yielding an empty delta that misses
the reasoning end token (e.g. </think>). As a result, reasoning_ended
never flips to True and JSON grammar constraints are never applied.

Fix: add an optional new_token_ids parameter to should_advance(). When
the caller has the actual new tokens (the token-output path in
update_from_output), pass them directly so the method checks them
instead of re-deriving the delta from counter arithmetic. The fallback
path (draft-token call sites) is unchanged.

Signed-off-by: Raghavan <oneraghavan@gmail.com>
@oneraghavan oneraghavan force-pushed the fix/structured-output-reasoning-end-spec-decode branch from 14c6e4d to b5bb916 Compare May 24, 2026 13:18
@huanghuan3
Copy link
Copy Markdown

huanghuan3 commented May 25, 2026

new_token_ids=[248069, 271, 71093]
248069 = </think>
71093 may correspond to the beginning of ```json {} ```

After changing the delta window to include the full new_token_ids batch, vLLM can correctly set reasoning_ended=True. However, any tokens generated after in that same batch were still sampled before the JSON grammar constraint became active.

This means the fix can make apply=True from the following step onward, but it cannot retroactively constrain, reject, or remove the already-generated post-thinking tokens. If those tokens include orjson, the final content may still contain Markdown fences.

There is also a related grammar state synchronization issue. If the first JSON token after has already been generated in the same speculative batch but was not accepted by the grammar FSM, the next constrained step may still expect the initial {, causing duplicated opening braces such as:

{{

how about this problem?

@oneraghavan
Copy link
Copy Markdown
Contributor Author

Closing this PR in favor of #43424, which takes a more comprehensive approach.

Our fix here addresses the detection side (passing new_token_ids directly to should_advance to avoid the fragile placeholder arithmetic), but it doesn't handle the case where </think> lands mid-batch in a speculative decode step — the post-reasoning tokens in the same batch escape grammar validation entirely, as correctly pointed out by @huanghuan3.

#43424 solves this with a pre-commit filter (precommit_filter_tokens) that intercepts the token batch before _update_request_with_output, splits at the reasoning boundary, and validates/truncates post-boundary tokens. This approach:

  • Handles both single-token and multi-token reasoning markers
  • Composes cleanly with the existing should_advance deferral logic
  • Has been production-tested on multiple model/spec-decode configurations
  • Is additive (+144 lines, 0 deletions) with no existing API changes

Credit to @sfbemerk for the original analysis in #36138 that charted the path for all of these fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working structured-output v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: json_object structured output is not enforced after Qwen thinking because reasoning end token is missed with async scheduling + spec decode

2 participants