
[Bugfix] Grammar was ignored when reasoning ended within speculated tokens#36138

Open
sfbemerk wants to merge 6 commits into
vllm-project:mainfrom
sfbemerk:bugfix/specdecode-grammar-reasoning-new
Conversation

@sfbemerk
Contributor

@sfbemerk sfbemerk commented Mar 5, 2026

Purpose

This PR attempts to fix a bug (#31858, #34650) when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination: typically, grammar is not enabled during reasoning but only for the final answer. However, when the reasoning end token is generated, any subsequent draft tokens are not validated against the grammar, leading to an invalid final answer.

Test Plan

In general, the bug seems to be independent of the specific SpecDecode method; originally I had observed it with DeepSeek models and MTP, but for testing I recommend a smaller model like Qwen3-8B and using the same model as draft model. This way, we have high acceptance rates for our tests and a high likelihood that the original bug appears.

vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-8B","num_speculative_tokens":5}'

The test request should set response_format=json_schema and use a prompt that lures the model into generating something other than pure JSON, e.g.

Example payload:

{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}

The original bug can also be reproduced with Model Runner V2, and this bugfix works there as well. For testing, you should choose a different speculative method (since draft_model is not supported there yet):

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"eagle3","model":"RedHatAI/Qwen3-8B-speculator.eagle3","num_speculative_tokens":5}'

The original bug is still present in vllm v0.17.1.

Test Result

Without the bugfix, the content field contains invalid JSON, e.g. because of markdown fences:

"content": "```json\n{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}```"

With the bugfix, the content field contains valid JSON that satisfies the requested grammar:

"content": "{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}"
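To check a response automatically, the two content values above can be run through a small validator. This is a minimal sketch using only the standard library; the helper name `is_valid_hero` is hypothetical, and it hand-codes the checks for this one schema (required keys, types, role enum) rather than using a general JSON Schema validator.

```python
import json

# Allowed values of the CharacterRole enum in the example schema.
ROLES = {"mage", "warrior", "healer"}


def is_valid_hero(content: str) -> bool:
    """Return True if `content` is pure JSON that matches the hero schema."""
    try:
        obj = json.loads(content)
    except json.JSONDecodeError:
        # Markdown fences or any other prefix/suffix make the string invalid JSON.
        return False
    if not isinstance(obj, dict):
        return False
    age = obj.get("age")
    return (
        isinstance(obj.get("name"), str)
        and isinstance(age, int)
        and not isinstance(age, bool)  # bool is a subclass of int in Python
        and obj.get("role") in ROLES
    )


# Build the fence from single backticks so this snippet stays fence-safe.
FENCE = "`" * 3
BODY = '{\n\n"name": "Eldrin the Flameheart",\n"age": 32,\n"role": "warrior"\n}'

without_fix = FENCE + "json\n" + BODY + FENCE
with_fix = BODY

print(is_valid_hero(without_fix), is_valid_hero(with_fix))
```

Running the same check over many responses gives a simple pass rate for comparing a build with and without the fix.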

I am happy to receive feedback and suggestions on how to improve the PR: the interplay of spec decode, grammar, reasoning, and async scheduling seems to be quite complex.

Related

There had been several attempts to fix this bug before: my first attempt in #34241 would reject all speculated tokens in the step where reasoning_end was detected, which was working fine, but was suboptimal. #34978 started with a better approach that would validate all speculative tokens following reasoning_end, but contained some bugs in the end and was discontinued.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request fixes a bug where grammar constraints were not applied to speculative tokens generated after a reasoning-end marker. The changes correctly identify when reasoning ends within a batch of tokens and ensure that only post-reasoning tokens are validated against the grammar. The patch modifies the StructuredOutputManager to detect the reasoning end within token batches and provides new helper methods to split tokens accordingly. The Scheduler is updated to use this new logic when advancing the grammar and validating speculative tokens. The changes are supported by a comprehensive set of new unit tests. My main feedback is to refactor duplicated code in the Scheduler for better maintainability.

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@njhill
Member

njhill commented Mar 12, 2026

Thanks @sfbemerk. This appears to add quite a lot of code and complexity, would be good if we can find a much simpler fix.

@sfbemerk
Contributor Author

sfbemerk commented Mar 12, 2026

Hi @njhill, thanks for looking into this. Fully agreed: a simpler solution is always better.

The original issue is that in all places where draft tokens and grammar interact (in update_from_output(), update_draft_token_ids(), and update_draft_token_ids_in_output(), as well as in grammar_bitmask()), it has always been an either-or decision: either the entire batch of speculative tokens is constrained or none of it is. This simple approach reaches its limits when a reasoning_end token appears inside the draft token batch, because all follow-up tokens in the same batch then need to be constrained.

My current approach is: split the batch of speculated tokens where reasoning_end is detected, then let the first part pass through as unconstrained tokens, while the second part (the constrained tokens) is validated by the grammar. I added a method identify_constrained_draft_tokens() which performs exactly this split (relying on the reasoning_parser.is_reasoning_end_streaming() method) and which is reused in three places for consistent behavior.
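The split described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `split_draft_tokens`, `validate_draft`, and the `accepts` callback are hypothetical names, and reasoning end is modeled as a single token id instead of a call to the reasoning parser.

```python
from typing import Callable, List, Tuple


def split_draft_tokens(
    draft: List[int], reasoning_ended: bool, reasoning_end_id: int
) -> Tuple[List[int], List[int]]:
    """Split a draft batch into (unconstrained, constrained) parts."""
    if reasoning_ended:
        # Reasoning finished in an earlier step: everything is constrained.
        return [], list(draft)
    if reasoning_end_id in draft:
        idx = draft.index(reasoning_end_id)
        # The end marker itself still belongs to the unconstrained part.
        return draft[: idx + 1], draft[idx + 1 :]
    # Still reasoning: nothing is constrained.
    return list(draft), []


def validate_draft(
    draft: List[int],
    reasoning_ended: bool,
    reasoning_end_id: int,
    accepts: Callable[[List[int], int], bool],
) -> List[int]:
    """Keep the unconstrained part and the longest grammar-valid prefix of the rest."""
    unconstrained, constrained = split_draft_tokens(
        draft, reasoning_ended, reasoning_end_id
    )
    valid = list(unconstrained)
    answer_prefix: List[int] = []
    for tok in constrained:
        if not accepts(answer_prefix, tok):
            break  # first grammar violation ends the accepted draft
        answer_prefix.append(tok)
        valid.append(tok)
    return valid


# Toy grammar (assumption): the answer must be the token sequence 100, 101, 102, ...
toy_accepts = lambda prefix, tok: tok == 100 + len(prefix)

print(validate_draft([1, 2, 99, 100, 101, 5], False, 99, toy_accepts))
```

In this toy run the reasoning tokens and the end marker pass through untouched, the two grammar-conforming tokens are kept, and the trailing token that violates the grammar is dropped.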

If you have ideas to solve the underlying issue without such complexity, I am happy to follow your suggestions!

Benjamin Merkel added 6 commits April 14, 2026 14:28
…soning speculated tokens

Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
…ing review suggestion

Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Previously, a reasoning_end draft token appearing mid-batch would skip grammar constraints if reasoning_ended=True had already been set in a previous step.

Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
@sfbemerk sfbemerk force-pushed the bugfix/specdecode-grammar-reasoning-new branch from 6d5e65b to 94f4dc2 Compare April 15, 2026 13:33
@sfbemerk
Contributor Author

I rebased this pull request.
The bug is still present in v0.19.0, the fix still works.

@njhill It would be great if you could find some time to review the changes or think about a completely different approach.

@danielwit-lb

Up. This is an important fix; without it, speculative decoding is effectively useless.

@Sandermage
Contributor

@sfbemerk — first, thank you for this PR. The clarity of your "spec_token_ids was overloaded with two different meanings" analysis on the parent issues (#34650 + #31858) was a turning point in our investigation — once that diagnosis was on the table, we could stop chasing model-output theories and start auditing the spec-decode + structured-output timing path. Backported on a Qwen3.6-35B-A3B-FP8 production rig (2× A5000, vLLM 0.19.2rc1.dev205+g07351e088) as part of the v7.13 multi-PR investigation. Wanted to share data and a small implementation note.

What we backported

All three of your changes:

  • grammar_bitmask reasoning-aware loop (computes reasoning_end_idx per request, applies bitmask only to post-reasoning positions)
  • New helpers: update_reasoning_ended(), validate_tokens_reasoning_aware(), identify_constrained_draft_tokens(), _find_reasoning_end_in_tokens()
  • Updated three call sites in scheduler.py (update_from_output, update_draft_token_ids, update_draft_token_ids_in_output)
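For readers unfamiliar with the bitmask side, the "apply the bitmask only to post-reasoning positions" idea from the first bullet can be sketched as below. The function name `constrained_positions` is hypothetical, reasoning end is modeled as a single token id, and the layout of one mask position per speculative token plus one for the bonus token is an assumption made for this illustration.

```python
from typing import List, Optional


def constrained_positions(
    spec_tokens: List[int], reasoning_ended: bool, reasoning_end_id: int
) -> List[bool]:
    """Return one flag per sampling position: True where the grammar mask applies.

    Position i is the position at which spec_tokens[i] is verified; the final
    extra position is the bonus token sampled after all draft tokens.
    """
    flags: List[bool] = []
    ended = reasoning_ended
    positions: List[Optional[int]] = list(spec_tokens) + [None]
    for tok in positions:
        # The mask for this position depends only on what came before it.
        flags.append(ended)
        if tok == reasoning_end_id:
            ended = True  # every later position is part of the final answer
    return flags


print(constrained_positions([1, 99, 5, 6], False, 99))
```

The key point of the sketch is that the decision is made per position rather than once per request, so a batch that contains the reasoning-end token gets a mix of masked and unmasked positions.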

Implementation note

Kept should_advance() alive as a no-op-equivalent (didn't delete it) to reduce blast radius for any external callers we might not have seen. Your PR removes it cleanly; ours keeps it as dead code for backport-safety. Worth noting in case anyone else does a backport.

Empirical impact on our setup

Standalone delta of P62 (your fix) on top of the GDN+ngram fix from #40738 was ~3% incremental clean rate (53% → 56% on n=30 reproducer). Smaller than expected — turned out our dominant residual mode was actually ngram acceptance bias toward XML-repeat patterns, not the structured-output reasoning timing your PR addresses.

Why the small delta in our specific case: we run enable_thinking=false in chat template, so <think></think> is empty in the prompt and should_fill_bitmask() returns True from step 0. Your PR's main payoff is when </think> arrives MID-spec-batch — that path doesn't fire for us. But the patch is still correct + useful for any setup that uses thinking + structured outputs together. The ~3% improvement we see is presumably from the implicit <tool_call> reasoning-end being detected mid-batch.

Backport reference

patch_62_structured_output_spec_decode_timing.py — opt-in text patch with anchor validation, drift marker (validate_tokens_reasoning_aware), auto-no-op if your PR lands upstream. Credit to you + @cicirori (#34650) in the docstring + CREDITS.md.

Hope the data point helps the review. Cleanly-implemented PR — anchors held perfectly against dev205, no hidden dependencies.

@cjackal
Contributor

cjackal commented May 7, 2026

It seems this PR needs a rebase after #41199.

@adammoisa

Hi @sfbemerk — thanks for the original analysis here, it pointed us at the right area of the code.

We hit this bug hard on gpt-oss-120b + EAGLE3 + response_format: json_schema strict. Your fix works perfectly for single-token reasoning markers (we verified 0% prefix-bled on Qwen3 + MTP + your PR's HEAD across 150 production traces), but it doesn't catch multi-token markers like Harmony's <|channel|>final<|message|>, because the boundary detector scans only within spec_token_ids (the 2-5 token speculative batch) and the full Harmony sequence almost never lands inside one batch. We filed #43338 with the full reproducer and data.
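To illustrate why scanning only the speculative batch misses multi-token markers, here is a small sketch of a detector that also looks at the tail of the already committed tokens. The helper name `find_marker_end` and the integer token ids are hypothetical; this is not code from either PR.

```python
from typing import List, Optional


def find_marker_end(
    committed: List[int], batch: List[int], marker: List[int]
) -> Optional[int]:
    """Return the index in `batch` of the marker's last token, or None.

    The marker may start in `committed` and finish in `batch`, so the scan
    window is the last len(marker) - 1 committed tokens plus the batch.
    """
    m = len(marker)
    tail = committed[-(m - 1):] if m > 1 else []
    window = list(tail) + list(batch)
    offset = len(tail)
    for start in range(len(window) - m + 1):
        if window[start : start + m] == list(marker):
            # The tail holds fewer than m tokens, so any full match
            # necessarily ends inside the batch.
            return start + m - 1 - offset
    return None


# Assumed token ids for a three-token marker such as <|channel|>final<|message|>.
MARKER = [10, 11, 12]

# The marker starts in the committed tokens and ends inside the batch.
print(find_marker_end([1, 2, 10], [11, 12, 50, 51], MARKER))  # index of token 12
# Scanning the batch alone (no committed history) finds nothing.
print(find_marker_end([], [11, 12, 50, 51], MARKER))
```

Tokens after the returned index are the ones that would need grammar validation; a single-token marker is just the special case where no committed history is needed.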

I just opened #43424 with a generalization of your approach: same insight (validate reasoning-end-aware), but the validation moves pre-commit so it also handles the related failure mode where verifier bonus tokens are sampled without the grammar mask engaged (the mask is gated on reasoning_ended, which only flips True after the boundary is observed). Multi-token marker detection comes along naturally from the same helper. I credit you in the PR body; none of that would have been possible without the path you charted.

If you'd rather land your PR first and have us layer on top, happy to rework #43424 as a stacked PR instead. Either way, want to make sure your work gets the lineage it deserves. cc the maintainers since the gpt-oss case is biting users in production.

@mergify
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sfbemerk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
adammoisa added a commit to adammoisa/vllm that referenced this pull request May 27, 2026
Activates the pre-commit grammar filter added in the previous
commit. Before _update_request_with_output appends new tokens to
the request, the scheduler calls
StructuredOutputManager.precommit_filter_tokens and, when any
trailing tokens are rejected, truncates new_token_ids and
decrements num_computed_tokens and num_output_placeholders by the
rejected count. This mirrors the existing path for
verifier-rejected speculative tokens (see lines 1361-1373 in
Scheduler.update_from_output).

The bug this addresses: in speculative decoding with a reasoning
parser, the grammar bitmask is gated on reasoning_ended (see
StructuredOutputManager.should_fill_bitmask). The boundary step's
bonus tokens are sampled WITHOUT the mask engaged — reasoning_ended
is False at sample time and only flips True after the boundary is
observed. The existing should_advance deferral correctly suppresses
the post-commit accept_tokens call on the boundary step (avoiding
spurious FINISHED_ERROR), but the bonus tokens are already in the
response stream as garbage prefixes before the valid JSON.

Repro: gpt-oss-120b + EAGLE3 + response_format json_schema strict
shows the failure on ~54% of requests on main HEAD with the
Harmony multi-token marker. With this patch it drops to <5% in our
300-trace shadow.

Refs: vllm-project#36138 (single-token version of this fix by sfbemerk), vllm-project#43338.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adam Moisa <adammoisa@gmail.com>
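The truncate-and-decrement bookkeeping described in the commit message above can be sketched as follows. This is an assumption-laden illustration: `ToyRequest` and `precommit_truncate` are hypothetical stand-ins, only the field names num_computed_tokens and num_output_placeholders come from the commit message, and the number of grammar-valid tokens is passed in rather than computed by a real grammar.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the scheduler's per-request state."""

    num_computed_tokens: int
    num_output_placeholders: int
    output_token_ids: List[int] = field(default_factory=list)


def precommit_truncate(
    request: ToyRequest, new_token_ids: List[int], num_valid: int
) -> List[int]:
    """Drop trailing tokens the grammar rejected before they are committed."""
    rejected = len(new_token_ids) - num_valid
    if rejected > 0:
        new_token_ids = new_token_ids[:num_valid]
        # Roll back the counters by the number of rejected tokens.
        request.num_computed_tokens -= rejected
        request.num_output_placeholders -= rejected
    # Only the surviving tokens reach the request's output.
    request.output_token_ids.extend(new_token_ids)
    return new_token_ids


req = ToyRequest(num_computed_tokens=20, num_output_placeholders=4)
kept = precommit_truncate(req, [7, 8, 9, 10], num_valid=2)
print(kept, req.num_computed_tokens, req.num_output_placeholders)
```

Because the filter runs before the tokens are appended, rejected tokens never appear in the response stream, which is the difference from validating after commit.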
@sfbemerk
Contributor Author

Thanks, @adammoisa, for improving the fix. I don't mind having your implementation merged instead. I am more interested in getting ANY fix for the issue merged ;-)


6 participants