[Bugfix] Grammar was ignored when reasoning ended within speculated tokens by sfbemerk · Pull Request #36138 · vllm-project/vllm

sfbemerk · 2026-03-05T12:43:34Z

Purpose

This PR attempts to fix a bug (#31858, #34650) when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination: typically, grammar is not enabled during reasoning but only for the final answer. However, when the reasoning end token is generated, any subsequent draft tokens are not validated against the grammar, leading to an invalid final answer.

Test Plan

In general, the bug seems to be independent of the specific SpecDecode method; originally I had observed it with DeepSeek models and MTP, but for testing I recommend a smaller model like Qwen3-8B and using the same model as draft model. This way, we have high acceptance rates for our tests and a high likelihood that the original bug appears.

vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-8B","num_speculative_tokens":5}'

The test request should set response_format to json_schema and use a prompt that lures the model into generating something other than pure JSON, e.g.

Example payload:

{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}

The original bug can also be reproduced with Model Runner V2, and this bugfix works there as well. For testing, choose a different speculative method, since draft_model is not supported yet:

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"eagle3","model":"RedHatAI/Qwen3-8B-speculator.eagle3","num_speculative_tokens":5}'

The original bug is still present in vllm v0.17.1.

Test Result

Without the bugfix, the content field contains invalid JSON, e.g. because of markdown fences:

"content": "```json\n{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}```"

With the bugfix, the content field contains valid JSON that satisfies the requested grammar:

"content": "{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}"

I am happy to receive feedback and suggestions on how to improve the PR: the interplay of spec decode, grammar, reasoning, and async scheduling seems to be quite complex.

There had been several attempts to fix this bug before: my first attempt in #34241 would reject all speculated tokens in the step where reasoning_end was detected, which was working fine, but was suboptimal. #34978 started with a better approach that would validate all speculative tokens following reasoning_end, but contained some bugs in the end and was discontinued.
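The difference between the two outputs in the Test Result section can be checked mechanically. The following is a minimal sketch, not part of the PR: `is_valid_character` is a hypothetical helper that hand-checks the fields of the Character schema from the example payload using only the standard library, rather than running a full JSON Schema validator.

```python
import json

# Constraints copied by hand from the example payload's schema.
REQUIRED_FIELDS = {"name": str, "age": int, "role": str}
ALLOWED_ROLES = {"mage", "warrior", "healer"}


def is_valid_character(content: str) -> bool:
    """Return True if `content` is pure JSON matching the Character schema."""
    try:
        obj = json.loads(content)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for field, expected_type in REQUIRED_FIELDS.items():
        value = obj.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
    return obj["role"] in ALLOWED_ROLES


FENCE = "`" * 3  # markdown fence, built at runtime to keep this block readable
body = '{\n\n"name": "Eldrin the Flameheart",\n"age": 32,\n"role": "warrior"\n}'
without_fix = FENCE + "json\n" + body + FENCE
with_fix = body

print(is_valid_character(without_fix))  # False: the fences break JSON parsing
print(is_valid_character(with_fix))     # True
```

A check like this can be run over many responses to estimate how often the bug appears for a given configuration.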

gemini-code-assist

Code Review

This pull request fixes a bug where grammar constraints were not applied to speculative tokens generated after a reasoning-end marker. The changes correctly identify when reasoning ends within a batch of tokens and ensure that only post-reasoning tokens are validated against the grammar. The patch modifies the StructuredOutputManager to detect the reasoning end within token batches and provides new helper methods to split tokens accordingly. The Scheduler is updated to use this new logic when advancing the grammar and validating speculative tokens. The changes are supported by a comprehensive set of new unit tests. My main feedback is to refactor duplicated code in the Scheduler for better maintainability.

njhill · 2026-03-12T17:17:33Z

Thanks @sfbemerk. This appears to add quite a lot of code and complexity, would be good if we can find a much simpler fix.

sfbemerk · 2026-03-12T19:46:42Z

Hi @njhill , thanks for looking into this. Fully agreed - a simpler solution is always better.

The original issue is that in all places where draft tokens and grammar interact (update_from_output(), update_draft_token_ids(), update_draft_token_ids_in_output(), and grammar_bitmask()), it has always been an either-or decision: either the entire batch of speculative tokens is constrained, or none of it is. This simple approach reaches its limits when a reasoning_end token appears in the draft token batch, because all follow-up tokens in the same batch then need to be constrained.

My current approach is to split the batch of speculated tokens where reasoning_end is detected, let the first part pass through as unconstrained tokens, and validate the second part (the constrained tokens) against the grammar. I added a method identify_constrained_draft_tokens() that performs exactly this split (relying on the reasoning_parser.is_reasoning_end_streaming() method) and is reused in three places for consistent behavior.

If you have ideas to solve the underlying issue without such complexity, I am happy to follow your suggestions!
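The split described in this comment can be illustrated with a small sketch. This is not the PR's code: `split_at_reasoning_end` and `validate_draft_tokens` are hypothetical names, the reasoning-end marker is assumed to be a single token id, and the grammar is replaced by a toy set of allowed token ids.

```python
def split_at_reasoning_end(draft_token_ids, reasoning_end_id, reasoning_ended):
    """Split a draft batch into (unconstrained, constrained) parts."""
    if reasoning_ended:
        # Reasoning finished in an earlier step: the whole batch is constrained.
        return [], list(draft_token_ids)
    for i, tok in enumerate(draft_token_ids):
        if tok == reasoning_end_id:
            # The marker itself still belongs to the reasoning part;
            # everything after it must satisfy the grammar.
            return list(draft_token_ids[: i + 1]), list(draft_token_ids[i + 1:])
    # No marker in this batch: nothing is constrained yet.
    return list(draft_token_ids), []


def validate_draft_tokens(draft_token_ids, reasoning_end_id, reasoning_ended, allowed):
    """Keep unconstrained tokens plus the longest grammar-valid constrained prefix."""
    free, constrained = split_at_reasoning_end(
        draft_token_ids, reasoning_end_id, reasoning_ended
    )
    valid = []
    for tok in constrained:
        if tok not in allowed:  # toy stand-in for a real grammar check
            break
        valid.append(tok)
    return free + valid


THINK_END = 99          # hypothetical id of the reasoning-end token
ALLOWED = {10, 11, 12}  # hypothetical ids the toy grammar accepts

print(validate_draft_tokens([5, 99, 10, 7, 11], THINK_END, False, ALLOWED))
# [5, 99, 10]
```

In the example, token 7 is the first post-reasoning token the toy grammar rejects, so it and everything after it are dropped, while the reasoning tokens before the marker pass through untouched.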


sfbemerk · 2026-04-15T13:35:56Z

I rebased this pull request.
The bug is still present in v0.19.0, the fix still works.

@njhill It would be great if you could find some time to review the changes or think about a completely different approach.

danielwit-lb · 2026-04-24T14:48:21Z

Up, this is an important fix: without it, speculative decoding is effectively useless.

Sandermage · 2026-04-25T14:23:35Z

@sfbemerk — first, thank you for this PR. The clarity of your "spec_token_ids was overloaded with two different meanings" analysis on the parent issues (#34650 + #31858) was a turning point in our investigation — once that diagnosis was on the table, we could stop chasing model-output theories and start auditing the spec-decode + structured-output timing path. Backported on a Qwen3.6-35B-A3B-FP8 production rig (2× A5000, vLLM 0.19.2rc1.dev205+g07351e088) as part of the v7.13 multi-PR investigation. Wanted to share data and a small implementation note.

What we backported

All three of your changes:

grammar_bitmask reasoning-aware loop (computes reasoning_end_idx per request, applies bitmask only to post-reasoning positions)
New helpers: update_reasoning_ended(), validate_tokens_reasoning_aware(), identify_constrained_draft_tokens(), _find_reasoning_end_in_tokens()
Updated three call sites in scheduler.py (update_from_output, update_draft_token_ids, update_draft_token_ids_in_output)

Implementation note

Kept should_advance() alive as a no-op-equivalent (didn't delete it) to reduce blast radius for any external callers we might not have seen. Your PR removes it cleanly; ours keeps it as dead code for backport-safety. Worth noting in case anyone else does a backport.

Empirical impact on our setup

Standalone delta of P62 (your fix) on top of the GDN+ngram fix from #40738 was ~3% incremental clean rate (53% → 56% on n=30 reproducer). Smaller than expected — turned out our dominant residual mode was actually ngram acceptance bias toward XML-repeat patterns, not the structured-output reasoning timing your PR addresses.

Why the small delta in our specific case: we run enable_thinking=false in chat template, so <think></think> is empty in the prompt and should_fill_bitmask() returns True from step 0. Your PR's main payoff is when </think> arrives MID-spec-batch — that path doesn't fire for us. But the patch is still correct + useful for any setup that uses thinking + structured outputs together. The ~3% improvement we see is presumably from the implicit <tool_call> reasoning-end being detected mid-batch.

Backport reference

patch_62_structured_output_spec_decode_timing.py — opt-in text patch with anchor validation, drift marker (validate_tokens_reasoning_aware), auto-no-op if your PR lands upstream. Credit to you + @cicirori (#34650) in the docstring + CREDITS.md.

Hope the data point helps the review. Cleanly-implemented PR — anchors held perfectly against dev205, no hidden dependencies.

cjackal · 2026-05-07T09:49:43Z

It seems this PR needs a rebase after #41199.

adammoisa · 2026-05-22T13:29:14Z

Hi @sfbemerk — thanks for the original analysis here, it pointed us at the right area of the code.

We hit this bug hard on gpt-oss-120b + EAGLE3 + response_format: json_schema strict. Your fix works perfectly for single-token reasoning markers (we verified 0% prefix-bled on Qwen3 + MTP + your PR's HEAD across 150 production traces), but it doesn't catch multi-token markers like Harmony's <|channel|>final<|message|>, because the boundary detector scans only within spec_token_ids (the 2-5 token speculative batch) and the full Harmony sequence almost never lands inside one batch. We filed #43338 with the full reproducer and data.

I just opened #43424 with a generalization of your approach: same insight (validate reasoning-end-aware), but the validation moves pre-commit so it also handles the related failure mode where verifier bonus tokens are sampled without the grammar mask engaged (the mask is gated on reasoning_ended, which only flips True after the boundary is observed). Multi-token marker detection comes along naturally from the same helper. Credit you in the PR body — none of that is novel without the path you charted.

If you'd rather land your PR first and have us layer on top, happy to rework #43424 as a stacked PR instead. Either way, want to make sure your work gets the lineage it deserves. cc the maintainers since the gpt-oss case is biting users in production.
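The multi-token limitation described in this comment can be shown with a small sketch. It is an illustration only, not code from either PR: `MarkerDetector` is a hypothetical class and the token ids are made up. The idea is to carry the last few tokens of previous batches so that a marker spanning a batch boundary is still found.

```python
class MarkerDetector:
    """Detects a multi-token marker even when it spans several batches."""

    def __init__(self, marker):
        self.marker = list(marker)
        self.tail = []  # at most len(marker) - 1 tokens from earlier batches

    def feed(self, batch):
        """Return the index in `batch` of the first token after the marker,
        or None if the marker has not been completed by this batch."""
        window = self.tail + list(batch)
        m = len(self.marker)
        result = None
        for start in range(len(window) - m + 1):
            if window[start:start + m] == self.marker:
                # Translate the window position back into a batch index.
                result = start + m - len(self.tail)
                break
        keep = m - 1
        self.tail = window[-keep:] if keep > 0 else []
        return result


MARKER = [200, 201, 202]  # hypothetical ids of a three-token marker
detector = MarkerDetector(MARKER)

print(detector.feed([1, 2, 200]))         # None: marker only started
print(detector.feed([201, 202, 30, 31]))  # 2: tokens from index 2 are post-marker
```

A scan restricted to a single batch would miss the marker in both calls, since neither batch contains all three marker tokens on its own.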

mergify · 2026-05-23T06:51:19Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sfbemerk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Activates the pre-commit grammar filter added in the previous commit. Before _update_request_with_output appends new tokens to the request, the scheduler calls StructuredOutputManager.precommit_filter_tokens and, when any trailing tokens are rejected, truncates new_token_ids and decrements num_computed_tokens and num_output_placeholders by the rejected count. This mirrors the existing path for verifier-rejected speculative tokens (see lines 1361-1373 in Scheduler.update_from_output).

The bug this addresses: in speculative decoding with a reasoning parser, the grammar bitmask is gated on reasoning_ended (see StructuredOutputManager.should_fill_bitmask). The boundary step's bonus tokens are sampled WITHOUT the mask engaged: reasoning_ended is False at sample time and only flips True after the boundary is observed. The existing should_advance deferral correctly suppresses the post-commit accept_tokens call on the boundary step (avoiding spurious FINISHED_ERROR), but the bonus tokens are already in the response stream as garbage prefixes before the valid JSON.

Repro: gpt-oss-120b + EAGLE3 + response_format json_schema strict shows the failure on ~54% of requests on main HEAD with the Harmony multi-token marker. With this patch it drops to <5% in our 300-trace shadow.

Refs: vllm-project#36138 (single-token version of this fix by sfbemerk), vllm-project#43338.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adam Moisa <adammoisa@gmail.com>
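The truncate-and-decrement step in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual scheduler code: `ToyRequest`, `precommit_filter`, and `commit` are hypothetical names, the grammar is a plain callable, and all new tokens are assumed to lie after the reasoning boundary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the scheduler's per-request state."""
    output_token_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0
    num_output_placeholders: int = 0


def precommit_filter(new_token_ids, accepts):
    """Return the longest prefix of new_token_ids that the grammar accepts."""
    kept = []
    for tok in new_token_ids:
        if not accepts(kept, tok):
            break
        kept.append(tok)
    return kept


def commit(request, new_token_ids, accepts):
    """Filter tokens before appending them, then roll back the counters."""
    kept = precommit_filter(new_token_ids, accepts)
    rejected = len(new_token_ids) - len(kept)
    if rejected:
        # Rejected trailing tokens are treated like verifier-rejected drafts.
        request.num_computed_tokens -= rejected
        request.num_output_placeholders -= rejected
    request.output_token_ids.extend(kept)
    return rejected


req = ToyRequest(num_computed_tokens=10, num_output_placeholders=3)
toy_grammar = lambda prefix, tok: tok < 100  # toy rule: ids below 100 are valid
print(commit(req, [1, 2, 500, 3], toy_grammar))  # 2
print(req.output_token_ids, req.num_computed_tokens, req.num_output_placeholders)
# [1, 2] 8 1
```

Filtering before the append means invalid tokens never reach the response stream, which is the difference from validating after the tokens have already been committed.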

sfbemerk · 2026-05-29T09:26:11Z

Thanks, @adammoisa for improving the fix. I don't mind having your implementation merged instead. I am more interested in getting ANY fix for the issue merged ;-)

mergify Bot added the structured-output, v1, and bug (Something isn't working) labels Mar 5, 2026

github-project-automation Bot added this to Structured Output Mar 5, 2026

gemini-code-assist Bot reviewed Mar 5, 2026

Comment thread vllm/v1/core/sched/scheduler.py Outdated

sfbemerk marked this pull request as ready for review March 5, 2026 14:30

sfbemerk requested review from ApostaC, WoosukKwon, aarnphm, alexm-redhat, benchislett, heheda12345, mgoin, njhill, orozery, robertgshaw2-redhat, russellb and ywang96 as code owners March 5, 2026 14:30

This was referenced Mar 11, 2026

Bug: Speculative Decoding (MTP) Causes </think> Detection Failure in Structured Output + Reasoning Mode #34650

Open

[Bug]: GLM-5-FP8 malformed tool calls #34449

Closed

sfbemerk mentioned this pull request Apr 8, 2026

[Bug]: Qwen3.5 structured output doesn't work #35700

Open

1 task

Benjamin Merkel added 6 commits April 14, 2026 14:28

add tests that reproduce current bug: grammar not applied to post-rea…

7e8f542

…soning speculated tokens Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>

fix structured output when reasoning ends in speculative tokens

0d5e561

Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>

add test case for should_advance

f8dc36c

Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>

extract _validate_spec_tokens_with_reasoning as common method, follow…

294185f

…ing review suggestion Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>

fix pre-commit

8a87c05

Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>

Fix edge case; refactoring.

94f4dc2

Previously, a reasoning_end draft token appearing mid-batch would skip grammar constraints if reasoning_ended=True had already been set in a previous step. Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>

sfbemerk force-pushed the bugfix/specdecode-grammar-reasoning-new branch from 6d5e65b to 94f4dc2 Compare April 15, 2026 13:33

Sandermage mentioned this pull request Apr 25, 2026

[Bug]: TurboQuant KV × any speculative decoding (MTP or ngram) produces degenerate token loops — confirmed across dense and hybrid attention #40831

Closed

Sandermage mentioned this pull request Apr 25, 2026

[Bug]: ngram speculative decoding default prompt_lookup_min=2 causes tool-call output corruption on Qwen3-class models with structured output (config-only fix: prompt_lookup_min=8) #40875

Open

noonghunna mentioned this pull request Apr 25, 2026

[Bug]: MTP × TurboQuant × CUDA graph capture produces degenerate output on Qwen3-Next hybrid (not closed by v7.13 ngram fix tree) #40880

Closed

liuyanyi mentioned this pull request Apr 27, 2026

[Bugfix] Validate post-reasoning structured output tokens in spec decode #40962

Open

This was referenced May 1, 2026

[Bugfix] Fix Qwen3Coder prev_tool_call_arr double-emission on parse failure #41466

Draft

[Bugfix] Detect MTP truncation at reasoning-to-tool-call boundary #41467

Draft

fix(spec decode): suppress EOS at draft positions in rejection sampler #41493

Draft

itachiCheng mentioned this pull request May 13, 2026

[Bug]: Deploying minimax-m2.7 with the quay.io/ascend/vllm-ascend:v0.18.0rc1 image: Claude Code hits a tool-calling problem, "invalid tool parameters" vllm-project/vllm-ascend#9076

Open

mergify Bot added the needs-rebase label May 23, 2026

oneraghavan mentioned this pull request May 24, 2026

[Bugfix] Fix reasoning end token missed by should_advance under async scheduling + spec decode #43526

Closed

sfbemerk wants to merge 6 commits into vllm-project:main from sfbemerk:bugfix/specdecode-grammar-reasoning-new

6 participants
