[Bugfix] Grammar ignored when reasoning ends within speculated tokens#34241

Closed
sfbemerk wants to merge 3 commits into
vllm-project:mainfrom
sfbemerk:bugfix/specdecode-grammar-reasoning-pr

@sfbemerk

@sfbemerk sfbemerk commented Feb 10, 2026

Purpose

This PR attempts to fix a bug (#31858) that occurs when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination. Typically, the grammar is not enforced during reasoning but only for the final answer. However, when the reasoning end token is generated within a batch of speculated tokens, the draft tokens that follow it are not validated against the grammar, which leads to an invalid final answer.
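To make the failure mode concrete, here is a minimal toy model of it. This is not vLLM's scheduler code: the function names, the `</think>` marker, and the one-rule "grammar" are all assumptions chosen for illustration. It contrasts a check that consults the reasoning flag once per step with one that re-evaluates it after every draft token.

```python
from typing import Callable, List, Tuple

# Assumed end-of-reasoning marker; real models use their own token.
END_OF_REASONING = "</think>"

GrammarCheck = Callable[[List[str], str], bool]


def toy_grammar_allows(answer_so_far: List[str], token: str) -> bool:
    """Toy stand-in for a grammar: the answer must start with '{'."""
    if not answer_so_far:
        return token == "{"
    return True


def accept_per_token(
    draft: List[str], in_reasoning: bool, grammar_allows: GrammarCheck
) -> Tuple[List[str], bool]:
    """Re-check the reasoning state after every draft token."""
    accepted: List[str] = []
    answer: List[str] = []
    for token in draft:
        if in_reasoning:
            # Reasoning tokens are unconstrained.
            accepted.append(token)
            if token == END_OF_REASONING:
                in_reasoning = False
            continue
        if not grammar_allows(answer, token):
            break  # reject this token and everything after it
        answer.append(token)
        accepted.append(token)
    return accepted, in_reasoning


def accept_once_per_step(
    draft: List[str], in_reasoning: bool, grammar_allows: GrammarCheck
) -> List[str]:
    """Hypothetical buggy variant: the flag is only read at step start."""
    if in_reasoning:
        return list(draft)  # tokens after the end marker slip through
    return accept_per_token(draft, False, grammar_allows)[0]


draft = ["so", END_OF_REASONING, "Sure,", "{"]
print(accept_once_per_step(draft, True, toy_grammar_allows))
print(accept_per_token(draft, True, toy_grammar_allows))
```

In the toy, the once-per-step variant accepts the non-JSON token `Sure,` because reasoning was still active when the step began, while the per-token variant stops right after the end marker.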

Test Plan

In general, the bug seems to be independent of the specific SpecDecode method; originally I had observed it with DeepSeek models and MTP, but for testing I recommend a smaller model like Qwen3-8B and using the same model as draft model. This way, we have high acceptance rates for our tests and a high likelihood that the original bug appears.

vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-8B","num_speculative_tokens":5}'

The test request should set response_format to json_schema and use a prompt that lures the model into generating something other than pure JSON, e.g.

example payload

{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}

Test Result

Without the bugfix, the content field contains invalid JSON, e.g. because of markdown fences:

"content": "```json\n{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}```"

With the bugfix, the content field contains valid JSON that satisfies the requested grammar:

"content": "{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}"
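The difference between the two results can be checked mechanically. The sketch below is a hand-rolled check using only the standard library, not the test added in this PR; `is_valid_hero` is a hypothetical helper that mirrors the required fields and the role enum from the example schema.

```python
import json

# Role enum copied from the example schema above.
ALLOWED_ROLES = {"mage", "warrior", "healer"}


def is_valid_hero(content: str) -> bool:
    """Return True if content is pure JSON matching the example schema."""
    try:
        hero = json.loads(content)
    except json.JSONDecodeError:
        return False  # e.g. markdown fences around the object
    if not isinstance(hero, dict):
        return False
    age = hero.get("age")
    return (
        isinstance(hero.get("name"), str)
        # bool is a subclass of int in Python, so exclude it explicitly
        and isinstance(age, int)
        and not isinstance(age, bool)
        and hero.get("role") in ALLOWED_ROLES
    )


# Build the markdown fence at runtime so this snippet stays fence-safe.
FENCE = "`" * 3
BODY = "{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}"

without_fix = FENCE + "json\n" + BODY + FENCE
with_fix = BODY

print(is_valid_hero(without_fix))
print(is_valid_hero(with_fix))
```

A full JSON Schema validator would be the more thorough choice; the manual check is used here only to keep the example free of third-party dependencies.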

I am happy to receive feedback and suggestions on how to improve the PR: the interplay of spec decode, grammar, reasoning, and async scheduling seems to be quite complex. I found the first commits with bugfix attempts in the vllm-chutes fork but had to make a few more additions.

@mergify mergify Bot added structured-output v1 bug Something isn't working labels Feb 10, 2026
@sfbemerk sfbemerk changed the title [Bugfix] Grammar ignored when reasoning ends in speculated tokens [Bugfix] Grammar ignored when reasoning ends within speculated tokens Feb 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request effectively addresses a complex bug involving the interaction of speculative decoding, reasoning, and structured output. The new test case, test_reasoning_spec_decode_grammar_comprehensive, is well-structured and crucial for validating the fix across various scenarios. The logic introduced to manage speculative tokens and apply grammar constraints during the reasoning-to-structured-output transition appears sound and robust. However, there are several instances where full copies of request.all_token_ids are created, which could lead to significant performance overhead and increased memory usage, especially for long sequences. Optimizing these operations to avoid unnecessary list copying would be a critical improvement.

Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/structured_output/__init__.py Outdated
Comment thread vllm/v1/structured_output/__init__.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/structured_output/__init__.py Outdated
Comment thread vllm/v1/structured_output/__init__.py Outdated
Comment thread tests/v1/core/test_scheduler.py
Comment thread vllm/v1/core/sched/scheduler.py
Comment thread tests/v1/core/test_scheduler.py
Comment thread vllm/v1/structured_output/__init__.py Outdated
@sfbemerk sfbemerk force-pushed the bugfix/specdecode-grammar-reasoning-pr branch 2 times, most recently from f81c6c0 to b61dd96 Compare February 16, 2026 20:39
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
@sfbemerk sfbemerk force-pushed the bugfix/specdecode-grammar-reasoning-pr branch 2 times, most recently from 5473327 to e6fb6e5 Compare February 16, 2026 23:08
Benjamin Merkel added 2 commits February 17, 2026 08:23
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Labels

bug Something isn't working structured-output v1

Projects

Status: Done

2 participants