[Bugfix] Grammar ignored when reasoning ends within speculated tokens #34241
sfbemerk wants to merge 3 commits
Code Review
The pull request effectively addresses a complex bug involving the interaction of speculative decoding, reasoning, and structured output. The new test case, test_reasoning_spec_decode_grammar_comprehensive, is well-structured and crucial for validating the fix across various scenarios. The logic introduced to manage speculative tokens and apply grammar constraints during the reasoning-to-structured-output transition appears sound and robust. However, there are several instances where full copies of request.all_token_ids are created, which could lead to significant performance overhead and increased memory usage, especially for long sequences. Optimizing these operations to avoid unnecessary list copying would be a critical improvement.
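One way to avoid the full copies mentioned in the review is to scan the existing token sequence in place, starting from the position where the newly appended tokens begin. The sketch below is only an illustration: `find_reasoning_end`, `end_token_id`, and `start` are hypothetical names, not part of the PR or of vLLM, and it assumes only that the token sequence supports `len()` and integer indexing.

```python
from typing import Optional, Sequence


def find_reasoning_end(
    token_ids: Sequence[int],
    end_token_id: int,
    start: int = 0,
) -> Optional[int]:
    """Return the index of the first `end_token_id` at or after `start`.

    The sequence is read by index, so no slice or copy of `token_ids`
    is created. Returns None if the token is not found.
    """
    for i in range(start, len(token_ids)):
        if token_ids[i] == end_token_id:
            return i
    return None
```

Calling it with `start` set to the number of tokens that were already inspected in earlier steps keeps the per-step cost proportional to the number of new tokens rather than to the full sequence length.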
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Purpose
This PR attempts to fix a bug (#31858) that occurs when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination. Typically, the grammar is not enforced during reasoning but only for the final answer. However, when the reasoning end token is generated within a sequence of speculated tokens, the draft tokens that follow it are not validated against the grammar, which leads to an invalid final answer.
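The failure mode can be illustrated with a toy model of draft-token acceptance. This is a minimal sketch, not the vLLM implementation: `THINK_END`, `accept_draft_tokens`, and `toy_grammar` are hypothetical names, and real verification also involves the target model's logits and a grammar bitmask, which are left out here.

```python
from typing import Callable, List

THINK_END = 99  # hypothetical token id that marks the end of reasoning


def accept_draft_tokens(
    draft: List[int],
    in_reasoning: bool,
    grammar_accepts: Callable[[List[int], int], bool],
) -> List[int]:
    """Return the prefix of `draft` that may be kept.

    Tokens are unconstrained while reasoning. Once THINK_END has been
    seen, every later draft token in the same step is checked against
    the grammar, and the first rejected token ends the accepted prefix.
    """
    accepted: List[int] = []
    answer: List[int] = []  # tokens emitted after reasoning ended
    for tok in draft:
        if in_reasoning:
            accepted.append(tok)
            if tok == THINK_END:
                in_reasoning = False
            continue
        if not grammar_accepts(answer, tok):
            break
        accepted.append(tok)
        answer.append(tok)
    return accepted


def toy_grammar(prefix: List[int], tok: int) -> bool:
    # Toy rule: the answer must start with token 1 (think of "{").
    return tok == 1 if not prefix else True
```

With the draft `[5, 99, 7, 1]`, the token `7` follows the reasoning end token in the same step and violates the toy grammar, so only `[5, 99]` is kept. The bug described above corresponds to skipping the grammar check for such tokens.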
Test Plan
In general, the bug seems to be independent of the specific SpecDecode method; originally I had observed it with DeepSeek models and MTP, but for testing I recommend a smaller model like Qwen3-8B and using the same model as draft model. This way, we have high acceptance rates for our tests and a high likelihood that the original bug appears.
The test request should have response_format=json_schema and a prompt that lures the model into generating output that is not pure JSON, e.g.

Example payload

{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}

Test Result
Without the bugfix, the content field contains invalid JSON, e.g. because of markdown fences.
With the bugfix, the content field contains valid JSON that satisfies the requested grammar.

I am happy to receive feedback and suggestions on how to improve the PR: the interplay of spec decode, grammar, reasoning, and async scheduling seems to be quite complex. I found the first commits with bugfix attempts in the vllm-chutes fork but had to make a few more additions.
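The pass/fail criterion from the test result can be checked automatically on the returned content string. The following is a minimal sketch using only the standard library. `is_valid_hero` is a hypothetical helper that hand-codes the constraints of the example schema (required `name`, `age`, `role`, with `role` restricted to the enum) and is not a general JSON Schema validator.

```python
import json

# Allowed values of the CharacterRole enum in the example schema.
ROLES = {"mage", "warrior", "healer"}


def is_valid_hero(content: str) -> bool:
    """Check that `content` is pure JSON matching the example schema."""
    try:
        obj = json.loads(content)
    except json.JSONDecodeError:
        # Markdown fences or any other non-JSON text end up here.
        return False
    if not isinstance(obj, dict):
        return False
    name, age, role = obj.get("name"), obj.get("age"), obj.get("role")
    return (
        isinstance(name, str)
        # bool is a subclass of int in Python, so exclude it explicitly.
        and isinstance(age, int)
        and not isinstance(age, bool)
        and isinstance(role, str)
        and role in ROLES
    )
```

A response wrapped in markdown fences fails at `json.loads`, which corresponds to the behaviour reported without the bugfix, while a bare JSON object with the three required fields passes.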