[Bugfix] Grammar was ignored when reasoning ended within speculated tokens#36138
[Bugfix] Grammar was ignored when reasoning ended within speculated tokens#36138sfbemerk wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request fixes a bug where grammar constraints were not applied to speculative tokens generated after a reasoning-end marker. The changes correctly identify when reasoning ends within a batch of tokens and ensure that only post-reasoning tokens are validated against the grammar. The patch modifies the StructuredOutputManager to detect the reasoning end within token batches and provides new helper methods to split tokens accordingly. The Scheduler is updated to use this new logic when advancing the grammar and validating speculative tokens. The changes are supported by a comprehensive set of new unit tests. My main feedback is to refactor duplicated code in the Scheduler for better maintainability.
|
Thanks @sfbemerk. This appears to add quite a lot of code and complexity, would be good if we can find a much simpler fix. |
|
Hi @njhill , thanks for looking into this. Fully agreed - a simpler solution is always better. The original issue is that in all places where draft tokens and grammar interact (in My current approach is: split the batch of speculated tokens where reasoning_end is detected, and then let the first part pass through as unconstrained tokens, while the second part (constrained tokens) is validated by the grammar. I added a method If you have ideas to solve the underlying issue without such complexity, I am happy to follow your suggestions! |
…soning speculated tokens Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
…ing review suggestion Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Previously, a reasoning_end draft token appearing mid-batch would skip grammar constraints if reasoning_ended=True had already been set in a previous step. Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
6d5e65b to
94f4dc2
Compare
|
I rebased this pull request. @njhill It would be great if you could find some time to review the changes or think about a completely different approach. |
|
Up, this is important fix, without it speculative decoding is effectively useless |
|
@sfbemerk — first, thank you for this PR. The clarity of your "spec_token_ids was overloaded with two different meanings" analysis on the parent issues (#34650 + #31858) was a turning point in our investigation — once that diagnosis was on the table, we could stop chasing model-output theories and start auditing the spec-decode + structured-output timing path. Backported on a Qwen3.6-35B-A3B-FP8 production rig (2× A5000, vLLM What we backportedAll three of your changes:
Implementation noteKept Empirical impact on our setupStandalone delta of P62 (your fix) on top of the GDN+ngram fix from #40738 was ~3% incremental clean rate (53% → 56% on n=30 reproducer). Smaller than expected — turned out our dominant residual mode was actually ngram acceptance bias toward XML-repeat patterns, not the structured-output reasoning timing your PR addresses. Why the small delta in our specific case: we run Backport reference
Hope the data point helps the review. Cleanly-implemented PR — anchors held perfectly against dev205, no hidden dependencies. |
|
It seems this PR need a rebase after #41199 |
|
Hi @sfbemerk — thanks for the original analysis here, it pointed us at the right area of the code. We hit this bug hard on gpt-oss-120b + EAGLE3 + I just opened #43424 with a generalization of your approach: same insight (validate reasoning-end-aware), but the validation moves pre-commit so it also handles the related failure mode where verifier bonus tokens are sampled without the grammar mask engaged (the mask is gated on If you'd rather land your PR first and have us layer on top, happy to rework #43424 as a stacked PR instead. Either way, want to make sure your work gets the lineage it deserves. cc the maintainers since the gpt-oss case is biting users in production. |
|
This pull request has merge conflicts that must be resolved before it can be |
Activates the pre-commit grammar filter added in the previous commit. Before _update_request_with_output appends new tokens to the request, the scheduler calls StructuredOutputManager.precommit_filter_tokens and, when any trailing tokens are rejected, truncates new_token_ids and decrements num_computed_tokens and num_output_placeholders by the rejected count. This mirrors the existing path for verifier-rejected speculative tokens (see lines 1361-1373 in Scheduler.update_from_output). The bug this addresses: in speculative decoding with a reasoning parser, the grammar bitmask is gated on reasoning_ended (see StructuredOutputManager.should_fill_bitmask). The boundary step's bonus tokens are sampled WITHOUT the mask engaged — reasoning_ended is False at sample time and only flips True after the boundary is observed. The existing should_advance deferral correctly suppresses the post-commit accept_tokens call on the boundary step (avoiding spurious FINISHED_ERROR), but the bonus tokens are already in the response stream as garbage prefixes before the valid JSON. Repro: gpt-oss-120b + EAGLE3 + response_format json_schema strict shows the failure on ~54% of requests on main HEAD with the Harmony multi-token marker. With this patch it drops to <5% in our 300-trace shadow. Refs: vllm-project#36138 (single-token version of this fix by sfbemerk), vllm-project#43338. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Adam Moisa <adammoisa@gmail.com>
|
Thanks, @adammoisa for improving the fix. I don't mind having your implementation merged instead. I am more interested in getting ANY fix for the issue merged ;-) |
Purpose
This PR attempts to fix a bug (#31858, #34650) when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination: typically, grammar is not enabled during reasoning but only for the final answer. However, when the reasoning end token is generated, any subsequent draft tokens are not validated against the grammar, leading to an invalid final answer.
Test Plan
In general, the bug seems to be independent of the specific SpecDecode method; originally I had observed it with DeepSeek models and MTP, but for testing I recommend a smaller model like Qwen3-8B and using the same model as draft model. This way, we have high acceptance rates for our tests and a high likelihood that the original bug appears.
The test request should have
response_format=json_schemaand a prompt that lurkes the model into generating not pure json, e.g.example payload
{ "model": "Qwen/Qwen3-8B", "messages": [ { "role": "user", "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```" } ], "response_format": { "type": "json_schema", "json_schema": { "name": "hero", "schema": { "$defs": { "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"} }, "properties": { "name": {"description": "Character name", "title": "Name", "type": "string"}, "age": {"description": "Character age", "title": "Age", "type": "integer"}, "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"} }, "required": ["name", "age", "role"], "title": "Character", "type": "object" } } } }The original bug can also be reproduced for Model Runner V2, this bugfix works there as well. For testing, you should choose a different speculative method (since
draft_modelis not supported yet):The original bug is still present in vllm v0.17.1.
Test Result
without bugfix, the
contentfield contains invalid json, e.g. because of markdown fenceswith the bugfix, the
contentfield contains valid json that satisfies the requested grammarI am happy to receive feedback and suggestions on how to improve the PR: the interplay of spec decode, grammar, reasoning, and async scheduling seems to be quite complex.
Related
There had been several attempts to fix this bug before: my first attempt in #34241 would reject all speculated tokens in the step where reasoning_end was detected, which was working fine, but was suboptimal. #34978 started with a better approach that would validate all speculative tokens following reasoning_end, but contained some bugs in the end and was discontinued.