[SpecDec + Reasoning] Fix race condition when <channel|> reasoning-end by SvenLorenz · Pull Request #43691 · vllm-project/vllm

SvenLorenz · 2026-05-26T16:56:53Z

Hello,

Setup

I investigated further from my issue #38106. It's actually the combination of speculative decoding, reasoning and tool_choice="required". If either one of these three is disabled the bug doesn't happen. I reproduced this with these two configs:

The configs

cyankiwi/gemma-4-31B-it-AWQ-8bit
  --tool-call-parser=gemma4
  --reasoning-parser=gemma4
  --enable-auto-tool-choice
  --speculative-config '{"model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'
lukealonso/Qwen3.5-397B-A17B-NVFP4
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --reasoning-parser qwen3
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'

And the following request fails in roughly 75% of cases before the bugfix for me:

The request

curl -X POST http://localhost:8001/v1/chat/completions \
                          -H "Content-Type: application/json" \
                          -d '{
                        "messages": [
                            {
                                "role": "user",
                                "content": [
                                    {
                                        "text": "Detect if there is a bunny in this video.",
                                        "type": "text"
                                    },
                                    {
                                        "type": "video_url",
                                        "video_url": {
                                            "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/1080/Big_Buck_Bunny_1080_10s_1MB.mp4"
                                        }
                                    }
                                ]
                            }
                        ],
                        "model": "ai_model",
                        "max_completion_tokens": 16000,
                        "presence_penalty": 1.5,
                        "reasoning_effort": "low",
                        "stream": false,
                        "temperature": 1.0,
                        "tool_choice": "required",
                        "tools": [
                            {
                                "type": "function",
                                "function": {
                                    "name": "final_result",
                                    "description": "The final response which ends this conversation",
                                    "parameters": {
                                        "properties": {
                                            "chain_of_thought": {
                                                "description": "The chain of thought that results in the conclusion for any potential bunnies.",
                                                "type": "string"
                                            },
                                            "conclusion": {
                                                "description": "The conclusion based on the occurrences.",
                                                "enum": ["detection", "no detection", "unsure"],
                                                "type": "string"
                                            }
                                        },
                                        "required": ["chain_of_thought", "conclusion"],
                                        "title": "Bunnies",
                                        "type": "object"
                                    }
                                }
                            }
                        ],
                        "top_p": 0.95,
                        "top_k": 20,
                        "priority": 0.0,
                        "chat_template_kwargs": { "enable_thinking": true }
                    }'

The bug

When speculative decoding generates the reasoning-end token as a draft token, the old code unconditionally set reasoning_ended=True and force-fed the unconstrained bonus token to the grammar, corrupting its state. All subsequent outputs were then constrained by a corrupted grammar leading to unconstrained generation of the model, which more often than not lead to the model generating native tool calls. This then breaks the tool_choice="required" path in _parse_tool_calls_from_content, because it expects the tool calls to be a json of format list[FunctionDefinition]. The model generating native tool calls could be fixed by the suggestion I had in the issue above, but since the model could also generate something else afaik, this is not a true fix.
This bug is also stochastic, because if the reasoning-end token gets generated as the bonus token everything works.

The fix

grammar_bitmask() now does per-token reasoning-end detection: when apply_bitmask=False, each draft token is checked against the reasoner. If reasoning-end token is found mid-batch, apply_bitmask flips to True for subsequent positions (draft tokens after the end token + bonus slot), and bonus_requires_grammar=True is set.
update_from_output() has a new elif branch for bonus_requires_grammar. The bonus token is fed to the grammar:

Accepted: bonus was generated under constraint → set reasoning_ended=True, constrain all future tokens
Rejected: bonus was generated without constraint (spec decode rejected reasoning-end token draft) → reset bonus_requires_grammar=False, leave reasoning_ended=False, grammar state untouched. The model continues reasoning naturally and reasoning-end token will appear in a future batch.
accept_tokens() has a suppress_accept_errors flag to avoid ERROR log spam from expected rejections in this path.

Test Plan

Four unit tests were added to tests/v1/structured_output/test_reasoning_structured_output.py, using mocked reasoners and grammars to validate the grammar_bitmask() per-token reasoning-end detection logic:

test_grammar_bitmask_reasoning_ends_mid_batch — <channel|> is the last draft token ([10, 20, 30, 99]). Verifies: only the bonus slot (idx 4) gets a constrained bitmask, bonus_requires_grammar=True is set, reasoning_ended stays False, and no accept_tokens calls are made for draft positions.
test_grammar_bitmask_reasoning_ends_mid_draft — <channel|> is at position 1 ([10, 99, 30, 40]). Verifies: positions 2, 3, and bonus (idx 2-4) get constrained bitmasks, accept_tokens is skipped for all positions (draft tokens after <channel|> were generated unconstrained), bonus_requires_grammar=True, no rollback called.
test_grammar_bitmask_no_reasoning_end — No end token in batch ([10, 20, 30, 40]). Verifies: all positions unconstrained, reasoning_ended unchanged, no fill_bitmask calls.
test_grammar_bitmask_reasoning_already_ended — reasoning_ended=True before the batch. Verifies: all 5 positions (4 draft + 1 bonus) constrained, 4 accept_tokens calls made, rollback called — the normal pre-existing path is unchanged.

Test Result

All tests for tests/v1/structured_output are passing.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results

token appears as a rejected draft token When speculative decoding generates the <channel|> reasoning-end token as a draft token that gets rejected, the old code unconditionally set reasoning_ended=True and force-fed the unconstrained bonus token to the grammar, corrupting its state. All subsequent outputs were then constrained by a corrupted grammar. Fix: - Detect <channel|> mid-draft-batch in grammar_bitmask() and set bonus_requires_grammar=True to flag that the bonus slot should get a constrained bitmask and the grammar needs advancing - In update_from_output(), only mark reasoning as ended and advance the grammar when the bonus token is actually accepted (meaning the reasoning-end draft token was accepted by spec decode) - When the bonus token is rejected, leave reasoning_ended=False so the model continues generating reasoning text naturally - Add suppress_accept_errors flag to avoid ERROR-level log spam from expected grammar rejections in this path Signed-off-by: SvenLorenz <sven.m.lorenz@gmail.com>

github-actions · 2026-05-26T16:57:03Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

SvenLorenz requested review from ApostaC, WoosukKwon, aarnphm, alexm-redhat, benchislett, heheda12345, mgoin, njhill, orozery, robertgshaw2-redhat, russellb and ywang96 as code owners May 26, 2026 16:56

mergify Bot added structured-output v1 labels May 26, 2026

github-project-automation Bot added this to Structured Output May 26, 2026

alexbi29 mentioned this pull request Jun 1, 2026

feat: split temperature for reasoning vs answer phase alexbi29/vllm#3

Open

z-priyanshu mentioned this pull request Jun 4, 2026

[Bugfix] Fix Gemma4 tool call parser using vocab key instead of decoded token string #44532

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SpecDec + Reasoning] Fix race condition when <channel|> reasoning-end#43691

[SpecDec + Reasoning] Fix race condition when <channel|> reasoning-end#43691
SvenLorenz wants to merge 1 commit into
vllm-project:mainfrom
SvenLorenz:main

SvenLorenz commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

SvenLorenz commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Setup

The bug

The fix

Test Plan

Test Result

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SvenLorenz commented May 26, 2026 •

edited

Loading