fix: DeepSeek-R1 structured-output reasoning end detection (scheduler + parser) by nbethala · Pull Request #34978 · vllm-project/vllm

nbethala · 2026-02-20T18:32:48Z

Fix: Structured Output + Speculative Decoding Stability (DeepSeek‑R1)

Summary
This PR resolves instability when using:

response_format = json_schema
xgrammar backend
reasoning_parser = qwen3
speculative decoding (num_speculative_tokens > 1)

Previously, speculative draft tokens that violated the JSON grammar, most notably </think> (token ID 151649), could cause:

incorrect reasoning‑boundary detection
FSM advance failures inside xgrammar
fatal assertions and engine crashes

Changes

69c180b0f

Update StructuredOutputManager.should_advance() to accept new_token_ids
Use actual newly produced tokens instead of counter‑derived slices
Fix reasoning‑end detection under speculative decoding

b71cbab18

Fix scheduler integration so correct token batches are passed into should_advance
Ensure reasoning‑end detection works for:
- normal decode path
- speculative accepted tokens
- speculative bonus token

939f6d3a7

In backend_xgrammar.accept_tokens():
- Treat rejected tokens as non‑fatal
- Downgrade error‑level failures to debug logs
- Allow speculative decoding to gracefully reject invalid draft continuations (e.g., </think>)
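The non-fatal handling described for this commit can be sketched as follows. This is a toy illustration only: `ToyMatcher` and the `accept_tokens` helper are hypothetical stand-ins and do not reproduce the xgrammar API or the code in backend_xgrammar. The point is that a rejected draft token yields a `False` return and a debug-level log line rather than an exception.

```python
import logging

logger = logging.getLogger("toy_structured_output")


class ToyMatcher:
    """Hypothetical grammar matcher that only accepts tokens from a fixed set."""

    def __init__(self, allowed: set[int]) -> None:
        self.allowed = allowed
        self.accepted: list[int] = []

    def accept_token(self, token_id: int) -> bool:
        if token_id not in self.allowed:
            return False
        self.accepted.append(token_id)
        return True


def accept_tokens(matcher: ToyMatcher, request_id: str, tokens: list[int]) -> bool:
    """Return False on the first rejected token instead of raising."""
    for token_id in tokens:
        if not matcher.accept_token(token_id):
            # A rejected draft token is expected under speculative decoding,
            # so it is logged at debug level rather than treated as an error.
            logger.debug("request %s: grammar rejected token %d", request_id, token_id)
            return False
    return True


matcher = ToyMatcher(allowed={10, 11, 12})
ok = accept_tokens(matcher, "req-0", [10, 11])            # both tokens valid
rejected = accept_tokens(matcher, "req-0", [12, 151649])  # second token invalid
```

The caller can then drop the rejected continuation and keep serving the request.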

Reproduction (Public)

Server

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --tensor-parallel-size 1 \
  --reasoning-parser qwen3 \
  --structured-outputs-config '{"backend":"xgrammar","reasoning_parser":"qwen3"}' \
  --speculative-config '{"method":"draft_model","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","num_speculative_tokens":5}'

Stress test

for i in {1..50}; do
  curl -s http://localhost:8000/v1/chat/completions ...
done
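For readers who prefer a scripted version of the loop, here is a rough Python equivalent. The curl arguments are elided in the original, so the request body built by `build_payload` is an assumption (a small JSON-schema request), as are the helper names; only the endpoint URL and the model name come from the commands above.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"  # endpoint from the curl loop


def build_payload(model: str) -> dict:
    # Hypothetical request body; the original curl arguments are not shown.
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Return a JSON object with a name."}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "toy",
                "schema": {
                    "type": "object",
                    "properties": {"name": {"type": "string"}},
                    "required": ["name"],
                },
            },
        },
    }


def post(payload: dict) -> int:
    """Send one request and return the HTTP status code."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status


def stress(n: int, send=post) -> list[int]:
    """Issue n sequential requests and collect their status codes."""
    payload = build_payload("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    return [send(payload) for _ in range(n)]
```

With the server running, `stress(50)` would mirror the 50-iteration loop. The `send` parameter is injectable so the loop logic can be exercised without a live server.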

Before

FSM failure on </think>
engine abort

After

50+ sequential requests succeed
HTTP 200
no crash
no fatal FSM failure

Notes
Structured‑output correctness (e.g., preamble text before {) appears to be a separate issue and is not addressed in this PR.

UPDATE (Feb 20): Rewrote the summary for clarity. The PR fixes a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.

Summary
This PR completes DeepSeek‑R1 structured‑output integration by fixing a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.

Problem:
DeepSeek‑R1 must detect the </think> token to transition from free‑form reasoning into JSON‑constrained structured output. While the scheduler was updated to pass new_token_ids for reasoning‑end detection, StructuredOutputManager.should_advance() still used the old signature, causing it to ignore reasoning‑end state and fail to advance.

Fix:
Update StructuredOutputManager.should_advance() to accept new_token_ids, aligning it with scheduler call sites and enabling correct reasoning‑end transitions.

Impact

Structured‑output requests now advance correctly after reasoning ends
No behavior change for non‑structured‑output requests
All checks pass

Fixes #34650
Related: #34241, #31858

This PR fixes a severe issue where DeepSeek-R1 fails to detect the </think>
end-of-reasoning token when structured outputs and speculative decoding (MTP)
are enabled together.

Root cause:

The scheduler did not pass new_token_ids into all should_advance() call sites.
Under speculative decoding, num_computed_tokens is incremented before tokens
are appended, causing the delta slice to be empty.
The DeepSeek-R1 reasoning parser lacked an is_reasoning_end_streaming() check,
so </think> was never detected during streaming.

Fix:

Pass new_token_ids into all should_advance() call sites (including speculative decoding).
Add is_reasoning_end_streaming() to DeepSeekR1ReasoningParser.

This allows the FSM to correctly detect </think> and transition into JSON mode.
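The counter problem in the root cause can be illustrated with a toy request object. `ToyRequest` and the two helper functions are hypothetical and are not vLLM's actual types. The sketch assumes the </think> token ID quoted earlier in this description. It shows why a slice derived from a counter comes up empty once the counter has already been advanced, while an explicitly passed list of new tokens still contains the end-of-reasoning token.

```python
from dataclasses import dataclass, field

THINK_END = 151649  # </think> token ID as quoted in this PR; treat as an assumption


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a scheduler request."""

    all_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0


def reasoning_ended_from_counter(req: ToyRequest) -> bool:
    # Counter-derived slice: everything after the computed-token counter.
    delta = req.all_token_ids[req.num_computed_tokens:]
    return THINK_END in delta


def reasoning_ended_from_new_tokens(new_token_ids: list[int]) -> bool:
    # Explicit delta: inspect exactly the tokens produced in this step.
    return THINK_END in new_token_ids


req = ToyRequest(all_token_ids=[1, 2, 3], num_computed_tokens=3)
new_token_ids = [42, THINK_END]

# The counter has already been advanced by the time the check runs.
req.num_computed_tokens += len(new_token_ids)
req.all_token_ids.extend(new_token_ids)

counter_result = reasoning_ended_from_counter(req)                # slice is empty
explicit_result = reasoning_ended_from_new_tokens(new_token_ids)  # sees THINK_END
print(counter_result, explicit_result)  # False True
```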


github-actions · 2026-02-20T18:32:58Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request aims to fix an issue with DeepSeek-R1 structured output and speculative decoding by passing new_token_ids to should_advance and updating the reasoning parser. While the logic of passing new_token_ids is correct, all three changes in vllm/v1/core/sched/scheduler.py introduce critical indentation errors that will cause the program to fail with a syntax error. These indentation issues must be fixed.

gemini-code-assist · 2026-02-20T18:34:28Z

@@ -1382,7 +1382,7 @@ def update_from_output(
            ):
                new_logprobs = logprobs.slice_request(req_index, len(new_token_ids))

-            if new_token_ids and self.structured_output_manager.should_advance(request):
+                if new_token_ids and self.structured_output_manager.should_advance(request, new_token_ids):


This if statement has been incorrectly indented. It should be at the same indentation level as the if statement on line 1378. The current indentation will lead to a syntax error because the following lines (1386-1389) are not correctly indented as the body of this if block.

if new_token_ids and self.structured_output_manager.should_advance(request, new_token_ids):

gemini-code-assist · 2026-02-20T18:34:28Z

@@ -1607,7 +1607,7 @@ def update_draft_token_ids(self, draft_token_ids: DraftTokenIds) -> None:
                continue

            # Add newly generated spec token ids to the request.
-            if self.structured_output_manager.should_advance(request):
+                if self.structured_output_manager.should_advance(request, spec_token_ids):


This if statement has been incorrectly indented. It is now at a deeper level, but its body on the following lines (1611-1612) has not been indented, which will cause a syntax error. This if block should not be indented.

if self.structured_output_manager.should_advance(request, spec_token_ids):

gemini-code-assist · 2026-02-20T18:34:28Z

@@ -1636,7 +1636,7 @@ def update_draft_token_ids_in_output(
            # (needed for chunked prefill case for example).
            del spec_token_ids[orig_num_spec_tokens:]
            # Filter out spec tokens which do not adhere to the grammar.
-            if self.structured_output_manager.should_advance(request):
+                if self.structured_output_manager.should_advance(request, spec_token_ids):


This if statement has been incorrectly indented. It is now at a deeper level, but its body on the following lines (1640-1642) has not been indented accordingly. This will cause a syntax error. The if statement should not be indented.

Suggested change

if self.structured_output_manager.should_advance(request, spec_token_ids):


dosubot · 2026-02-20T23:47:20Z

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


nbethala · 2026-02-20T23:47:44Z

Updated the PR description with a concise summary of the fix. All checks except Buildkite have completed, and the remaining pipeline is pending in the queue. This is ready for review.

sfbemerk · 2026-02-22T18:09:02Z

@@ -25,6 +25,9 @@ def end_token(self) -> str:
        """The token that ends reasoning content."""
        return "</think>"

+    def is_reasoning_end_streaming(self, all_token_ids, new_token_ids):


Is this really necessary here?
To me, it looks as if the parent class already implements the same logic; see https://github.com/nbethala/vllm/blob/97505c01ba777acaef0c259dd4d9d959d800703c/vllm/reasoning/basic_parsers.py#L79-L83
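The reviewer's point about inheritance can be shown with a small sketch. Both classes here are hypothetical toys, not the classes in vllm/reasoning, and the token ID is the one quoted in the PR description. If the base class already provides `is_reasoning_end_streaming`, a subclass that only sets the end token gets the check without redefining it.

```python
class BaseToyReasoningParser:
    """Hypothetical base parser; the real vLLM classes do considerably more."""

    end_token_id: int = 0

    def is_reasoning_end_streaming(self, all_token_ids, new_token_ids) -> bool:
        # Reasoning ends in this step if the end token is among the new tokens.
        return self.end_token_id in new_token_ids


class ToyR1ReasoningParser(BaseToyReasoningParser):
    end_token_id = 151649  # </think> ID quoted in the PR description (assumption)

    @property
    def end_token(self) -> str:
        """The token that ends reasoning content."""
        return "</think>"


parser = ToyR1ReasoningParser()
# True when the subclass does not define the method itself, i.e. it is inherited.
inherited = "is_reasoning_end_streaming" not in vars(ToyR1ReasoningParser)
ended = parser.is_reasoning_end_streaming([1, 2, 151649], [151649])
```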

sfbemerk · 2026-02-22T19:29:18Z

Hi, I just ran your branch with my test setup from #34241

vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-8B","num_speculative_tokens":5}'

and payload

{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}

Unfortunately, it fails with

(EngineCore_DP0 pid=770) Process EngineCore_DP0:
(EngineCore_DP0 pid=770) Traceback (most recent call last):
(EngineCore_DP0 pid=770)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=770)     self.run()
(EngineCore_DP0 pid=770)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=770)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1040, in run_engine_core
(EngineCore_DP0 pid=770)     raise e
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1029, in run_engine_core
(EngineCore_DP0 pid=770)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1056, in run_busy_loop
(EngineCore_DP0 pid=770)     self._process_engine_step()
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1092, in _process_engine_step
(EngineCore_DP0 pid=770)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=770)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 395, in step
(EngineCore_DP0 pid=770)     grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
(EngineCore_DP0 pid=770)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/sched/scheduler.py", line 1247, in get_grammar_bitmask
(EngineCore_DP0 pid=770)     bitmask = self.structured_output_manager.grammar_bitmask(
(EngineCore_DP0 pid=770)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/structured_output/__init__.py", line 268, in grammar_bitmask
(EngineCore_DP0 pid=770)     assert accepted, (token, req_id, scheduled_spec_decode_tokens)
(EngineCore_DP0 pid=770)            ^^^^^^^^
(EngineCore_DP0 pid=770) AssertionError: (975, 'chatcmpl-198b3ec9bc4245d3a12053cd05a92557-0888-bc0b5598', {'chatcmpl-198b3ec9bc4245d3a12053cd05a92557-0888-bc0b5598': [975, 624, 151668, 271, 73594]})

Maybe you could test and revisit your bugfix again for num-speculative-tokens > 1? I'd recommend the test setup above with draft model = full model (and then using a tiny model), because that gives the highest acceptance rates even for very large numbers of speculative tokens.

nbethala · 2026-02-23T00:53:02Z

Update:

I was able to fully reproduce the same failure on my side using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with num_speculative_tokens=5 (draft model = full model) and a JSON-schema response_format.

The crash occurs in StructuredOutputManager.grammar_bitmask() when a speculative draft token is rejected by the JSON grammar:

assert accepted, (token, req_id, scheduled_spec_decode_tokens)

In my reproduction, the offending speculative token decodes to </think>. With multi-token speculative decoding enabled, the draft sequence can include tokens (e.g., reasoning delimiters) that are not valid under the JSON grammar at that point. The current implementation asserts on this condition rather than handling the invalid speculative branch gracefully.

Next steps:
I’m going to patch grammar_bitmask() so that invalid speculative sequences are handled safely (e.g., truncate on first rejection instead of asserting), and then re-run the same harness across num_speculative_tokens = {1, 2, 5, 10}.

I’ll follow up with a minimal patch and a regression test that covers this interaction.
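As a rough sketch of the "truncate on first rejection" idea from the next steps: the `truncate_draft` helper and the `accepts` predicate below are hypothetical and stand in for a real grammar matcher. This is not the patch itself. The draft is cut at the first token the grammar refuses, and everything drafted after that token is dropped with it.

```python
from typing import Callable


def truncate_draft(
    draft_tokens: list[int],
    accepts: Callable[[list[int], int], bool],
) -> list[int]:
    """Keep draft tokens up to (not including) the first one the grammar rejects.

    `accepts(prefix, token)` is a hypothetical predicate standing in for a
    grammar matcher; it reports whether `token` may follow `prefix`.
    """
    kept: list[int] = []
    for token in draft_tokens:
        if not accepts(kept, token):
            break  # drop this token and everything drafted after it
        kept.append(token)
    return kept


def toy_accepts(prefix: list[int], token: int) -> bool:
    # Toy grammar: only token IDs below 1000 are valid.
    return token < 1000


draft = [10, 11, 151649, 12]  # sample data; the third token violates the toy grammar
kept = truncate_draft(draft, toy_accepts)
print(kept)  # [10, 11]
```

Tokens after the rejected one are discarded even if they would be valid on their own, since they were drafted as a continuation of an invalid prefix.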

mergify · 2026-02-23T07:19:08Z

Hi @nbethala, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

nbethala · 2026-02-23T20:10:21Z

FYI: local pre-commit hook mypy-local currently fails on origin/main:

vllm/v1/worker/gpu_worker.py:760: error: Item "None" of "TorchProfilerWrapper | CudaProfilerWrapper | None" has no attribute "start" [union-attr]

Repro:
git checkout origin/main
uv pip install pre-commit
pre-commit run mypy-local --all-files

Also, buildkite/ci/pr is still “Expected — Waiting for status to be reported” with no details link. Can someone retrigger Buildkite or check the GitHub status integration?

sfbemerk · 2026-02-23T21:52:30Z

I think the current branch is still not 100% working. When I run vLLM with my tests and your branch, the </think> token is dropped, so reasoning never ends. The JSON is generated correctly, but it lands at the end of "reasoning_content" rather than in "content".

{
  "role": "assistant",
  "content": null,
  "reasoning_content": "\nOkay, the user wants me to imagine a fantasy hero [...] Now, make sure there are no typos and that the JSON{ \"name\": \"Eldrin the Stormborn\", \"age\": 28, \"role\": \"warrior\" }",
  "tool_calls": null
}
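The symptom in this response can be reproduced with a toy splitter. `split_reasoning` is a hypothetical helper and not vLLM's reasoning parser. When the end-of-reasoning marker never appears in the output, nothing separates the two parts, so the JSON is classified as reasoning and "content" stays null.

```python
def split_reasoning(text: str, end_token: str = "</think>") -> dict:
    """Toy splitter: text before the end token is reasoning, the rest is content."""
    reasoning, sep, content = text.partition(end_token)
    if not sep:
        # End token never seen: everything is classified as reasoning.
        return {"content": None, "reasoning_content": text}
    return {"content": content.strip() or None, "reasoning_content": reasoning}


with_end = split_reasoning('Let me think.</think>{"name": "Eldrin"}')
without_end = split_reasoning('Let me think.{"name": "Eldrin"}')
```

Here `with_end` puts the JSON in "content", while `without_end` has the same shape as the response above: "content" is `None` and the JSON trails the reasoning text.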

mergify · 2026-02-24T17:23:25Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @nbethala.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


nbethala · 2026-02-24T22:27:23Z

Upstream now includes the islice‑based reasoning‑end detection, which addresses the core issue this PR targeted. Since the fix is already covered on main, I’m closing this PR.

Thanks for the guidance and review throughout the debugging process. It helped me understand the interaction between speculative decoding and structured output much more deeply.
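The upstream islice-based change is not shown in this thread, so the following is only one plausible reading of the idea: keep a cursor into the committed token stream and scan just the tokens that have not been inspected yet. The `ReasoningEndScanner` class and its attribute names are hypothetical.

```python
from itertools import islice


class ReasoningEndScanner:
    """Hypothetical helper that scans only tokens not yet inspected."""

    def __init__(self, end_token_id: int) -> None:
        self.end_token_id = end_token_id
        self.cursor = 0      # index of the first token not yet scanned
        self.ended = False   # sticky: stays True once the end token was seen

    def update(self, committed_token_ids: list[int]) -> bool:
        if not self.ended:
            # islice skips already scanned tokens without copying the list.
            for token_id in islice(committed_token_ids, self.cursor, None):
                if token_id == self.end_token_id:
                    self.ended = True
                    break
            self.cursor = len(committed_token_ids)
        return self.ended


scanner = ReasoningEndScanner(end_token_id=151649)
tokens = [1, 2, 3]
first = scanner.update(tokens)    # no end token committed yet
tokens.extend([4, 151649, 5])     # e.g. several accepted speculative tokens
second = scanner.update(tokens)   # end token found among the new tokens
```

Because the scan works on committed tokens rather than a per-step counter, it does not matter how many tokens were accepted in a single step.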

github-project-automation Bot added this to DeepSeek V3/R1 Feb 20, 2026

github-project-automation Bot moved this to Backlog in DeepSeek V3/R1 Feb 20, 2026

mergify Bot added deepseek Related to DeepSeek models v1 labels Feb 20, 2026

gemini-code-assist Bot reviewed Feb 20, 2026


mergify Bot added the structured-output label Feb 20, 2026

github-project-automation Bot added this to Structured Output Feb 20, 2026

nbethala force-pushed the fix-deepseek-r1-structured-output branch 4 times, most recently from c80cab6 to 97505c0 Compare February 20, 2026 22:54

nbethala marked this pull request as ready for review February 20, 2026 23:47

nbethala requested review from ApostaC, WoosukKwon, aarnphm, alexm-redhat, benchislett, chaunceyjiang, heheda12345, mgoin, njhill, orozery, robertgshaw2-redhat, russellb and ywang96 as code owners February 20, 2026 23:47

sfbemerk reviewed Feb 22, 2026


nbethala force-pushed the fix-deepseek-r1-structured-output branch from 97505c0 to b71cbab Compare February 23, 2026 07:14

nbethala added 6 commits February 23, 2026 18:44

fix: DeepSeek-R1 reasoning end detection and scheduler token passing

d8faf79

Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com> Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

chore: fix formatting and indentation in scheduler.py

e2a618c

Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com> Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

fix: update StructuredOutputManager.should_advance to accept new_toke…

84ddd84

…n_ids Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com> Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

Fix structured output with speculative decoding and reasoning

7061ecd

Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

xgrammar: treat rejected tokens as non-fatal under speculative decoding

f76d483

Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

Fix: Handle rejected speculative tokens gracefully in grammar validation

44c209e

Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

nbethala force-pushed the fix-deepseek-r1-structured-output branch from 50b3d3e to 44c209e Compare February 23, 2026 18:44

style: fix ruff formatting

ada9a69

Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

mergify Bot added the needs-rebase label Feb 24, 2026

Fix reasoning end detection by scanning committed token stream via cu…

78de290

…rsor Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>

nbethala force-pushed the fix-deepseek-r1-structured-output branch from cb847d0 to 78de290 Compare February 24, 2026 21:36

sfbemerk mentioned this pull request Mar 5, 2026

[Bugfix] Grammar was ignored when reasoning ended within speculated tokens #36138

Open

nbethala closed this Mar 22, 2026

github-project-automation Bot moved this to Done in Structured Output Mar 22, 2026

github-project-automation Bot moved this from Backlog to Done in DeepSeek V3/R1 Mar 22, 2026
