fix: DeepSeek-R1 structured-output reasoning end detection (scheduler + parser) #34978

Closed
nbethala wants to merge 8 commits into vllm-project:main from nbethala:fix-deepseek-r1-structured-output

Conversation

@nbethala

@nbethala nbethala commented Feb 20, 2026

Fix: Structured Output + Speculative Decoding Stability (DeepSeek‑R1)

Summary
This PR resolves instability when using:

  • response_format = json_schema
  • xgrammar backend
  • reasoning_parser = qwen3
  • speculative decoding (num_speculative_tokens > 1)

Previously, speculative draft tokens that violated the JSON grammar, most notably </think> (token ID 151649), could cause:

  • incorrect reasoning‑boundary detection
  • FSM advance failures inside xgrammar
  • fatal assertions and engine crashes

Changes

69c180b0f

  • Update StructuredOutputManager.should_advance() to accept new_token_ids
  • Use actual newly produced tokens instead of counter‑derived slices
  • Fix reasoning‑end detection under speculative decoding

b71cbab18

  • Fix scheduler integration so correct token batches are passed into should_advance
  • Ensure reasoning‑end detection works for:
    - normal decode path
    - speculative accepted tokens
    - speculative bonus token

939f6d3a7

  • In backend_xgrammar.accept_tokens():
    - Treat rejected tokens as non‑fatal
    - Downgrade error‑level failures to debug logs
    - Allow speculative decoding to gracefully reject invalid draft continuations (e.g., </think>)

Reproduction (Public)

Server

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --tensor-parallel-size 1 \
  --reasoning-parser qwen3 \
  --structured-outputs-config '{"backend":"xgrammar","reasoning_parser":"qwen3"}' \
  --speculative-config '{"method":"draft_model","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","num_speculative_tokens":5}'

Stress test

for i in {1..50}; do
  curl -s http://localhost:8000/v1/chat/completions ...
done

Before

  • FSM failure on </think>
  • engine abort

After

  • 50+ sequential requests succeed
  • HTTP 200
  • no crash
  • no fatal FSM failure

Notes
Structured‑output correctness (e.g., preamble text before {) appears to be a separate issue and is not addressed in this PR.


UPDATE (Feb 20): Rewrote the summary for clarity. The PR fixes a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.

Summary
This PR completes DeepSeek‑R1 structured‑output integration by fixing a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.

Problem:
DeepSeek‑R1 must detect the </think> token to transition from free‑form reasoning into JSON‑constrained structured output. While the scheduler was updated to pass new_token_ids for reasoning‑end detection, StructuredOutputManager.should_advance() still used the old signature, causing it to ignore reasoning‑end state and fail to advance.

Fix:
Update StructuredOutputManager.should_advance() to accept new_token_ids, aligning it with scheduler call sites and enabling correct reasoning‑end transitions.

Impact

  • Structured‑output requests now advance correctly after reasoning ends
  • No behavior change for non‑structured‑output requests
  • All checks pass

Fixes #34650
Related: #34241, #31858

This PR fixes a severe issue where DeepSeek-R1 fails to detect the </think>
end-of-reasoning token when structured outputs and speculative decoding (MTP)
are enabled together.

Root cause:

  • The scheduler did not pass new_token_ids into all should_advance() call sites.
    Under speculative decoding, num_computed_tokens is incremented before tokens
    are appended, causing the delta slice to be empty.
  • The DeepSeek-R1 reasoning parser lacked an is_reasoning_end_streaming() check,
    so </think> was never detected during streaming.

Fix:

  • Pass new_token_ids into all should_advance() call sites (including speculative decoding).
  • Add is_reasoning_end_streaming() to DeepSeekR1ReasoningParser.

This allows the FSM to correctly detect </think> and transition into JSON mode.
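
To illustrate the root cause above, here is a minimal sketch of why a counter-derived slice can come back empty while explicitly passed tokens do not. ToyRequest, simulate_spec_step and both should_advance_* helpers are hypothetical stand-ins written for this example, not vLLM's actual classes or signatures; the only value taken from the PR is the </think> token ID.

```python
from dataclasses import dataclass, field
from typing import List

THINK_END_ID = 151649  # </think> token ID quoted in the PR description


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a scheduler request (not vLLM's Request)."""
    all_token_ids: List[int] = field(default_factory=list)
    num_computed_tokens: int = 0
    reasoning_ended: bool = False


def should_advance_from_counter(req: ToyRequest) -> bool:
    """Old style: inspect whatever lies past num_computed_tokens."""
    delta = req.all_token_ids[req.num_computed_tokens:]
    if THINK_END_ID in delta:
        req.reasoning_ended = True
    return req.reasoning_ended


def should_advance_from_new_tokens(req: ToyRequest, new_token_ids: List[int]) -> bool:
    """New style: the caller passes exactly the tokens it just produced."""
    if THINK_END_ID in new_token_ids:
        req.reasoning_ended = True
    return req.reasoning_ended


def simulate_spec_step(req: ToyRequest, new_token_ids: List[int], use_new_tokens: bool) -> bool:
    # The counter is bumped before the tokens are appended, as described above.
    req.num_computed_tokens += len(new_token_ids)
    if use_new_tokens:
        advanced = should_advance_from_new_tokens(req, new_token_ids)
    else:
        advanced = should_advance_from_counter(req)
    req.all_token_ids.extend(new_token_ids)
    return advanced
```

With three existing tokens and a step that produces [10, THINK_END_ID, 11], the counter-based helper slices past the end of the list and sees nothing, while the explicit-token helper detects the end of reasoning.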

Fixes #34650
Related: #34241, #31858

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify Bot added the deepseek (Related to DeepSeek models) and v1 labels Feb 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request aims to fix an issue with DeepSeek-R1 structured output and speculative decoding by passing new_token_ids to should_advance and updating the reasoning parser. While the logic of passing new_token_ids is correct, all three changes in vllm/v1/core/sched/scheduler.py introduce critical indentation errors that will cause the program to fail with a syntax error. These indentation issues must be fixed.

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@@ -1382,7 +1382,7 @@ def update_from_output(
):
new_logprobs = logprobs.slice_request(req_index, len(new_token_ids))

- if new_token_ids and self.structured_output_manager.should_advance(request):
+ if new_token_ids and self.structured_output_manager.should_advance(request, new_token_ids):
Contributor

critical

This if statement has been incorrectly indented. It should be at the same indentation level as the if statement on line 1378. The current indentation will lead to a syntax error because the following lines (1386-1389) are not correctly indented as the body of this if block.

            if new_token_ids and self.structured_output_manager.should_advance(request, new_token_ids):

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@@ -1607,7 +1607,7 @@ def update_draft_token_ids(self, draft_token_ids: DraftTokenIds) -> None:
continue

# Add newly generated spec token ids to the request.
- if self.structured_output_manager.should_advance(request):
+ if self.structured_output_manager.should_advance(request, spec_token_ids):
Contributor

critical

This if statement has been incorrectly indented. It is now at a deeper level, but its body on the following lines (1611-1612) has not been indented, which will cause a syntax error. This if block should not be indented.

            if self.structured_output_manager.should_advance(request, spec_token_ids):

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@@ -1636,7 +1636,7 @@ def update_draft_token_ids_in_output(
# (needed for chunked prefill case for example).
del spec_token_ids[orig_num_spec_tokens:]
# Filter out spec tokens which do not adhere to the grammar.
- if self.structured_output_manager.should_advance(request):
+ if self.structured_output_manager.should_advance(request, spec_token_ids):
Contributor

critical

This if statement has been incorrectly indented. It is now at a deeper level, but its body on the following lines (1640-1642) has not been indented accordingly. This will cause a syntax error. The if statement should not be indented.

Suggested change
- if self.structured_output_manager.should_advance(request, spec_token_ids):
+ if self.structured_output_manager.should_advance(request, spec_token_ids):

@dosubot

dosubot Bot commented Feb 20, 2026

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


@nbethala
Author

Updated the PR description with a concise summary of the fix. All checks except Buildkite have completed, and the remaining pipeline is pending in the queue. This is ready for review.

@@ -25,6 +25,9 @@ def end_token(self) -> str:
"""The token that ends reasoning content."""
return "</think>"

+ def is_reasoning_end_streaming(self, all_token_ids, new_token_ids):
Contributor

is this really necessary here?
To me, it looks as if the parent class already implements the same logic, see https://github.com/nbethala/vllm/blob/97505c01ba777acaef0c259dd4d9d959d800703c/vllm/reasoning/basic_parsers.py#L79-L83

@sfbemerk
Contributor

Hi, I just ran your branch with my test setup from #34241

vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-8B","num_speculative_tokens":5}'

and payload

{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}

Unfortunately, it fails with

(EngineCore_DP0 pid=770) Process EngineCore_DP0:
(EngineCore_DP0 pid=770) Traceback (most recent call last):
(EngineCore_DP0 pid=770)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=770)     self.run()
(EngineCore_DP0 pid=770)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=770)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1040, in run_engine_core
(EngineCore_DP0 pid=770)     raise e
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1029, in run_engine_core
(EngineCore_DP0 pid=770)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1056, in run_busy_loop
(EngineCore_DP0 pid=770)     self._process_engine_step()
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1092, in _process_engine_step
(EngineCore_DP0 pid=770)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=770)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 395, in step
(EngineCore_DP0 pid=770)     grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
(EngineCore_DP0 pid=770)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/sched/scheduler.py", line 1247, in get_grammar_bitmask
(EngineCore_DP0 pid=770)     bitmask = self.structured_output_manager.grammar_bitmask(
(EngineCore_DP0 pid=770)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=770)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/structured_output/__init__.py", line 268, in grammar_bitmask
(EngineCore_DP0 pid=770)     assert accepted, (token, req_id, scheduled_spec_decode_tokens)
(EngineCore_DP0 pid=770)            ^^^^^^^^
(EngineCore_DP0 pid=770) AssertionError: (975, 'chatcmpl-198b3ec9bc4245d3a12053cd05a92557-0888-bc0b5598', {'chatcmpl-198b3ec9bc4245d3a12053cd05a92557-0888-bc0b5598': [975, 624, 151668, 271, 73594]})

Maybe you could test and revisit your bugfix again for num-speculative-tokens > 1? I'd recommend the test setup above, with the draft model set to the full model (ideally a tiny model), because that gives the highest acceptance rates even for very large numbers of speculative tokens.

@nbethala
Author

Update:

I was able to fully reproduce the same failure on my side using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with num_speculative_tokens=5 (draft model = full model) and a JSON-schema response_format.

The crash occurs in StructuredOutputManager.grammar_bitmask() when a speculative draft token is rejected by the JSON grammar:

assert accepted, (token, req_id, scheduled_spec_decode_tokens)

In my reproduction, the offending speculative token decodes to </think>. With multi-token speculative decoding enabled, the draft sequence can include tokens (e.g., reasoning delimiters) that are not valid under the JSON grammar at that point. The current implementation asserts on this condition rather than handling the invalid speculative branch gracefully.

Next steps:
I’m going to patch grammar_bitmask() so that invalid speculative sequences are handled safely (e.g., truncate on first rejection instead of asserting), and then re-run the same harness across num_speculative_tokens = {1, 2, 5, 10}.

I’ll follow up with a minimal patch and a regression test that covers this interaction.
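
The "truncate on first rejection" idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: truncate_at_first_rejection and the accepts callback are hypothetical names, the grammar is a toy set of allowed token IDs, and a real xgrammar matcher is stateful, so an actual patch would also need to advance or roll back matcher state consistently.

```python
from typing import Callable, List


def truncate_at_first_rejection(
    draft_token_ids: List[int],
    accepts: Callable[[int], bool],
) -> List[int]:
    """Keep the longest prefix of the draft that the grammar accepts."""
    kept: List[int] = []
    for token_id in draft_token_ids:
        if not accepts(token_id):
            # Drop this token and everything after it instead of asserting.
            break
        kept.append(token_id)
    return kept


# Toy grammar (assumption): only these token IDs are valid JSON continuations.
VALID_JSON_TOKENS = {10, 11, 12}
draft = [10, 11, 151649, 12]  # 151649 is the </think> ID quoted in the PR description
print(truncate_at_first_rejection(draft, VALID_JSON_TOKENS.__contains__))  # prints [10, 11]
```

The design choice here is that a rejected draft token is treated as an ordinary speculative miss: the valid prefix is kept and the engine keeps running, rather than failing the whole request.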

@nbethala nbethala force-pushed the fix-deepseek-r1-structured-output branch from 97505c0 to b71cbab Compare February 23, 2026 07:14
@mergify
Contributor

mergify Bot commented Feb 23, 2026

Hi @nbethala, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
…n_ids

Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
@nbethala nbethala force-pushed the fix-deepseek-r1-structured-output branch from 50b3d3e to 44c209e Compare February 23, 2026 18:44
@mergify
Contributor

mergify Bot commented Feb 23, 2026

Hi @nbethala, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
@nbethala
Author

FYI: local pre-commit hook mypy-local currently fails on origin/main:

vllm/v1/worker/gpu_worker.py:760: error: Item "None" of "TorchProfilerWrapper | CudaProfilerWrapper | None" has no attribute "start" [union-attr]

Repro:
git checkout origin/main
uv pip install pre-commit
pre-commit run mypy-local --all-files

Also, buildkite/ci/pr is still “Expected — Waiting for status to be reported” with no details link. Can someone retrigger Buildkite or check the GitHub status integration?

@sfbemerk
Contributor

I think the current branch is still not 100% working. When I run vllm with my tests and your branch, the </think> token is dropped, so reasoning never ends. The JSON is generated correctly, but it lands at the end of "reasoning_content" instead of in "content".

{
  "role": "assistant",
  "content": null,
  "reasoning_content": "\nOkay, the user wants me to imagine a fantasy hero [...] Now, make sure there are no typos and that the JSON{ \"name\": \"Eldrin the Stormborn\", \"age\": 28, \"role\": \"warrior\" }",
  "tool_calls": null
}
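
The response above is consistent with how a delimiter-based split behaves when the end token never reaches the output. The following is a minimal sketch of that behavior, not vLLM's reasoning parser: split_reasoning is a hypothetical helper that splits on the first </think> and treats everything as reasoning when the token is absent.

```python
from typing import Optional, Tuple

END_TOKEN = "</think>"


def split_reasoning(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content), split on the first end token."""
    reasoning, sep, rest = text.partition(END_TOKEN)
    if not sep:
        # End token never appeared: everything is treated as reasoning.
        return text, None
    # Empty text after the end token is reported as no content.
    return reasoning, rest or None
```

For example, split_reasoning('thinking{"a": 1}') leaves the JSON inside the reasoning part with content set to None, whereas split_reasoning('thinking</think>{"a": 1}') separates the two.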

@mergify
Contributor

mergify Bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @nbethala.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Feb 24, 2026
…rsor

Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
@nbethala nbethala force-pushed the fix-deepseek-r1-structured-output branch from cb847d0 to 78de290 Compare February 24, 2026 21:36
@nbethala
Author

Upstream now includes the islice‑based reasoning‑end detection, which addresses the core issue this PR targeted. Since the fix is already covered on main, I’m closing this PR.

Thanks for the guidance and review throughout the debugging process. It helped me understand the speculative decoding + structured output interaction much more deeply.
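
The upstream islice-based change is not shown in this thread. As a rough illustration only, a check of that general shape might scan just the most recently added tokens with itertools.islice; reasoning_ended_in_tail and its parameters are hypothetical names chosen for this sketch and should not be read as the code on main.

```python
from itertools import islice
from typing import Sequence


def reasoning_ended_in_tail(
    all_token_ids: Sequence[int],
    num_new_tokens: int,
    end_token_id: int,
) -> bool:
    """Check only the last num_new_tokens entries for the end-of-reasoning ID."""
    start = max(len(all_token_ids) - num_new_tokens, 0)
    # islice avoids building a new list; it still steps past the first
    # `start` items internally before yielding the tail.
    return any(tok == end_token_id for tok in islice(all_token_ids, start, None))
```

Because the check is anchored to the end of the sequence rather than to a separately maintained counter, it does not depend on whether that counter was updated before or after the tokens were appended.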

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Bug: Speculative Decoding (MTP) Causes </think> Detection Failure in Structured Output + Reasoning Mode

2 participants