fix: DeepSeek-R1 structured-output reasoning end detection (scheduler + parser)#34978
fix: DeepSeek-R1 structured-output reasoning end detection (scheduler + parser)#34978nbethala wants to merge 8 commits into
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request aims to fix an issue with DeepSeek-R1 structured output and speculative decoding by passing new_token_ids to should_advance and updating the reasoning parser. While the logic of passing new_token_ids is correct, all three changes in vllm/v1/core/sched/scheduler.py introduce critical indentation errors that will cause the program to fail with a syntax error. These indentation issues must be fixed.
| @@ -1382,7 +1382,7 @@ def update_from_output( | |||
| ): | |||
| new_logprobs = logprobs.slice_request(req_index, len(new_token_ids)) | |||
|
|
|||
| if new_token_ids and self.structured_output_manager.should_advance(request): | |||
| if new_token_ids and self.structured_output_manager.should_advance(request, new_token_ids): | |||
There was a problem hiding this comment.
This if statement has been incorrectly indented. It should be at the same indentation level as the if statement on line 1378. The current indentation will lead to a syntax error because the following lines (1386-1389) are not correctly indented as the body of this if block.
if new_token_ids and self.structured_output_manager.should_advance(request, new_token_ids):| @@ -1607,7 +1607,7 @@ def update_draft_token_ids(self, draft_token_ids: DraftTokenIds) -> None: | |||
| continue | |||
|
|
|||
| # Add newly generated spec token ids to the request. | |||
| if self.structured_output_manager.should_advance(request): | |||
| if self.structured_output_manager.should_advance(request, spec_token_ids): | |||
There was a problem hiding this comment.
| @@ -1636,7 +1636,7 @@ def update_draft_token_ids_in_output( | |||
| # (needed for chunked prefill case for example). | |||
| del spec_token_ids[orig_num_spec_tokens:] | |||
| # Filter out spec tokens which do not adhere to the grammar. | |||
| if self.structured_output_manager.should_advance(request): | |||
| if self.structured_output_manager.should_advance(request, spec_token_ids): | |||
There was a problem hiding this comment.
This if statement has been incorrectly indented. It is now at a deeper level, but its body on the following lines (1640-1642) has not been indented accordingly. This will cause a syntax error. The if statement should not be indented.
| if self.structured_output_manager.should_advance(request, spec_token_ids): | |
| if self.structured_output_manager.should_advance(request, spec_token_ids): |
c80cab6 to
97505c0
Compare
|
Updated the PR description with a concise summary of the fix. All checks except Buildkite have completed, and the remaining pipeline is pending in the queue. This is ready for review. |
| @@ -25,6 +25,9 @@ def end_token(self) -> str: | |||
| """The token that ends reasoning content.""" | |||
| return "</think>" | |||
|
|
|||
| def is_reasoning_end_streaming(self, all_token_ids, new_token_ids): | |||
There was a problem hiding this comment.
is this really necessary here?
To me, it looks as if the parent class already implements the same logic, see https://github.com/nbethala/vllm/blob/97505c01ba777acaef0c259dd4d9d959d800703c/vllm/reasoning/basic_parsers.py#L79-L83
|
Hi, I just ran your branch with my test setup from #34241 and payload Unfortunately, it fails with Maybe you could test and revisit your bugfix again for |
|
Update: I was able to fully reproduce the same failure on my side using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with num_speculative_tokens=5 (draft model = full model) and a JSON-schema response_format. The crash occurs in StructuredOutputManager.grammar_bitmask() when a speculative draft token is rejected by the JSON grammar: In my reproduction, the offending speculative token decodes to Next steps: I’ll follow up with a minimal patch and a regression test that covers this interaction. |
97505c0 to
b71cbab
Compare
|
Hi @nbethala, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
2 similar comments
|
Hi @nbethala, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Hi @nbethala, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com> Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com> Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
…n_ids Signed-off-by: Nancy Bethala <nancybethala2013@gmail.com> Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
50b3d3e to
44c209e
Compare
|
Hi @nbethala, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
|
FYI: local pre-commit hook vllm/v1/worker/gpu_worker.py:760: error: Item "None" of "TorchProfilerWrapper | CudaProfilerWrapper | None" has no attribute "start" [union-attr] Repro: Also, |
|
I think the current branch is still not 100% working. When I run vllm with my tests and your branch, the token is dropped, so reasoning never ends. The JSON is generated correctly, but at the end of "reasoning_content", not in "content". |
|
This pull request has merge conflicts that must be resolved before it can be |
…rsor Signed-off-by: nancy.Bethala <nancybethala2013@gmail.com>
cb847d0 to
78de290
Compare
|
Upstream now includes the islice‑based reasoning‑end detection, which addresses the core issue this PR targeted. Since the fix is already covered on main, I’m closing this PR. Thanks for the guidance and review throughout the debugging process it helped me understand the speculative decoding + structured output interaction much more deeply. |
Fix: Structured Output + Speculative Decoding Stability (DeepSeek‑R1)
Summary
This PR resolves instability when using:
Previously, speculative draft tokens that violated the JSON grammar most notably
</think>(token ID 151649)—could cause:Changes
69c180b0f
b71cbab18
- normal decode path
- speculative accepted tokens
- speculative bonus token
939f6d3a7
- Treat rejected tokens as non‑fatal
- Downgrade error‑level failures to debug logs
- Allow speculative decoding to gracefully reject invalid draft continuations (e.g.,
</think>)Reproduction (Public)
Server
Stress test
Before
</think>After
Notes
Structured‑output correctness (e.g., preamble text before {) appears to be a separate issue and is not addressed in this PR.
UPDATE (Feb 20): Rewrote the summary for clarity . The PR fixes a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.
Summary
This PR completes DeepSeek‑R1 structured‑output integration by fixing a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.
Problem:
DeepSeek‑R1 must detect the
</think>token to transition from free‑form reasoning into JSON‑constrained structured output. While the scheduler was updated to pass new_token_ids for reasoning‑end detection, StructuredOutputManager.should_advance() still used the old signature, causing it to ignore reasoning‑end state and fail to advance.Fix:
Update StructuredOutputManager.should_advance() to accept new_token_ids, aligning it with scheduler call sites and enabling correct reasoning‑end transitions.
Impact
Fixes #34650
Related: #34241, #31858
This PR fixes a severe issue where DeepSeek-R1 fails to detect the
</think>end-of-reasoning token when structured outputs and speculative decoding (MTP)
are enabled together.
Root cause:
Under speculative decoding, num_computed_tokens is incremented before tokens
are appended, causing the delta slice to be empty.
so
</think>was never detected during streaming.Fix:
This allows the FSM to correctly detect
</think>and transition into JSON mode.Fixes #34650
Related: #34241, #31858