[Structured Output][Reasoning] Improves decoding throughput for models using single-token reasoning endings.#30056
Conversation
Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/
Code Review
This pull request introduces a significant performance improvement for structured output with single-token reasoning endings by adding the is_reasoning_end_on_decode_step method. The changes are well-structured, with updates to the core interface, implementations for various parsers, corresponding tests, and documentation. The benchmark results clearly demonstrate the effectiveness of the optimization. My main feedback is a critical performance issue in the olmo3_reasoning_parser where a major optimization opportunity was missed, which I've detailed in a specific comment.
```diff
@@ -243,6 +243,9 @@ def is_reasoning_end(self, input_ids: list[int]) -> bool:
         text = self.model_tokenizer.decode(input_ids)
         return self.think_end in text
 
+    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
+        return self.is_reasoning_end(input_ids)
```
Calling self.is_reasoning_end() here is highly inefficient as it decodes the entire input_ids sequence on every step. This negates the performance benefits of this PR for olmo3 models.
A much more performant implementation should be used that avoids decoding the full sequence by checking only a suffix of the input_ids. Since this method is called at each step to detect the first appearance of the end marker, checking a suffix is a safe and efficient heuristic.
```diff
-        return self.is_reasoning_end(input_ids)
+        # Avoid decoding the whole sequence by checking only a suffix.
+        suffix_len = 32
+        if len(input_ids) < suffix_len:
+            text_to_check = self.model_tokenizer.decode(input_ids)
+        else:
+            text_to_check = self.model_tokenizer.decode(input_ids[-suffix_len:])
+        return self.think_end in text_to_check
```
Amazing!
docs/features/reasoning_outputs.md
```diff
@@ -297,6 +297,9 @@ Additionally, to enable structured output, you'll need to create a new `Reasoner
     def is_reasoning_end(self, input_ids: list[int]) -> bool:
         return self.end_token_id in input_ids
 
+    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
```
I suggest adding a `**kwargs` parameter to `is_reasoning_end`. What do you think?
It's currently known that there are some reasoning_parser plugins outside of vLLM. I'm concerned that this change might be too aggressive.
```diff
@@ -326,7 +326,7 @@ def should_advance(self, request: Request) -> bool:
             return True
 
         # Check if reasoning ends in *this* step
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
```
```diff
-        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(request.all_token_ids, step="decode"):
```
WDYT?
Thanks for the suggestion! If plugins outside vLLM can use this interface, I agree that adding a new method would break them.
I initially hesitated between this and your approach.
Either we provide more information to the `is_reasoning_end` function to stay backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (`extract_...` + `is_reasoning_end`) but also in the EngineCore process. Maybe another approach could be a dedicated class for structured output (instantiated per request), which would let us apply the same approach to multi-token reasoning endings too. WDYT?
On the engine core side, we would instantiate the ReasoningParser -> `get_structured_output_reasoning_checker()`, with a backward-compatibility fallback when the function is not implemented by external plugins.
I will update my PR with your suggested changes.
> Either we provide more information to the is_reasoning_end function to be retro-compatible. However, by looking to the code, the same class is used for 2 different purposes: at the frontend level (extract_... + is_reasoning_ends) but also in the EngineCore process. Maybe another approach could be to have a dedicated class for the structured output (initiated per request) that let us do the same approach for multi token end reasoning too. WDYT ?
Yes, I completely agree with your point. In fact, I have previously tried to optimize is_reasoning_end, but I encountered a difficult issue: as you mentioned, is_reasoning_end is used in multiple places, especially in the frontend + streaming scenario. See #25735 (comment).
Given that is_reasoning_end now has such a significant impact on performance, I believe it's time to address this issue.
As I just mentioned, I'm concerned that overly aggressive changes could break downstream dependencies. I hope we can find a more backward-compatible way to implement the modifications.
Thanks for your answer. This design suggestion is more for a future version, but indeed it would be breaking.
However, if I add a `**kwargs` argument to `is_reasoning_end`, external plugins would still be broken, no?
Otherwise, we provide a default implementation:
```python
@abstractmethod
def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
    """
    Check if the reasoning content ends in the input_ids on a
    decode step.

    It is used in structured engines like `xgrammar` to check if the
    reasoning content ends in the model output before applying the
    structured output.

    Notes:
    - The first time the reasoning content ends during a decode step,
      this method returns True. StructuredOutputManager then caches
      the result.
    - Subsequent decode steps for the same reasoning segment can
      return False or True.

    Parameters:
        input_ids: list[int]
            The input_ids of the model output at the current decode step.

    Returns:
        bool
            True if the reasoning content ends in the input_ids on a
            decode step.
    """
    return self.is_reasoning_end(input_ids)
```
WDYT ?
```diff
@@ -326,7 +326,7 @@ def should_advance(self, request: Request) -> bool:
             return True
 
         # Check if reasoning ends in *this* step
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
```
```diff
-        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(request.all_token_ids[request.num_computed_tokens:]):
```
@hdlj-h 🤔 Perhaps this should be handled this way. WDYT?
We only need to compute the newly generated tokens, not the tokens that have already been computed.
Indeed, that is also an option, but some reasoning parsers rely on multi-token end markers, so the assumption that all parsers are single-token–based does not always hold.
For example:
• OLMo3 operates in string space rather than token space. It may not be compatible with structured output for this reason.
• Granite also works in string space and does not implement is_reasoning_end, meaning it could crash if used with structured output.
• Hunyuan maintains an internal state machine for reasoning, which likely makes it incompatible with structured output + reasoning.
Can `num_computed_tokens` be greater than 1 in V1? Or is it the output-token counter? If so, my PR is not bulletproof, because I assumed that one decoding step == one generated token.
> num_computed_tokens can be greater than 1 in the V1 ?

Yes.
OLMo3, Granite, and Hunyuan — I think we can set these aside for now. They are inherently incompatible with structured output to begin with.
Therefore, we should ignore them for the time being and optimize them separately later.
> num_computed_tokens can be greater than 1 in the V1 ?

Yes.
Ah, I didn't take the speculative decoding case into account, so my PR is not compatible when `num_computed_tokens` > 1.
I will revert my change and apply your approach instead, which is a smaller change.
@chaunceyjiang Regarding `num_computed_tokens`, I can't find the related docs: is it the number of computed tokens before the step, or the number computed during the step?
chaunceyjiang
left a comment
Thanks~
Can you help test the benchmark again?
@chaunceyjiang I have benchmarked the new approach, but I observed a 10% slowdown in decoding throughput if we support interleaved reasoning/content in the model output. Since this is not really supported by the structured output manager, I changed the implementation to check only the `end_token_id` in the delta. I am building the new image and will keep you posted with the new benchmark results.
Hi @hdlj-h, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Hi @hdlj-h any updates?
I am currently running the benchmarks with Holo2 8B and Holo2 30B-A3B on H100. I will update the PR when they complete.
@chaunceyjiang I have updated the PR description with the latest benchmark numbers for Holo2 8B and Holo2 30B A3B. I also included a breakdown comparing your PR #30033 with mine. By fixing the interleaved reasoning tokens, vLLM was already faster, since many chat templates end with "<think>", so the full prompt no longer needs to be checked to determine the end of reasoning, which has a significant impact on throughput. I was unable to reproduce my initial results with Holo2 because the "fast" implementation wasn't used due to parser composition. I fixed this in my latest commit and applied the same approach to DeepSeekV3. Now the initial results match those obtained with the new delta-ids method.
chaunceyjiang
left a comment
LGTM.
I sincerely thank you for the effort and time you have put into this. @hdlj-h
Purpose
This PR introduces `is_reasoning_end_streaming` to the `ReasoningParser` interface, implements it in all reasoning parsers, and uses it in the StructuredOutputManager.

Previously, `is_reasoning_end` checked for the reasoning end token at every decode step by scanning all `input_ids` for models that rely on a single token to end reasoning. This was called at every decoding step in the StructuredOutputManager and could be inefficient for models that end reasoning with a single token (long reasoning + structured output).

The new method checks only the last token of the current decoding step (`BaseThinkingReasoningParser`).
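The idea can be sketched as a toy illustration (the class and method names here are hypothetical, not the actual vLLM interface): instead of rescanning the full sequence each step, check only the delta tokens of the current step and cache the result after the first hit.

```python
class SingleTokenReasoningChecker:
    """Illustrative sketch of the single-token fast path, assuming a
    parser whose reasoning ends with a single end_token_id."""

    def __init__(self, end_token_id: int):
        self.end_token_id = end_token_id
        self._ended = False  # set once the end token has been seen

    def is_reasoning_end_full_scan(self, input_ids: list[int]) -> bool:
        # Previous behaviour: O(n) membership test over the whole
        # sequence at every decode step.
        return self.end_token_id in input_ids

    def is_reasoning_end_streaming(self, delta_ids: list[int]) -> bool:
        # New behaviour: only the tokens generated in this step are
        # checked; after the first True the cached flag is returned.
        if not self._ended and self.end_token_id in delta_ids:
            self._ended = True
        return self._ended
```

With long reasoning segments, the full scan grows linearly with the sequence length, while the streaming check stays constant per step.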
For multi-token reasoning endings (like gptoss), this PR preserves the previous behaviour by calling `is_reasoning_end` and doesn't bring improvements.

Additional benefits from #30033
During the implementation of this PR, #30033 (Fix the issue with interleaved thinking when using streaming) was merged, which already brought a significant improvement in output token throughput.
Even though is_reasoning_end still checks the full input, handling interleaved reasoning requires returning False if the last reasoning token is <think>. Many chat templates end with <think>, which is why #30033 improved the previous behavior by checking only the model output instead of the full prompt.
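A toy illustration of that interleaved-reasoning rule (the function and parameter names are hypothetical, not the vLLM code): the most recent marker in the generated output, not the prompt, decides whether reasoning has ended.

```python
def is_reasoning_end_interleaved(
    output_ids: list[int], start_token_id: int, end_token_id: int
) -> bool:
    """Sketch: with interleaved reasoning the model can re-enter a
    thinking block, so we scan the model output backwards and let the
    most recent <think>/</think> marker decide."""
    for token_id in reversed(output_ids):
        if token_id == end_token_id:
            return True   # the last marker closed the reasoning block
        if token_id == start_token_id:
            return False  # a new <think> re-opened reasoning
    return False          # no marker seen in the output yet
```

Because only `output_ids` is scanned, a chat template whose prompt ends with `<think>` no longer forces a full-prompt check at every step.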
Test Plan

- `tests/reasoning/test_base_thinking_reasoning_parser.py`
- Benchmark with `extra-body` to activate the structured output manager and a logit bias to force the model to only think.

Test Result
Holo2 8B bench result on H100
main before this PR: Fix the issue with interleaved thinking when using streaming
[main] After the fix in the interleaved reasoning
This PR: (Checking only the delta)
Holo2 30B A3B bench result on H100
main before this PR: Fix the issue with interleaved thinking when using streaming
[main] After the fix in the interleaved reasoning
This PR: (Checking only the delta)