[Structured Output][Reasoning] Improves decoding throughput for models using single-token reasoning endings.#30056
Conversation
Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/
Code Review
This pull request introduces a significant performance improvement for structured output with single-token reasoning endings by adding the is_reasoning_end_on_decode_step method. The changes are well-structured, with updates to the core interface, implementations for various parsers, corresponding tests, and documentation. The benchmark results clearly demonstrate the effectiveness of the optimization. My main feedback is a critical performance issue in the olmo3_reasoning_parser where a major optimization opportunity was missed, which I've detailed in a specific comment.
```diff
@@ -243,6 +243,9 @@ def is_reasoning_end(self, input_ids: list[int]) -> bool:
         text = self.model_tokenizer.decode(input_ids)
         return self.think_end in text
 
+    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
+        return self.is_reasoning_end(input_ids)
```
Calling self.is_reasoning_end() here is highly inefficient as it decodes the entire input_ids sequence on every step. This negates the performance benefits of this PR for olmo3 models.
A much more performant implementation should be used that avoids decoding the full sequence by checking only a suffix of the input_ids. Since this method is called at each step to detect the first appearance of the end marker, checking a suffix is a safe and efficient heuristic.
```diff
-        return self.is_reasoning_end(input_ids)
+        # Avoid decoding the whole sequence by checking only a suffix.
+        suffix_len = 32
+        if len(input_ids) < suffix_len:
+            text_to_check = self.model_tokenizer.decode(input_ids)
+        else:
+            text_to_check = self.model_tokenizer.decode(input_ids[-suffix_len:])
+        return self.think_end in text_to_check
```
Amazing!
docs/features/reasoning_outputs.md
```diff
@@ -297,6 +297,9 @@ Additionally, to enable structured output, you'll need to create a new `Reasoner
     def is_reasoning_end(self, input_ids: list[int]) -> bool:
         return self.end_token_id in input_ids
 
+    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
```
I suggest adding a `**kwargs` parameter to `is_reasoning_end`. What do you think?
It's currently known that there are some reasoning_parser plugins outside of vLLM. I'm concerned that this change might be too aggressive.
```diff
@@ -326,7 +326,7 @@ def should_advance(self, request: Request) -> bool:
             return True
 
         # Check if reasoning ends in *this* step
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
```
```diff
-        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(request.all_token_ids, step="decode"):
```
WDYT?
Thanks for the suggestion! If plugins outside vLLM can use this interface, I agree that adding a new method would break them.
I initially hesitated between this and your approach.
Either we provide more information to the `is_reasoning_end` function to stay backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (`extract_...` + `is_reasoning_end`) but also in the EngineCore process. Maybe another approach could be a dedicated class for structured output (instantiated per request), which would let us apply the same approach to multi-token reasoning endings too. WDYT?
On the engine core side, we would instantiate the ReasoningParser -> `get_structured_output_reasoning_checker()`, with a backward-compatibility fallback when the function is not implemented by external plugins.
I will update my PR with your suggested changes.
> Either we provide more information to the is_reasoning_end function to be retro-compatible. However, by looking to the code, the same class is used for 2 different purposes: at the frontend level (extract_... + is_reasoning_ends) but also in the EngineCore process. Maybe another approach could be to have a dedicated class for the structured output (initiated per request) that let us do the same approach for multi token end reasoning too. WDYT ?
Yes, I completely agree with your point. In fact, I have previously tried to optimize is_reasoning_end, but I encountered a difficult issue: as you mentioned, is_reasoning_end is used in multiple places, especially in the frontend + streaming scenario. See #25735 (comment).
Given that is_reasoning_end now has such a significant impact on performance, I believe it's time to address this issue.
As I just mentioned, I'm concerned that overly aggressive changes could break downstream dependencies. I hope we can find a more backward-compatible way to implement the modifications.
Thanks for your answer. This design suggestion is more for a future version, but indeed it would be breaking.
However, if I add a `**kwargs` argument to `is_reasoning_end`, external plugins would still be broken, no?
Otherwise, we provide a default implementation:
```python
@abstractmethod
def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
    """
    Check if the reasoning content ends in the input_ids on a
    decode step.

    It is used in structured engines like `xgrammar` to check if the
    reasoning content ends in the model output before applying the
    structured output.

    Notes:
    - The first time the reasoning content ends during a decode step,
      this method returns True. StructuredOutputManager then caches
      the result.
    - Subsequent decode steps for the same reasoning segment can
      return False or True.

    Parameters:
        input_ids: list[int]
            The input_ids of the model output at the current decode step.

    Returns:
        bool
            True if the reasoning content ends in the input_ids on a
            decode step.
    """
    return self.is_reasoning_end(input_ids)
```
WDYT ?
```diff
@@ -326,7 +326,7 @@ def should_advance(self, request: Request) -> bool:
             return True
 
         # Check if reasoning ends in *this* step
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
```
```diff
-        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(request.all_token_ids[request.num_computed_tokens:]):
```
@hdlj-h 🤔 Perhaps this should be handled this way. WDYT?
We only need to compute the newly generated tokens, not the tokens that have already been computed.
Indeed, that is also an option, but some reasoning parsers rely on multi-token end markers, so the assumption that all parsers are single-token–based does not always hold.
For example:
• OLMo3 operates in string space rather than token space. It may not be compatible with structured output for this reason.
• Granite also works in string space and does not implement is_reasoning_end, meaning it could crash if used with structured output.
• Hunyuan maintains an internal state machine for reasoning, which likely makes it incompatible with structured output + reasoning.
Can `num_computed_tokens` be greater than 1 in V1? Or is it the output-token counter? If so, my PR is not bulletproof, because I assumed that one decoding step == one generated token.
> num_computed_tokens can be greater than 1 in the V1 ?

Yes.
OLMo3, Granite, and Hunyuan — I think we can set these aside for now. They are inherently incompatible with structured output to begin with.
Therefore, we should ignore them for the time being and optimize them separately later.
> num_computed_tokens can be greater than 1 in the V1 ?

Yes.
Ah, I didn't take the speculative decoding case into account, so my PR is not compatible when `num_computed_tokens` > 1.
I will revert my change and apply your approach instead, which is a smaller change.
@chaunceyjiang Regarding `num_computed_tokens`, I can't find the related docs: is it the number of computed tokens before the step, or the number computed during the step?
chaunceyjiang
left a comment
Thanks~
Can you help test the benchmark again?
@chaunceyjiang I have benchmarked the new approach, but I observed a 10% slowdown in decoding throughput if we support interleaved reasoning/content in the model output. Since this is not really supported by the structured output manager, I changed the implementation to check only the `end_token_id` in the delta. I am building the new image and will keep you posted with the new benchmark results.
Hi @hdlj-h, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Hi @hdlj-h any updates?
I am currently running the benchmarks with Holo2 8B and Holo2 30B-A3B on H100. I will update the PR when they complete.
@chaunceyjiang I have updated the PR description with the latest benchmark numbers for Holo2 8B and Holo2 30B A3B. I also included a breakdown comparing your PR #30033 with mine. By fixing the interleaved reasoning tokens, vLLM was already faster, since many chat templates end with "<think>", so the full prompt no longer needs to be checked to determine the end of reasoning, which has a significant impact on throughput. I was unable to reproduce my initial results with Holo2 because the "fast" implementation wasn't used due to parser composition. I fixed this in my latest commit and applied the same approach to DeepSeekV3. Now the initial results match those obtained with the new delta-ids method.
chaunceyjiang
left a comment
LGTM.
I sincerely thank you for the effort and time you have put into this. @hdlj-h
Purpose
This PR introduces `is_reasoning_end_streaming` to the `ReasoningParser` interface, implements it in all reasoning parsers, and uses it in the StructuredOutputManager.

Previously, `is_reasoning_end` checked for the reasoning end token at every decode step by scanning all `input_ids` for models that rely on a single token to end reasoning. This was called at every decoding step in the StructuredOutputManager and could be inefficient for models that end reasoning with a single token (long reasoning + structured output).

The new method checks only the last token of the current decoding step (`BaseThinkingReasoningParser`).
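The idea can be sketched as a toy illustration (the class and method names here are hypothetical, not the actual vLLM interface): instead of rescanning the full sequence each step, check only the delta tokens of the current step and cache the result after the first hit.

```python
class SingleTokenReasoningChecker:
    """Illustrative sketch of the single-token fast path, assuming a
    parser whose reasoning ends with a single end_token_id."""

    def __init__(self, end_token_id: int):
        self.end_token_id = end_token_id
        self._ended = False  # set once the end token has been seen

    def is_reasoning_end_full_scan(self, input_ids: list[int]) -> bool:
        # Previous behaviour: O(n) membership test over the whole
        # sequence at every decode step.
        return self.end_token_id in input_ids

    def is_reasoning_end_streaming(self, delta_ids: list[int]) -> bool:
        # New behaviour: only the tokens generated in this step are
        # checked; after the first True the cached flag is returned.
        if not self._ended and self.end_token_id in delta_ids:
            self._ended = True
        return self._ended
```

With long reasoning segments, the full scan grows linearly with the sequence length, while the streaming check stays constant per step.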
For multi-token reasoning endings (like gptoss), this PR preserves the previous behaviour by calling `is_reasoning_end` and doesn't bring improvements.

Additional benefits from #30033
During the implementation of this PR, #30033 (Fix the issue with interleaved thinking when using streaming) was merged, which already brought a significant improvement in output token throughput.
Even though is_reasoning_end still checks the full input, handling interleaved reasoning requires returning False if the last reasoning token is <think>. Many chat templates end with <think>, which is why #30033 improved the previous behavior by checking only the model output instead of the full prompt.
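A toy illustration of that interleaved-reasoning rule (the function and parameter names are hypothetical, not the vLLM code): the most recent marker in the generated output, not the prompt, decides whether reasoning has ended.

```python
def is_reasoning_end_interleaved(
    output_ids: list[int], start_token_id: int, end_token_id: int
) -> bool:
    """Sketch: with interleaved reasoning the model can re-enter a
    thinking block, so we scan the model output backwards and let the
    most recent <think>/</think> marker decide."""
    for token_id in reversed(output_ids):
        if token_id == end_token_id:
            return True   # the last marker closed the reasoning block
        if token_id == start_token_id:
            return False  # a new <think> re-opened reasoning
    return False          # no marker seen in the output yet
```

Because only `output_ids` is scanned, a chat template whose prompt ends with `<think>` no longer forces a full-prompt check at every step.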
Test Plan

- `tests/reasoning/test_base_thinking_reasoning_parser.py`
- Benchmark with `extra-body` to activate the structured output manager and a logit bias to force the model to only think.

Test Result
Holo2 8B bench result on H100
main before this PR: Fix the issue with interleaved thinking when using streaming
[main] After the fix in the interleaved reasoning
This PR: (Checking only the delta)
Holo2 30B A3B bench result on H100
main before this PR: Fix the issue with interleaved thinking when using streaming
[main] After the fix in the interleaved reasoning
This PR: (Checking only the delta)