
[Frontend] Improve the performance of is_reasoning_end #25735

Merged
DarkLight1337 merged 8 commits into vllm-project:main from chaunceyjiang:is_reasoning_end
Oct 11, 2025
Conversation

@chaunceyjiang
Collaborator

@chaunceyjiang chaunceyjiang commented Sep 26, 2025

Purpose

is_reasoning_end is executed once for every token generated during the reasoning phase, which hurts performance. This PR optimizes it.

Currently, is_reasoning_end cannot support incremental checking, for the following reason:

When stream=true, is_reasoning_end checks output.token_ids and res.prompt_token_ids separately. As a result, the inputs to is_reasoning_end are non-deterministic within a single request, which makes incremental checking impossible.

Therefore, the only feasible optimization at present is to modify is_reasoning_end to search backward for the end token, starting from the end of the token sequence.
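The backward scan can be sketched as follows (a minimal illustration assuming the parser exposes a single end-of-reasoning token id; this is not the exact vLLM implementation):

```python
def is_reasoning_end(token_ids: list[int], end_token_id: int) -> bool:
    """Check whether the end-of-reasoning token appears in token_ids.

    Scanning from the end is cheaper in the common case: once reasoning
    has finished, the end token sits near the tail of the sequence, so
    the loop exits after inspecting only a few tokens instead of
    re-scanning the whole sequence from the front on every step.
    """
    for token_id in reversed(token_ids):
        if token_id == end_token_id:
            return True
    return False
```

For example, `is_reasoning_end([5, 9, 42, 7], end_token_id=42)` returns True after inspecting only the last two tokens.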

Test Plan

vllm serve /home/jovyan/qwen3-8b  --reasoning-parser qwen3 --guided-decoding-backend xgrammar --enable-auto-tool-choice --tool-call-parser hermes --no-enable-prefix-caching
vllm bench serve \
  --backend vllm \
  --model /home/jovyan/qwen3-8b \
  --served-model-name /home/jovyan/qwen3-8b \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompt 100

main

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  95.70     
Total input tokens:                      204800    
Total generated tokens:                  96051     
Request throughput (req/s):              1.04      
Output token throughput (tok/s):         1003.69   
Peak output token throughput (tok/s):    1150.00   
Peak concurrent requests:                20.00     
Total Token throughput (tok/s):          3143.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          311.72    
Median TTFT (ms):                        282.02    
P99 TTFT (ms):                           525.12    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.56      
Median TPOT (ms):                        9.17      
P99 TPOT (ms):                           11.46     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.13      
Median ITL (ms):                         8.88      
P99 ITL (ms):                            10.55     
==================================================

this pr

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  95.65     
Total input tokens:                      204800    
Total generated tokens:                  96316     
Request throughput (req/s):              1.05      
Output token throughput (tok/s):         1006.96   
Peak output token throughput (tok/s):    1150.00   
Peak concurrent requests:                20.00     
Total Token throughput (tok/s):          3148.09   
---------------Time to First Token----------------
Mean TTFT (ms):                          316.62    
Median TTFT (ms):                        284.75    
P99 TTFT (ms):                           529.61    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.57      
Median TPOT (ms):                        9.12      
P99 TPOT (ms):                           11.55     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.11      
Median ITL (ms):                         8.87      
P99 ITL (ms):                            10.53     
==================================================

Test Result

The performance improvement is very marginal.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@chaunceyjiang chaunceyjiang marked this pull request as ready for review October 9, 2025 02:31
@chaunceyjiang chaunceyjiang requested a review from aarnphm as a code owner October 9, 2025 02:31
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@mergify

mergify bot commented Oct 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/0f3384862593027cd7ea8433c9c5314536011ecc/vllm/reasoning/basic_parsers.py#L60-L66
P0: Referencing an undefined token attribute in the incremental end check

The new incremental is_reasoning_end logic now compares tokens against self.think_end_token_id, but BaseThinkingReasoningParser only defines start_token_id/end_token_id. Subclasses like DeepSeekR1ReasoningParser and SeedOSSReasoningParser never create a think_end_token_id, so the first call to is_reasoning_end will raise AttributeError and break reasoning parsing for those models. The loop should use the existing self.end_token_id instead.
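A sketch of the fix Codex suggests, comparing against the attribute the base class actually defines (class shape simplified here for illustration; the real parser defines more state):

```python
class BaseThinkingReasoningParser:
    """Simplified stand-in for vLLM's base reasoning parser."""

    def __init__(self, end_token_id: int):
        # The base class defines end_token_id. Subclasses such as
        # DeepSeekR1ReasoningParser never create think_end_token_id,
        # so referencing self.think_end_token_id would raise
        # AttributeError on the first call.
        self.end_token_id = end_token_id

    def is_reasoning_end(self, token_ids: list[int]) -> bool:
        # Scan backward using self.end_token_id, the attribute that
        # exists on every subclass.
        for token_id in reversed(token_ids):
            if token_id == self.end_token_id:
                return True
        return False
```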


@mergify mergify bot removed the needs-rebase label Oct 9, 2025
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang chaunceyjiang requested a review from njhill October 9, 2025 02:40
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Collaborator Author

/cc @njhill Due to the various ways is_reasoning_end is used, it is not possible to perform incremental checks for the end token. Currently, the only feasible optimization is to search backward from the end of the sequence.

@DarkLight1337
Member

Can you edit the PR description accordingly?

@chaunceyjiang
Collaborator Author

@DarkLight1337 I've updated the PR description.

This optimization provides only a very small performance improvement. 😂

@DarkLight1337 DarkLight1337 merged commit be06786 into vllm-project:main Oct 11, 2025
46 checks passed
@chaunceyjiang chaunceyjiang deleted the is_reasoning_end branch October 11, 2025 03:55
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
