
[Frontend] Improve the performance of is_reasoning_end #25735

Merged
DarkLight1337 merged 8 commits into vllm-project:main from chaunceyjiang:is_reasoning_end
Oct 11, 2025
Conversation

@chaunceyjiang
Collaborator

@chaunceyjiang chaunceyjiang commented Sep 26, 2025

Purpose

is_reasoning_end is executed once for every token generated during the reasoning phase, which hurts performance. This PR optimizes it.

Currently, is_reasoning_end cannot support incremental checking, for the following reason:

When stream=true, is_reasoning_end checks output.token_ids and res.prompt_token_ids separately. As a result, the inputs to is_reasoning_end are non-deterministic within a single request, which makes incremental checking impossible.

Therefore, the only feasible optimization at present is to modify is_reasoning_end to search backward for the end token, starting from the end of the token sequence.
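The backward scan can be sketched as follows (a minimal illustration assuming the parser exposes a single end-of-reasoning token id; this is not the exact vLLM implementation):

```python
def is_reasoning_end(token_ids: list[int], end_token_id: int) -> bool:
    """Check whether the end-of-reasoning token appears in token_ids.

    Scanning from the end is cheaper in the common case: once reasoning
    has finished, the end token sits near the tail of the sequence, so
    the loop exits after inspecting only a few tokens instead of
    re-scanning the whole sequence from the front on every step.
    """
    for token_id in reversed(token_ids):
        if token_id == end_token_id:
            return True
    return False
```

For example, `is_reasoning_end([5, 9, 42, 7], end_token_id=42)` returns True after inspecting only the last two tokens.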

Test Plan

vllm serve /home/jovyan/qwen3-8b  --reasoning-parser qwen3 --guided-decoding-backend xgrammar --enable-auto-tool-choice --tool-call-parser hermes --no-enable-prefix-caching
vllm bench serve \
  --backend vllm \
  --model /home/jovyan/qwen3-8b \
  --served-model-name /home/jovyan/qwen3-8b \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompt 100

main

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  95.70     
Total input tokens:                      204800    
Total generated tokens:                  96051     
Request throughput (req/s):              1.04      
Output token throughput (tok/s):         1003.69   
Peak output token throughput (tok/s):    1150.00   
Peak concurrent requests:                20.00     
Total Token throughput (tok/s):          3143.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          311.72    
Median TTFT (ms):                        282.02    
P99 TTFT (ms):                           525.12    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.56      
Median TPOT (ms):                        9.17      
P99 TPOT (ms):                           11.46     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.13      
Median ITL (ms):                         8.88      
P99 ITL (ms):                            10.55     
==================================================

this pr

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  95.65     
Total input tokens:                      204800    
Total generated tokens:                  96316     
Request throughput (req/s):              1.05      
Output token throughput (tok/s):         1006.96   
Peak output token throughput (tok/s):    1150.00   
Peak concurrent requests:                20.00     
Total Token throughput (tok/s):          3148.09   
---------------Time to First Token----------------
Mean TTFT (ms):                          316.62    
Median TTFT (ms):                        284.75    
P99 TTFT (ms):                           529.61    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.57      
Median TPOT (ms):                        9.12      
P99 TPOT (ms):                           11.55     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.11      
Median ITL (ms):                         8.87      
P99 ITL (ms):                            10.53     
==================================================

Test Result

The performance improvement is very marginal.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@chaunceyjiang chaunceyjiang marked this pull request as ready for review October 9, 2025 02:31
@chaunceyjiang chaunceyjiang requested a review from aarnphm as a code owner October 9, 2025 02:31
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@mergify

mergify bot commented Oct 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/0f3384862593027cd7ea8433c9c5314536011ecc/vllm/reasoning/basic_parsers.py#L60-L66
P0: Referencing an undefined token attribute in the incremental end check

The new incremental is_reasoning_end logic now compares tokens against self.think_end_token_id, but BaseThinkingReasoningParser only defines start_token_id/end_token_id. Subclasses like DeepSeekR1ReasoningParser and SeedOSSReasoningParser never create a think_end_token_id, so the first call to is_reasoning_end will raise AttributeError and break reasoning parsing for those models. The loop should use the existing self.end_token_id instead.
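A sketch of the fix Codex suggests, comparing against the attribute the base class actually defines (class shape simplified here for illustration; the real parser defines more state):

```python
class BaseThinkingReasoningParser:
    """Simplified stand-in for vLLM's base reasoning parser."""

    def __init__(self, end_token_id: int):
        # The base class defines end_token_id. Subclasses such as
        # DeepSeekR1ReasoningParser never create think_end_token_id,
        # so referencing self.think_end_token_id would raise
        # AttributeError on the first call.
        self.end_token_id = end_token_id

    def is_reasoning_end(self, token_ids: list[int]) -> bool:
        # Scan backward using self.end_token_id, the attribute that
        # exists on every subclass.
        for token_id in reversed(token_ids):
            if token_id == self.end_token_id:
                return True
        return False
```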


@mergify mergify bot removed the needs-rebase label Oct 9, 2025
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang chaunceyjiang requested a review from njhill October 9, 2025 02:40
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Collaborator Author

/cc @njhill Due to the various ways is_reasoning_end is used, it is not possible to perform incremental checks for the end token. Currently, the only feasible optimization is to search backward from the end of the sequence.

@DarkLight1337
Member

Can you edit the PR description accordingly?

@chaunceyjiang
Collaborator Author

@DarkLight1337 I've updated the PR description.

This optimization provides only a very small performance improvement. 😂

@DarkLight1337 DarkLight1337 merged commit be06786 into vllm-project:main Oct 11, 2025
46 checks passed
@chaunceyjiang chaunceyjiang deleted the is_reasoning_end branch October 11, 2025 03:55
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…ct#25735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
