
[Core] Avoid using extra thread in UniProcExecutor#40891

Merged
zhuohan123 merged 1 commit into vllm-project:main from njhill:uniproc-thread
May 7, 2026

Conversation

@njhill
Member

@njhill njhill commented Apr 26, 2026

@zhuohan123's idea

Clear performance benefit, especially in the low-latency / high-concurrency case.

Benchmark

Run with `--tensor-parallel-size 1 --distributed-executor-backend uni` on a single NVIDIA GB200 GPU.
Model: Qwen/Qwen3-0.6B. Each side is the mean ± population std across 3 timed runs sharing one server process; each run uses its own seed (1, 2, 3) and is preceded by a fresh warmup batch.
Δ = relative change of the with-mean vs. the without-mean (✓ = improvement, ✗ = regression).
256 input / 2048 output tokens; ignore-eos, no prefix caching.

| Metric | Concurrency 1 (32 prompts), without | With | Δ | Concurrency 512 (1024 prompts), without | With | Δ |
|---|---|---|---|---|---|---|
| Output throughput (tok/s) | 631.21 ±0.64 | 648.73 ±14.60 | +2.78% ✓ | 17221.96 ±168.39 | 18974.27 ±176.49 | +10.17% ✓ |
| P50 TTFT (ms) | 19.00 ±0.16 | 19.21 ±0.19 | +1.10% ✗ | 910.12 ±36.45 | 851.30 ±69.16 | -6.46% ✓ |
| P90 TTFT (ms) | 19.48 ±0.27 | 19.88 ±0.32 | +2.05% ✗ | 1309.86 ±49.79 | 1113.85 ±85.18 | -14.96% ✓ |
| P50 TPOT (ms) | 1.572 ±0.001 | 1.530 ±0.032 | -2.66% ✓ | 28.946 ±0.199 | 26.383 ±0.352 | -8.85% ✓ |
| P90 TPOT (ms) | 1.588 ±0.005 | 1.558 ±0.033 | -1.88% ✓ | 29.512 ±0.214 | 26.935 ±0.070 | -8.73% ✓ |
| Mean ITL (ms) | 1.576 ±0.002 | 1.534 ±0.035 | -2.68% ✓ | 29.112 ±0.099 | 26.563 ±0.243 | -8.75% ✓ |
| Mean E2EL (ms) | 3244.52 ±3.29 | 3158.44 ±71.05 | -2.65% ✓ | 60292.61 ±232.44 | 55010.05 ±510.73 | -8.76% ✓ |
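For reference, the table's summary statistics can be reproduced with a short helper. This is a sketch of the stated methodology (mean ± population std, Δ as relative change); the per-run samples are not published here, so the check below reuses the table's own means:

```python
import statistics

def summarize(samples):
    """Mean ± population std, as reported for each table cell."""
    return statistics.mean(samples), statistics.pstdev(samples)

def delta(without_mean, with_mean):
    """Relative change of the with-mean vs. the without-mean, in percent.

    Positive is an improvement for throughput; negative is an
    improvement for latency metrics.
    """
    return 100.0 * (with_mean - without_mean) / without_mean

# Throughput row, concurrency 1: matches the table's +2.78%.
print(f"{delta(631.21, 648.73):+.2f}%")  # → +2.78%
```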


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label Apr 26, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request replaces the ThreadPoolExecutor in UniProcExecutor with a custom AsyncOutputFuture class to manage asynchronous model outputs. Review feedback identifies a thread-safety issue in the AsyncOutputFuture.result method that could cause multiple calls to get_output(), violating its contract. Suggestions include implementing a double-checked locking pattern, using NotImplementedError for the timeout parameter, and adding the necessary threading import.

Comment thread vllm/v1/executor/uniproc_executor.py
Comment thread vllm/v1/executor/uniproc_executor.py
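The double-checked locking pattern suggested in the review could look roughly like this. This is a minimal sketch, not the merged code: the `AsyncOutputFuture` name comes from the PR, but the `get_output` wiring and all other details are assumed for illustration.

```python
import threading


class AsyncOutputFuture:
    """Future-like wrapper that lazily pulls a single model output.

    Hypothetical sketch: ensures the wrapped get_output() callable
    runs at most once, even with concurrent result() callers.
    """

    _SENTINEL = object()

    def __init__(self, get_output):
        self._get_output = get_output  # contract: call at most once
        self._result = self._SENTINEL
        self._lock = threading.Lock()

    def result(self, timeout=None):
        if timeout is not None:
            # Per the review suggestion: timeout is unsupported.
            raise NotImplementedError("timeout is not supported")
        # Double-checked locking: lock-free fast path first...
        if self._result is self._SENTINEL:
            with self._lock:
                # ...then re-check under the lock so get_output()
                # is never invoked twice by racing threads.
                if self._result is self._SENTINEL:
                    self._result = self._get_output()
        return self._result
```

The second check under the lock is what closes the race the review flagged: two threads can both pass the first check, but only the first one into the critical section calls `get_output()`; the other sees the cached result on re-check.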
@njhill njhill force-pushed the uniproc-thread branch from a24dd60 to 364f54a Compare May 6, 2026 19:52
@zhuohan123's idea

Signed-off-by: Nick Hill <nickhill123@gmail.com>
@njhill njhill force-pushed the uniproc-thread branch from 364f54a to 5b3d457 Compare May 7, 2026 02:22
@njhill njhill added ready ONLY add when PR is ready to merge/full CI is needed labels May 7, 2026
@njhill njhill requested review from zhuohan123 May 7, 2026 02:25
@zhuohan123 zhuohan123 merged commit 10ebb40 into vllm-project:main May 7, 2026
49 of 52 checks passed
whytem pushed a commit to whytem/vllm that referenced this pull request May 8, 2026

Signed-off-by: Nick Hill <nickhill123@gmail.com>
libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026

Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


2 participants