[Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints. #39763

Merged
DarkLight1337 merged 5 commits into vllm-project:main from noooop:pooling_entrypoints_using_thread_pool
Apr 14, 2026

@noooop
Collaborator

@noooop noooop commented Apr 14, 2026

Purpose

Following #34789

Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints.

Also renames _preprocess_completion_online -> _preprocess_cmpl_online.
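The offloading described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not vLLM's code: `offload`, `blocking_tokenize`, and `ticker` are hypothetical names, and `time.sleep` stands in for real tokenization work.

```python
import asyncio
import functools
import time
from concurrent.futures import ThreadPoolExecutor


def offload(func, executor):
    """Hypothetical helper: wrap a blocking function so it runs in a thread pool."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        call = functools.partial(func, *args, **kwargs)
        # The event loop stays free while the worker thread runs `call`.
        return await loop.run_in_executor(executor, call)

    return wrapper


def blocking_tokenize(text):
    # Stand-in for a blocking preprocessing step (assumption: ~50 ms of work).
    time.sleep(0.05)
    return text.split()


async def ticker(ticks):
    # Simulates other requests that need the event loop in the meantime.
    for _ in range(5):
        await asyncio.sleep(0.005)
        ticks.append(time.perf_counter())


async def main():
    ticks = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        tick_task = asyncio.create_task(ticker(ticks))
        tokens = await offload(blocking_tokenize, pool)("hello pooling world")
        await tick_task
    return tokens, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling `blocking_tokenize` directly inside `main` would instead stall `ticker` until the blocking call returned.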

  • Using a thread pool barely adds any overhead.

https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_offline.py
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_online.py

This largely addresses the 2 ms latency regression introduced by the asynchronous tokenizer, found in #27407.

main (sync tokenizer)

online 3.5837650002576993
offline 2.173177999793552

this pr (thread pool)

online 3.5909150001316448
offline 2.1705790004489245
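For readers who want a rough feel for the per-call cost of a thread-pool hop, the following sketch times a trivial function called directly and via `run_in_executor`. It is not the benchmark linked above; `tiny_blocking_op` and `compare` are hypothetical names, and absolute numbers will vary by machine.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def tiny_blocking_op(x):
    # Deliberately trivial, so the measurement is dominated by dispatch cost.
    return x + 1


def mean_latency_direct(n):
    start = time.perf_counter()
    for i in range(n):
        tiny_blocking_op(i)
    return (time.perf_counter() - start) / n


async def mean_latency_offloaded(n, executor):
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    for i in range(n):
        await loop.run_in_executor(executor, tiny_blocking_op, i)
    return (time.perf_counter() - start) / n


async def compare(n=1000):
    """Return (direct, offloaded) mean seconds per call."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        direct = mean_latency_direct(n)
        offloaded = await mean_latency_offloaded(n, executor)
    return direct, offloaded


if __name__ == "__main__":
    direct, offloaded = asyncio.run(compare())
    print(f"direct: {direct * 1e6:.2f} us/call, offloaded: {offloaded * 1e6:.2f} us/call")
```

The difference between the two figures approximates the dispatch overhead per offloaded call, which is the quantity the numbers above suggest is small relative to the request latency.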

  • Under high concurrency:

https://github.com/noooop/snippet/blob/main/benchmarks/embed3/v1_online_high.py

Each cell lists the throughput followed by the second value paired with it in the results (no separate header is given for that value).

| n_clients | Throughput: this pr (thread pool) | Throughput: main (sync) | Throughput: this pr (thread pool) + api * 4 | Throughput: main (sync) + api * 4 | Throughput: main(#27407) | Throughput: #27407 | Throughput: async renderer | Throughput: async renderer + api * 4 |
|---|---|---|---|---|---|---|---|---|
| 1 | 145312.5521 / 3.44 | 149284.4683 / 3.36 | 144803.5375 / 3.46 | 150063.0409 / 3.34 | 88532.0299 / 5.71 | 140679.9683 / 3.57 | 90493.2526 / 5.58 | 90592.7286 / 5.58 |
| 2 | 278668.8227 / 3.6 | 277037.4636 / 3.63 | 276058.5841 / 3.64 | 275645.3112 / 3.65 | 178379.0391 / 5.67 | 286591.4312 / 3.51 | 186279.8288 / 5.43 | 183518.5588 / 5.51 |
| 4 | 379321.3826 / 5.32 | 367733.2157 / 5.49 | 392391.8923 / 5.13 | 397400.8369 / 5.07 | 280711.5651 / 7.22 | 307997.374 / 6.54 | 267043.8983 / 7.59 | 292713.1852 / 6.92 |
| 8 | 485165.0423 / 8.23 | 471394.2763 / 8.5 | 510324.2992 / 7.73 | 518657.7842 / 7.61 | 277196.4544 / 14.13 | 433038.121 / 9.28 | 383520.4782 / 10.51 | 418086.4158 / 9.6 |
| 16 | 563471.4011 / 14.13 | 524841.5681 / 15.09 | 581337.1984 / 13.48 | 585431.7416 / 13.29 | 421348.0377 / 18.57 | 525157.8099 / 15.24 | 525778.7223 / 15.19 | 525521.3027 / 14.98 |
| 32 | 600047.0408 / 26.52 | 572158.3257 / 27.67 | 599751.1129 / 26.21 | 597702.8949 / 26.12 | 464678.7363 / 33.23 | 546190.4314 / 29.63 | 538383.3793 / 29.41 | 580194.0677 / 26.97 |
| 64 | 619535.4827 / 51.3 | 612689.28 / 51.32 | 620079.4054 / 50.29 | 621367.7751 / 49.9 | 518877.8117 / 58.92 | 542580.0634 / 59.84 | 590366.9549 / 53.44 | 621684.6328 / 49.84 |
| 128 | 621470.9339 / 101.82 | 612082.3923 / 103.21 | 616273.673 / 100.16 | 624179.2062 / 98.26 | 562776.0002 / 107.83 | 544647.5945 / 119.16 | 615277.3483 / 102.06 | 610569.8113 / 100.53 |
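To read the table as relative differences, the short calculation below compares the "this pr (thread pool)" throughput with "main (sync)" for a few client counts. The values are copied from the rows above; `relative_change` and the dictionary names are just illustrative.

```python
# Throughput values copied from the table above, keyed by n_clients.
thread_pool = {1: 145312.5521, 8: 485165.0423, 16: 563471.4011, 128: 621470.9339}
main_sync = {1: 149284.4683, 8: 471394.2763, 16: 524841.5681, 128: 612082.3923}


def relative_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


changes = {n: relative_change(thread_pool[n], main_sync[n]) for n in thread_pool}

if __name__ == "__main__":
    for n, pct in changes.items():
        print(f"n_clients={n}: {pct:+.2f}%")
```

On these rows the thread-pool variant is roughly 2.7% behind at 1 client and roughly 2.9%, 7.4%, and 1.5% ahead at 8, 16, and 128 clients respectively.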

Test Plan

Keep CI green.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
@mergify mergify Bot added the frontend label Apr 14, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the pooling entrypoints to offload pre- and post-processing tasks to a thread pool using make_async. Key changes include renaming internal preprocessing methods for consistency, converting response-building methods from asynchronous to synchronous, and updating PoolingServeContext to make pooling_params a required field. A critical bug was identified in the flash_late_interaction method where _preprocessing_async is incorrectly called at the end of the function instead of _postprocessing_async, which would prevent the final response from being built correctly.
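The stage ordering the review refers to can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `ServeContext` and `Handler` are stand-ins, the method names only mirror those mentioned in the review, and `asyncio.to_thread` is used in place of whatever helper the PR actually uses.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServeContext:
    """Hypothetical per-request state passed between stages."""
    prompt: str
    token_ids: list = field(default_factory=list)
    raw_output: list = field(default_factory=list)
    response: Optional[dict] = None


class Handler:
    # Synchronous stages: plain functions that are safe to run in a worker thread.
    def _preprocess(self, ctx):
        ctx.token_ids = [len(word) for word in ctx.prompt.split()]

    def _build_response(self, ctx):
        ctx.response = {"data": ctx.raw_output, "num_tokens": len(ctx.token_ids)}

    # Async wrappers that offload the synchronous stages.
    async def _preprocessing_async(self, ctx):
        await asyncio.to_thread(self._preprocess, ctx)

    async def _postprocessing_async(self, ctx):
        await asyncio.to_thread(self._build_response, ctx)

    async def _run_engine(self, ctx):
        # Stand-in for the model call.
        ctx.raw_output = [float(t) for t in ctx.token_ids]

    async def __call__(self, ctx):
        await self._preprocessing_async(ctx)
        await self._run_engine(ctx)
        # The last stage must be post-processing. Calling the preprocessing
        # wrapper here instead would leave ctx.response as None.
        await self._postprocessing_async(ctx)
        return ctx.response


if __name__ == "__main__":
    print(asyncio.run(Handler()(ServeContext("a bb ccc"))))
```

Because the response builder is a plain synchronous method, it can be handed to a worker thread without any changes, which appears to be the motivation for converting those methods from async to sync.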

Comment thread vllm/entrypoints/pooling/scoring/serving.py Outdated
noooop and others added 4 commits April 14, 2026 11:35
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
@noooop noooop marked this pull request as ready for review April 14, 2026 05:05
@noooop noooop requested a review from DarkLight1337 April 14, 2026 05:05
@noooop noooop changed the title from "[Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling models." to "[Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints." Apr 14, 2026
@DarkLight1337
Member

How does this compare to calling the async renderer methods?

@noooop
Collaborator Author

noooop commented Apr 14, 2026

async renderer

524a400

The async renderer still adds the same 2 ms delay as before (see #27407).

@noooop
Collaborator Author

noooop commented Apr 14, 2026

More specifically, after I changed tokenize_prompts_async to tokenize_prompts, the problem disappeared.

Member

@DarkLight1337 DarkLight1337 left a comment

Thanks, LGTM then

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 14, 2026 06:57
@github-actions github-actions Bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Apr 14, 2026
@DarkLight1337 DarkLight1337 merged commit c0ecaed into vllm-project:main Apr 14, 2026
52 checks passed
@noooop noooop deleted the pooling_entrypoints_using_thread_pool branch April 14, 2026 08:29
zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
…ead pool for pooling entrypoints. (vllm-project#39763)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
…ead pool for pooling entrypoints. (vllm-project#39763)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…ead pool for pooling entrypoints. (vllm-project#39763)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>