[Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints. by noooop · Pull Request #39763 · vllm-project/vllm

noooop · 2026-04-14T03:28:36Z

Purpose

Follow-up to #34789.

Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints.
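The offloading pattern can be sketched roughly as follows. This is a minimal, self-contained illustration using only the standard library, not the code in this PR: `offload_to_thread_pool` and `blocking_tokenize` are hypothetical names, and the event loop's default executor stands in for whatever pool the entrypoints actually use.

```python
import asyncio
import functools
import threading
import time
from typing import Any, Awaitable, Callable


def offload_to_thread_pool(func: Callable[..., Any]) -> Callable[..., Awaitable[Any]]:
    """Wrap a blocking callable so that awaiting it runs it in a worker thread.

    Hypothetical helper for illustration only.
    """

    @functools.wraps(func)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        loop = asyncio.get_running_loop()
        # run_in_executor forwards positional args only, so bind kwargs first.
        call = functools.partial(func, *args, **kwargs)
        # Passing None selects the loop's default ThreadPoolExecutor.
        return await loop.run_in_executor(None, call)

    return wrapper


def blocking_tokenize(text: str) -> dict:
    """Stand-in for a blocking preprocessing step (hypothetical)."""
    time.sleep(0.01)  # simulate work that would otherwise stall the event loop
    return {"tokens": text.split(), "thread": threading.current_thread().name}


async def demo() -> dict:
    tokenize_async = offload_to_thread_pool(blocking_tokenize)
    return await tokenize_async("offload blocking ops")


result = asyncio.run(demo())
print(result["tokens"], "ran on", result["thread"])
```

The point of the wrapper is that the event loop stays free to serve other requests while the blocking call runs on a worker thread.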

Also rename _preprocess_completion_online -> _preprocess_cmpl_online.

Using a thread pool adds almost no overhead. Benchmark scripts:

https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_offline.py
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_online.py

This largely addresses the 2 ms latency regression introduced by the asynchronous tokenizer, found in #27407.

main (sync tokenizer)

online 3.5837650002576993
offline 2.173177999793552

this pr (thread pool)

online 3.5909150001316448
offline 2.1705790004489245
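To put the four figures above in proportion, the relative change can be computed directly. This is only a small arithmetic sketch: the numbers are copied as reported (their unit is not stated here), and `relative_change` is a helper name introduced for the illustration.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Return the change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Figures copied from the results above.
main_sync = {"online": 3.5837650002576993, "offline": 2.173177999793552}
thread_pool = {"online": 3.5909150001316448, "offline": 2.1705790004489245}

changes = {
    mode: relative_change(main_sync[mode], thread_pool[mode])
    for mode in main_sync
}

for mode, pct in changes.items():
    print(f"{mode}: {pct:+.2f}%")
```

Both differences come out at roughly 0.2% or less in magnitude, which is consistent with the claim that the thread pool adds almost no overhead.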

Under high concurrency:

https://github.com/noooop/snippet/blob/main/benchmarks/embed3/v1_online_high.py

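n_clients	this pr (thread pool) throughput	this pr (thread pool) latency	main (sync) throughput	main (sync) latency	this pr (thread pool) + api * 4 throughput	this pr (thread pool) + api * 4 latency	main (sync) + api * 4 throughput	main (sync) + api * 4 latency	main(#27407) throughput	main(#27407) latency	#27407 throughput	#27407 latency	async renderer throughput	async renderer latency	async renderer + api * 4 throughput	async renderer + api * 4 latency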
1	145312.5521	3.44	149284.4683	3.36	144803.5375	3.46	150063.0409	3.34	88532.0299	5.71	140679.9683	3.57	90493.2526	5.58	90592.7286	5.58
2	278668.8227	3.6	277037.4636	3.63	276058.5841	3.64	275645.3112	3.65	178379.0391	5.67	286591.4312	3.51	186279.8288	5.43	183518.5588	5.51
4	379321.3826	5.32	367733.2157	5.49	392391.8923	5.13	397400.8369	5.07	280711.5651	7.22	307997.374	6.54	267043.8983	7.59	292713.1852	6.92
8	485165.0423	8.23	471394.2763	8.5	510324.2992	7.73	518657.7842	7.61	277196.4544	14.13	433038.121	9.28	383520.4782	10.51	418086.4158	9.6
16	563471.4011	14.13	524841.5681	15.09	581337.1984	13.48	585431.7416	13.29	421348.0377	18.57	525157.8099	15.24	525778.7223	15.19	525521.3027	14.98
32	600047.0408	26.52	572158.3257	27.67	599751.1129	26.21	597702.8949	26.12	464678.7363	33.23	546190.4314	29.63	538383.3793	29.41	580194.0677	26.97
64	619535.4827	51.3	612689.28	51.32	620079.4054	50.29	621367.7751	49.9	518877.8117	58.92	542580.0634	59.84	590366.9549	53.44	621684.6328	49.84
128	621470.9339	101.82	612082.3923	103.21	616273.673	100.16	624179.2062	98.26	562776.0002	107.83	544647.5945	119.16	615277.3483	102.06	610569.8113	100.53
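One way to read the table is to compare the throughput of two configurations at each concurrency level. The sketch below does this for "this pr (thread pool)" against "main(#27407)", taking the first value of each pair in the rows above as throughput; the dictionary names are made up for the illustration, and the choice of these two columns is just an example.

```python
# Throughput values copied from the table, keyed by n_clients.
this_pr = {
    1: 145312.5521, 2: 278668.8227, 4: 379321.3826, 8: 485165.0423,
    16: 563471.4011, 32: 600047.0408, 64: 619535.4827, 128: 621470.9339,
}
main_27407 = {
    1: 88532.0299, 2: 178379.0391, 4: 280711.5651, 8: 277196.4544,
    16: 421348.0377, 32: 464678.7363, 64: 518877.8117, 128: 562776.0002,
}

# Ratio > 1 means the thread-pool variant has the higher throughput.
ratios = {n: this_pr[n] / main_27407[n] for n in this_pr}

for n, ratio in ratios.items():
    print(f"n_clients={n:>3}: {ratio:.2f}x")
```

For these two columns the ratio stays above 1 at every concurrency level and narrows as the number of clients grows.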

Test Plan

Keep CI green.

Test Result


Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

gemini-code-assist

Code Review

This pull request refactors the pooling entrypoints to offload pre- and post-processing tasks to a thread pool using make_async. Key changes include renaming internal preprocessing methods for consistency, converting response-building methods from asynchronous to synchronous, and updating PoolingServeContext to make pooling_params a required field. A critical bug was identified in the flash_late_interaction method where _preprocessing_async is incorrectly called at the end of the function instead of _postprocessing_async, which would prevent the final response from being built correctly.
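The shape of the flow described in this review can be illustrated with a small sketch. Everything here is hypothetical (the `Ctx` fields, `preprocess`, `run_model`, `build_response`, and `handle` are invented for the example and are not the names in vllm/entrypoints/pooling); it only shows synchronous pre- and post-processing steps being offloaded to worker threads, with the final step being the postprocessing one that builds the response.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ctx:
    """Hypothetical request context; field names are illustrative only."""
    prompt: str
    token_ids: list = field(default_factory=list)
    raw_output: list = field(default_factory=list)
    response: Optional[dict] = None


def preprocess(ctx: Ctx) -> None:
    # Synchronous, potentially blocking step (stand-in for tokenization).
    ctx.token_ids = [len(word) for word in ctx.prompt.split()]


def build_response(ctx: Ctx) -> None:
    # Synchronous response building, run after the model has produced output.
    ctx.response = {"data": ctx.raw_output, "num_tokens": len(ctx.token_ids)}


async def run_model(ctx: Ctx) -> None:
    # Stand-in for the asynchronous engine call.
    await asyncio.sleep(0)
    ctx.raw_output = [float(t) for t in ctx.token_ids]


async def handle(prompt: str) -> Ctx:
    ctx = Ctx(prompt)
    await asyncio.to_thread(preprocess, ctx)      # offloaded preprocessing
    await run_model(ctx)
    # The last step must be the postprocessing one; calling the preprocessing
    # step again here would leave ctx.response unset.
    await asyncio.to_thread(build_response, ctx)  # offloaded postprocessing
    return ctx


ctx = asyncio.run(handle("a bb ccc"))
print(ctx.response)
```

In this sketch, swapping the last offloaded call for `preprocess` would leave `ctx.response` as `None`, which is the kind of failure the review flags.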

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>

DarkLight1337 · 2026-04-14T05:12:33Z

How does this compare to calling the async renderer methods?

noooop · 2026-04-14T06:39:17Z

async renderer

524a400

The async renderer still adds the same 2 ms delay as before (see #27407).

noooop · 2026-04-14T06:45:30Z

More specifically, after I changed tokenize_prompts_async to tokenize_prompts, the problem disappeared.

DarkLight1337

Thanks, LGTM then

…ead pool for pooling entrypoints. (vllm-project#39763) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: zengxian <xiangdong.zeng@intel.com>

…ead pool for pooling entrypoints. (vllm-project#39763) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

init

8ec0672

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

mergify Bot added the frontend label Apr 14, 2026

gemini-code-assist Bot reviewed Apr 14, 2026

Comment thread vllm/entrypoints/pooling/scoring/serving.py Outdated

noooop and others added 4 commits April 14, 2026 11:35

Update vllm/entrypoints/pooling/scoring/serving.py

dd1f62c

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>

Merge branch 'main' into pooling_entrypoints_using_thread_pool

aa843a0

fix tests

1b9d826

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

Merge branch 'main' into pooling_entrypoints_using_thread_pool

9e797d5

noooop marked this pull request as ready for review April 14, 2026 05:05

noooop requested a review from DarkLight1337 April 14, 2026 05:05

noooop changed the title ~~[Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling models.~~ [Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints. Apr 14, 2026

DarkLight1337 approved these changes Apr 14, 2026

DarkLight1337 enabled auto-merge (squash) April 14, 2026 06:57

github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 14, 2026

noooop mentioned this pull request Apr 14, 2026

[Bugfix] Use async preprocessing in pooling/embedding endpoints #38653

Closed

4 tasks

DarkLight1337 merged commit c0ecaed into vllm-project:main Apr 14, 2026
52 checks passed

noooop deleted the pooling_entrypoints_using_thread_pool branch April 14, 2026 08:29

noooop mentioned this pull request Apr 16, 2026

feat(pooling): Add dedicated async preprocessing support to PluginWithIOProcessorPlugins #40030

Closed

2 tasks

noooop mentioned this pull request Apr 25, 2026

[RFC]: Rust front-end #40846

Open

DarkLight1337 merged 5 commits into vllm-project:main from noooop:pooling_entrypoints_using_thread_pool
