
[Frontend] Speed up online server preprocess by using sync tokenizer.#27407

Closed
noooop wants to merge 7 commits into vllm-project:main from noooop:server_overhead

Conversation


@noooop (Collaborator) commented Oct 23, 2025

TL;DR

  • Using a sync tokenizer is faster than using an async_tokenizer.
  • Set --api-server-count=4 as the default.

Purpose

Profiling with manual timing prints shows that preprocessing (_preprocess) has significant overhead.

9800X3D + 4090
start:  4309.600659041
(APIServer pid=5341) create_embedding 4309.601211381   + 0.5523ms
(APIServer pid=5341) _check_model 4309.601238849 + 0.027ms
(APIServer pid=5341) _validate_request 4309.601243159 + 0.004ms
(APIServer pid=5341) _preprocess 4309.60345763 + 2.2144ms
(APIServer pid=5341) _prepare_generators 4309.603477969 + 0.0203ms
(APIServer pid=5341) _collect_batch 4309.605789944 + 2.3119ms
(APIServer pid=5341) _build_response 4309.605852611 + 0.0626ms
end:  4309.606079509 + 0.2268ms
total + 5.4204ms

INTEL(R) XEON(R) PLATINUM 8575C + H20
start:  1218256.633603412
(APIServer pid=4094607) create_embedding 1218256.634584434 + 0.9810ms
(APIServer pid=4094607) _check_model 1218256.634622441 + 0.0380ms
(APIServer pid=4094607) _validate_request 1218256.634629259 + 0.0068ms
(APIServer pid=4094607) _preprocess 1218256.637000155 + 2.3708ms
(APIServer pid=4094607) _prepare_generators 1218256.637029284 + 0.0291ms
(APIServer pid=4094607) _collect_batch 1218256.641474206 + 4.4449ms
(APIServer pid=4094607) _build_response 1218256.641554558 + 0.0803ms
end:  1218256.641903234 + 0.3486 ms
total + 8.2998 ms
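
The per-stage timestamps above can be collected with a tiny helper like the following (a minimal sketch for illustration; not the actual instrumentation used for these numbers):

```python
# Minimal sketch of manual stage timing (illustrative only; not the PR's actual code).
import time


class StageTimer:
    def __init__(self) -> None:
        self.start = time.perf_counter()
        self.last = self.start

    def mark(self, name: str) -> None:
        now = time.perf_counter()
        print(f"{name} {now:.9f} + {(now - self.last) * 1e3:.4f}ms")
        self.last = now

    def total(self) -> None:
        print(f"total + {(time.perf_counter() - self.start) * 1e3:.4f}ms")


# Usage inside a request handler, e.g.:
#   timer = StageTimer()
#   _check_model(request);       timer.mark("_check_model")
#   _validate_request(request);  timer.mark("_validate_request")
#   _preprocess(request);        timer.mark("_preprocess")
#   ...
#   timer.total()
```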

Online _preprocess is too slow:

  1. Using a sync tokenizer is faster than using an async_tokenizer.

https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_offline.py
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_online.py

main

  • offline 2.2480 ms
  • online 5.6047 ms

this pr

  • offline 2.1574 ms
  • online 3.5696 ms

Let's reduce this overhead.
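
The core of the change is to call the tokenizer directly instead of awaiting an async wrapper. A rough sketch of the two styles, assuming the async path dispatches the blocking tokenizer call to a thread-pool executor (a simplification; vLLM's actual async tokenizer is more involved, and the model name is just an example):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

# Example model; any fast (Rust-backed) tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
executor = ThreadPoolExecutor(max_workers=1)


async def tokenize_async(text: str):
    # "Async tokenizer" style: every request hops to a worker thread and back.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, tokenizer, text)


def tokenize_sync(text: str):
    # Sync style: tokenize directly on the handler's thread; for short inputs
    # the tokenization itself is far cheaper than the thread hop + scheduling.
    return tokenizer(text)
```

For short embedding inputs, the event-loop scheduling and thread hop can dominate the tokenization itself, which is consistent with the gap measured above.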

  2. Under high concurrency:
    https://github.com/noooop/snippet/blob/main/benchmarks/embed3/v1_online_high.py
(c = --api-server-count; each cell: throughput followed by latency in ms)

| n_clients | this PR, c=1 | this PR, c=2 | this PR, c=4 | main, c=1 | main, c=2 | main, c=4 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 140679.9683 / 3.57 | 141669.6149 / 3.54 | 142370.7607 / 3.52 | 88532.0299 / 5.71 | 88525.5506 / 5.70 | 88874.9545 / 5.68 |
| 2 | 286591.4312 / 3.51 | 285781.1775 / 3.52 | 285443.7082 / 3.52 | 178379.0391 / 5.67 | 179524.1003 / 5.63 | 179190.3731 / 5.64 |
| 4 | 307997.3740 / 6.54 | 413911.9182 / 4.87 | 415262.3392 / 4.85 | 280711.5651 / 7.22 | 297491.0479 / 6.81 | 299423.5705 / 6.77 |
| 8 | 433038.1210 / 9.28 | 479260.7028 / 8.31 | 531027.9076 / 7.42 | 277196.4544 / 14.13 | 419518.8780 / 9.57 | 425106.8568 / 9.44 |
| 16 | 525157.8099 / 15.24 | 535250.7142 / 14.71 | 564387.5684 / 13.98 | 421348.0377 / 18.57 | 515229.7849 / 15.33 | 535589.0894 / 14.77 |
| 32 | 546190.4314 / 29.63 | 584699.5098 / 26.73 | 611143.6974 / 25.46 | 464678.7363 / 33.23 | 562605.6325 / 27.74 | 604653.0701 / 26.04 |
| 64 | 542580.0634 / 59.84 | 634370.7292 / 48.58 | 652712.9588 / 47.61 | 518877.8117 / 58.92 | 639422.5549 / 48.06 | 640084.9696 / 48.44 |
| 128 | 544647.5945 / 119.16 | 636870.1266 / 95.65 | 648662.1470 / 93.92 | 562776.0002 / 107.83 | 645368.9981 / 94.58 | 641442.9277 / 95.10 |

This requires setting --api-server-count=4 as the default; otherwise the frontend becomes a CPU bottleneck and throughput drops.
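
For reference, the high-concurrency numbers above come from the linked v1_online_high.py script; a minimal sketch of this kind of concurrent-client load test against the OpenAI-compatible embeddings endpoint (server URL, model name, and input are placeholder assumptions, not the benchmark's actual settings):

```python
# Minimal concurrent-client load test sketch (illustrative; not the linked benchmark script).
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/embeddings"
PAYLOAD = {"model": "BAAI/bge-base-en-v1.5", "input": "hello world"}


async def client(session: aiohttp.ClientSession, n_requests: int) -> None:
    for _ in range(n_requests):
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.json()


async def main(n_clients: int = 8, n_requests: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*(client(session, n_requests) for _ in range(n_clients)))
        elapsed = time.perf_counter() - start
        total = n_clients * n_requests
        print(f"{n_clients} clients: {total / elapsed:.1f} req/s, "
              f"{elapsed / total * 1e3:.2f} ms/req average")


if __name__ == "__main__":
    asyncio.run(main())
```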

cc @DarkLight1337

Test Plan

Keep CI green.

Test Result

Keep CI green.



@mergify Bot added the frontend label Oct 23, 2025
@DarkLight1337 (Member)

How much does the overhead increase with concurrency?

Review comment thread on vllm/entrypoints/renderer.py (outdated)
@noooop changed the title from "[Test] Test embed server overhead" to "[Frontend] Speed up online server preprocess." Oct 29, 2025
@noooop changed the title from "[Frontend] Speed up online server preprocess." to "[Frontend] Speed up online server preprocess by using sync tokenizer." Oct 29, 2025
@noooop marked this pull request as ready for review October 29, 2025 10:32

noooop commented Oct 29, 2025

cc @DarkLight1337

Ready to review

@DarkLight1337 (Member)

cc @njhill

@DarkLight1337 (Member)

The reason for this is #25301 (comment)

@chatgpt-codex-connector Bot left a comment

💡 Codex Review: Here are some automated review suggestions for this pull request.

Review comment thread on vllm/entrypoints/renderer.py

noooop commented Oct 29, 2025

> The reason for this is #25301 (comment)

api-server-count works like a tokenizer process pool, but with lower overhead.


[Figure: latency vs. throughput for this PR and main at different api-server-count values]

X-axis: throughput (tokens/s)
Y-axis: latency, time needed for one step (ms), logarithmic scale
Curves toward the lower right are better ↘

api-server-count=1 (this PR, c=1) does cause a performance bottleneck; increasing api-server-count significantly improves performance.

Compared to main, this PR lowers end-to-end latency by at least 2 ms, which suggests the async tokenizer adds roughly 2 ms of overhead. That overhead is too large for embedding tasks that use small models.
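
To put the ~2 ms figure in perspective, the per-call dispatch cost can be measured in isolation with a micro-benchmark like this (a sketch with a stand-in tokenize function; absolute numbers vary by machine and do not reproduce the measurements above):

```python
# Micro-benchmark sketch: direct call vs. per-call thread-pool dispatch (illustrative only).
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_tokenize(text: str) -> list[int]:
    # Stand-in for a fast tokenizer call on a short input.
    return [ord(c) for c in text]


async def bench(n: int = 1000) -> None:
    executor = ThreadPoolExecutor(max_workers=1)
    loop = asyncio.get_running_loop()

    t0 = time.perf_counter()
    for _ in range(n):
        fake_tokenize("hello world")
    sync_ms = (time.perf_counter() - t0) / n * 1e3

    t0 = time.perf_counter()
    for _ in range(n):
        await loop.run_in_executor(executor, fake_tokenize, "hello world")
    async_ms = (time.perf_counter() - t0) / n * 1e3

    print(f"direct call:  {sync_ms:.4f} ms/call")
    print(f"executor hop: {async_ms:.4f} ms/call")


asyncio.run(bench())
```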


noooop commented Oct 30, 2025

[Figure: throughput/latency comparison including sentence_transformers]

More results

With a sync tokenizer (this PR), online serving with api-server-count=4 now exceeds sentence_transformers almost everywhere across the tested range (offline sentence_transformers is a strong baseline).

The final throughput matches offline throughput, or is even slightly higher, since offline preprocessing runs in the main thread while online preprocessing runs in the API server workers, effectively parallelizing preprocessing.

There is still a latency gap of around 2 ms between vLLM offline and vLLM online; further optimization can narrow it, though it may be hard to eliminate completely.

@noooop mentioned this pull request Jan 4, 2026
@github-actions (Bot)

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!


Labels

frontend, stale (over 90 days of inactivity)