
[Frontend] Speed up online server preprocess by using sync tokenizer.#27407

Closed
noooop wants to merge 7 commits into vllm-project:main from noooop:server_overhead

Conversation


@noooop (Collaborator) commented Oct 23, 2025

TL;DR

  • Using a sync tokenizer is faster than using an async_tokenizer.
  • Set --api-server-count=4 as the default.

Purpose

Profiling with manual timing prints shows that preprocessing (_preprocess) has significant overhead.

9800X3D + 4090
start:  4309.600659041
(APIServer pid=5341) create_embedding 4309.601211381   + 0.5523ms
(APIServer pid=5341) _check_model 4309.601238849 + 0.027ms
(APIServer pid=5341) _validate_request 4309.601243159 + 0.004ms
(APIServer pid=5341) _preprocess 4309.60345763 + 2.2144ms
(APIServer pid=5341) _prepare_generators 4309.603477969 + 0.0203ms
(APIServer pid=5341) _collect_batch 4309.605789944 + 2.3119ms
(APIServer pid=5341) _build_response 4309.605852611 + 0.0626ms
end:  4309.606079509 + 0.2268ms
total + 5.4204ms

INTEL(R) XEON(R) PLATINUM 8575C + H20
start:  1218256.633603412
(APIServer pid=4094607) create_embedding 1218256.634584434 + 0.9810ms
(APIServer pid=4094607) _check_model 1218256.634622441 + 0.0380ms
(APIServer pid=4094607) _validate_request 1218256.634629259 + 0.0068ms
(APIServer pid=4094607) _preprocess 1218256.637000155 + 2.3708ms
(APIServer pid=4094607) _prepare_generators 1218256.637029284 + 0.0291ms
(APIServer pid=4094607) _collect_batch 1218256.641474206 + 4.4449ms
(APIServer pid=4094607) _build_response 1218256.641554558 + 0.0803ms
end:  1218256.641903234 + 0.3486 ms
total + 8.2998 ms
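
The per-stage timestamps above can be collected with a tiny helper like the following (a minimal sketch for illustration; not the actual instrumentation used for these numbers):

```python
# Minimal sketch of manual stage timing (illustrative only; not the PR's actual code).
import time


class StageTimer:
    def __init__(self) -> None:
        self.start = time.perf_counter()
        self.last = self.start

    def mark(self, name: str) -> None:
        now = time.perf_counter()
        print(f"{name} {now:.9f} + {(now - self.last) * 1e3:.4f}ms")
        self.last = now

    def total(self) -> None:
        print(f"total + {(time.perf_counter() - self.start) * 1e3:.4f}ms")


# Usage inside a request handler, e.g.:
#   timer = StageTimer()
#   _check_model(request);       timer.mark("_check_model")
#   _validate_request(request);  timer.mark("_validate_request")
#   _preprocess(request);        timer.mark("_preprocess")
#   ...
#   timer.total()
```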

Online _preprocess is too slow:

  1. Using a sync tokenizer is faster than using an async_tokenizer.

https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_offline.py
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_online.py

main

  • offline 2.2480 ms
  • online 5.6047 ms

this pr

  • offline 2.1574 ms
  • online 3.5696 ms

Let's reduce this overhead.
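
The core of the change is to call the tokenizer directly instead of awaiting an async wrapper. A rough sketch of the two styles, assuming the async path dispatches the blocking tokenizer call to a thread-pool executor (a simplification; vLLM's actual async tokenizer is more involved, and the model name is just an example):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

# Example model; any fast (Rust-backed) tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
executor = ThreadPoolExecutor(max_workers=1)


async def tokenize_async(text: str):
    # "Async tokenizer" style: every request hops to a worker thread and back.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, tokenizer, text)


def tokenize_sync(text: str):
    # Sync style: tokenize directly on the handler's thread; for short inputs
    # the tokenization itself is far cheaper than the thread hop + scheduling.
    return tokenizer(text)
```

For short embedding inputs, the event-loop scheduling and thread hop can dominate the tokenization itself, which is consistent with the gap measured above.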

  2. Under high concurrency:
    https://github.com/noooop/snippet/blob/main/benchmarks/embed3/v1_online_high.py
(c = --api-server-count; each cell: throughput followed by latency in ms)

| n_clients | this PR, c=1 | this PR, c=2 | this PR, c=4 | main, c=1 | main, c=2 | main, c=4 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 140679.9683 / 3.57 | 141669.6149 / 3.54 | 142370.7607 / 3.52 | 88532.0299 / 5.71 | 88525.5506 / 5.70 | 88874.9545 / 5.68 |
| 2 | 286591.4312 / 3.51 | 285781.1775 / 3.52 | 285443.7082 / 3.52 | 178379.0391 / 5.67 | 179524.1003 / 5.63 | 179190.3731 / 5.64 |
| 4 | 307997.3740 / 6.54 | 413911.9182 / 4.87 | 415262.3392 / 4.85 | 280711.5651 / 7.22 | 297491.0479 / 6.81 | 299423.5705 / 6.77 |
| 8 | 433038.1210 / 9.28 | 479260.7028 / 8.31 | 531027.9076 / 7.42 | 277196.4544 / 14.13 | 419518.8780 / 9.57 | 425106.8568 / 9.44 |
| 16 | 525157.8099 / 15.24 | 535250.7142 / 14.71 | 564387.5684 / 13.98 | 421348.0377 / 18.57 | 515229.7849 / 15.33 | 535589.0894 / 14.77 |
| 32 | 546190.4314 / 29.63 | 584699.5098 / 26.73 | 611143.6974 / 25.46 | 464678.7363 / 33.23 | 562605.6325 / 27.74 | 604653.0701 / 26.04 |
| 64 | 542580.0634 / 59.84 | 634370.7292 / 48.58 | 652712.9588 / 47.61 | 518877.8117 / 58.92 | 639422.5549 / 48.06 | 640084.9696 / 48.44 |
| 128 | 544647.5945 / 119.16 | 636870.1266 / 95.65 | 648662.1470 / 93.92 | 562776.0002 / 107.83 | 645368.9981 / 94.58 | 641442.9277 / 95.10 |

This requires setting --api-server-count=4 as the default; otherwise the frontend becomes a CPU bottleneck and throughput drops.
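
For reference, the high-concurrency numbers above come from the linked v1_online_high.py script; a minimal sketch of this kind of concurrent-client load test against the OpenAI-compatible embeddings endpoint (server URL, model name, and input are placeholder assumptions, not the benchmark's actual settings):

```python
# Minimal concurrent-client load test sketch (illustrative; not the linked benchmark script).
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/embeddings"
PAYLOAD = {"model": "BAAI/bge-base-en-v1.5", "input": "hello world"}


async def client(session: aiohttp.ClientSession, n_requests: int) -> None:
    for _ in range(n_requests):
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.json()


async def main(n_clients: int = 8, n_requests: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*(client(session, n_requests) for _ in range(n_clients)))
        elapsed = time.perf_counter() - start
        total = n_clients * n_requests
        print(f"{n_clients} clients: {total / elapsed:.1f} req/s, "
              f"{elapsed / total * 1e3:.2f} ms/req average")


if __name__ == "__main__":
    asyncio.run(main())
```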

cc @DarkLight1337

Test Plan

Keep CI green.

Test Result

Keep CI green.



@mergify Bot added the frontend label Oct 23, 2025
@DarkLight1337 (Member)

How much does the overhead increase with concurrency?

Review comment thread on vllm/entrypoints/renderer.py (outdated)
@noooop changed the title from "[Test] Test embed server overhead" to "[Frontend] Speed up online server preprocess." Oct 29, 2025
@noooop changed the title from "[Frontend] Speed up online server preprocess." to "[Frontend] Speed up online server preprocess by using sync tokenizer." Oct 29, 2025
@noooop marked this pull request as ready for review October 29, 2025 10:32

noooop commented Oct 29, 2025

cc @DarkLight1337

Ready to review

@DarkLight1337 (Member)

cc @njhill

@DarkLight1337 (Member)

The reason for this is #25301 (comment)

@chatgpt-codex-connector Bot left a comment

💡 Codex Review: Here are some automated review suggestions for this pull request.

Review comment thread on vllm/entrypoints/renderer.py

noooop commented Oct 29, 2025

> The reason for this is #25301 (comment)

api-server-count works like a tokenizer process pool, but with lower overhead.


[Figure: latency vs. throughput for this PR and main at different api-server-count values]

X-axis: throughput (tokens/s)
Y-axis: latency, time needed for one step (ms), logarithmic scale
Curves toward the lower right are better ↘

api-server-count=1 (this PR, c=1) does cause a performance bottleneck; increasing api-server-count significantly improves performance.

Compared to main, this PR lowers end-to-end latency by at least 2 ms, which suggests the async tokenizer adds roughly 2 ms of overhead. That overhead is too large for embedding tasks that use small models.
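
To put the ~2 ms figure in perspective, the per-call dispatch cost can be measured in isolation with a micro-benchmark like this (a sketch with a stand-in tokenize function; absolute numbers vary by machine and do not reproduce the measurements above):

```python
# Micro-benchmark sketch: direct call vs. per-call thread-pool dispatch (illustrative only).
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_tokenize(text: str) -> list[int]:
    # Stand-in for a fast tokenizer call on a short input.
    return [ord(c) for c in text]


async def bench(n: int = 1000) -> None:
    executor = ThreadPoolExecutor(max_workers=1)
    loop = asyncio.get_running_loop()

    t0 = time.perf_counter()
    for _ in range(n):
        fake_tokenize("hello world")
    sync_ms = (time.perf_counter() - t0) / n * 1e3

    t0 = time.perf_counter()
    for _ in range(n):
        await loop.run_in_executor(executor, fake_tokenize, "hello world")
    async_ms = (time.perf_counter() - t0) / n * 1e3

    print(f"direct call:  {sync_ms:.4f} ms/call")
    print(f"executor hop: {async_ms:.4f} ms/call")


asyncio.run(bench())
```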


noooop commented Oct 30, 2025

[Figure: throughput/latency comparison including sentence_transformers]

More results

With a sync tokenizer (this PR), online serving with api-server-count=4 now exceeds sentence_transformers almost everywhere across the tested range (offline sentence_transformers is a strong baseline).

The final throughput matches offline throughput, or is even slightly higher, since offline preprocessing runs in the main thread while online preprocessing runs in the API server workers, effectively parallelizing preprocessing.

There is still a latency gap of around 2 ms between vLLM offline and vLLM online; further optimization can narrow it, though it may be hard to eliminate completely.

@noooop mentioned this pull request Jan 4, 2026
@github-actions (Bot)

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!


Labels

frontend, stale (over 90 days of inactivity)