[Frontend] Speed up online server preprocess by using sync tokenizer. #27407
noooop wants to merge 7 commits into vllm-project:main
Conversation
How much does the overhead increase with concurrency?
Ready to review

cc @njhill

The reason for this is #25301 (comment)
API server count works like a tokenizer process pool, but with lower overhead.

X-axis: Throughput (token/s). api-server-count=1 (this PR + c=1) does cause a performance bottleneck, and increasing api-server-count significantly improves performance. Compared to main, this PR's end-to-end latency is at least 2ms lower, which means the async_tokenizer carries roughly 2ms of overhead. That overhead is too large for embedding tasks that use small models.

More results

Now, with a sync tokenizer (this PR), online api-server-count=4 almost always exceeds sentence_transformers (sentence_transformers offline is a high baseline) across the entire runtime range. The final throughput matches offline throughput or is even slightly higher, since offline preprocessing runs in the main thread while online preprocessing runs in the API server threads, effectively using multiple threads for preprocessing. There is still a latency gap of around 2ms between vLLM offline and vLLM online; we can optimize further, although it may be hard to eliminate completely.
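To make the ~2ms figure concrete, here is a minimal sketch (not the vLLM code; the model name and iteration count are placeholders) of how dispatching an otherwise-fast tokenizer call through a thread-pool executor, which is roughly what an async tokenizer wrapper does, adds per-call overhead compared with calling it synchronously:

```python
# Sketch: per-call cost of a direct sync tokenize vs. the same call
# hopped through an executor (as an async tokenizer wrapper would do).
# Model name and iteration count are illustrative placeholders.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
executor = ThreadPoolExecutor(max_workers=1)
TEXT = "a short embedding input " * 8
N = 1000


def bench_sync() -> float:
    start = time.perf_counter()
    for _ in range(N):
        tokenizer(TEXT)
    # total seconds over N=1000 calls == mean milliseconds per call
    return time.perf_counter() - start


async def bench_async() -> float:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    for _ in range(N):
        # Each call pays executor hand-off + event-loop scheduling cost.
        await loop.run_in_executor(executor, tokenizer, TEXT)
    return time.perf_counter() - start


print(f"sync : {bench_sync():.3f} ms/call")
print(f"async: {asyncio.run(bench_async()):.3f} ms/call")
```

The gap between the two printed numbers is the wrapper overhead per request; for small embedding models that overhead can dominate the preprocessing time.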
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!


TL;DR
Purpose
Profiling with manually placed print statements showed that preprocessing (_preprocess) has significant overhead.
The online _preprocess path is too slow.
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_offline.py
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/overhead_online.py
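The linked scripts measure this overhead; the "manual print" approach is essentially a timer around the preprocess step. A minimal sketch of the idea, where `preprocess` is a hypothetical stand-in rather than vLLM's actual _preprocess:

```python
# Sketch of print-based timing around a preprocessing step.
# `preprocess` is a hypothetical stand-in, not vLLM's _preprocess.
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {(time.perf_counter() - start) * 1000:.2f} ms")


def preprocess(prompt: str) -> list[str]:
    return prompt.split()  # stand-in for tokenization etc.


with timed("_preprocess"):
    preprocess("example embedding input " * 8)
```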
(Benchmark output: main vs. this PR.)
Let's reduce this overhead.
https://github.com/noooop/snippet/blob/main/benchmarks/embed3/v1_online_high.py
This requires making --api-server-count=4 the default, to avoid a frontend CPU bottleneck that would otherwise reduce throughput.
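For reference, a rough way to check the online end-to-end latency is to start the server with something like `vllm serve <model> --api-server-count 4` and run a small client probe. The sketch below is illustrative only; the endpoint URL, model name, and request counts are placeholders, not the linked benchmark:

```python
# Sketch of an end-to-end latency probe against the OpenAI-compatible
# embeddings endpoint; URL, model name, and counts are placeholders.
import time

import requests

URL = "http://localhost:8000/v1/embeddings"
payload = {"model": "my-embedding-model", "input": "a short embedding input"}

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload)
    resp.raise_for_status()
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"p50: {latencies_ms[len(latencies_ms) // 2]:.2f} ms")
print(f"p99: {latencies_ms[int(len(latencies_ms) * 0.99)]:.2f} ms")
```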
cc @DarkLight1337
Test Plan
Keep CI green.
Test Result
Keep CI green.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.