[Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop#34789
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the `fastcheck` CI runs, covering a small, essential subset of tests to catch errors quickly; you can ask your reviewers to trigger select CI tests on top of it. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 88dc973 to 787320b
Code Review
This pull request effectively addresses event loop blocking caused by synchronous, CPU-bound operations like tokenization and multimodal preprocessing. The core change, introducing a shared single-threaded ThreadPoolExecutor in BaseRenderer, is a well-justified and clean solution to serialize access to non-thread-safe HuggingFace tokenizers while unblocking the event loop. The refactoring is consistently applied across various renderers and engine components, centralizing the execution of these blocking tasks. The provided test results clearly demonstrate a significant improvement in API endpoint latency under load. The code is of high quality, and I have no issues to report.
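To make the pattern concrete, here is a minimal sketch of what the review describes: a single-worker pool that serializes tokenizer access while keeping the event loop free. The class and method names below are hypothetical stand-ins, not the actual vLLM API.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class RendererSketch:
    """Hypothetical stand-in for BaseRenderer; illustrative only."""

    # max_workers=1 serializes all submitted calls, so a
    # non-thread-safe HF tokenizer is never entered concurrently.
    _executor = ThreadPoolExecutor(max_workers=1)

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    async def tokenize(self, text: str) -> list[int]:
        loop = asyncio.get_running_loop()
        # The blocking encode() runs in the worker thread; the event
        # loop keeps serving /health, /metrics, etc. in the meantime.
        return await loop.run_in_executor(
            self._executor, self.tokenizer.encode, text
        )
```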
Hi @scyyh11, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the hooks will run automatically on `git commit`.
Force-pushed from 87804ae to c66d990
Hi @scyyh11, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the hooks will run automatically on `git commit`.
Benchmark Results

Benchmarked using `vllm bench serve` with `--request-rate 20 --num-prompts 200`.

Text-only (`meta-llama/Llama-3.1-8B-Instruct`)
| Metric | main | this PR | diff |
|---|---|---|---|
| Throughput (req/s) | 13.72 | 14.27 | +4.0% |
| Output tok/s | 1756.47 | 1827.01 | +4.0% |
| Mean TTFT (ms) | 432.02 | 342.51 | -20.7% |
| P99 TTFT (ms) | 972.08 | 669.58 | -31.1% |
| Mean TPOT (ms) | 64.57 | 59.82 | -7.4% |
Multimodal (`Qwen/Qwen2.5-VL-7B-Instruct`)
| Metric | main | this PR | diff |
|---|---|---|---|
| Throughput (req/s) | 4.67 | 4.47 | -4.3% |
| Output tok/s | 597.32 | 572.39 | -4.2% |
| Mean TTFT (ms) | 18355.36 | 18376.02 | +0.1% |
| P99 TTFT (ms) | 29880.78 | 31762.12 | +6.3% |
| Mean TPOT (ms) | 143.60 | 158.13 | +10.1% |
I added conditional offloading in `_process_for_engine_async`: only multimodal requests are offloaded to the thread pool, while text-only requests are processed directly on the event loop since they only do lightweight dict creation. This avoids serializing all requests through a single thread. No regression for text-only models. Multimodal shows a small throughput decrease (~4%), which is likely within run-to-run variance given the high absolute TTFT values (~18s). TTFT for multimodal is essentially unchanged (+0.1%).
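For illustration, the conditional dispatch described above looks roughly like the sketch below. The request shape, field name `mm_data`, and helper `_preprocess` are assumptions for this example, not the exact vLLM signatures.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def _preprocess(request: dict) -> dict:
    # Placeholder for real preprocessing (base64 decode, transforms, ...).
    return {"prompt": request.get("prompt"), "mm_inputs": request.get("mm_data")}

async def process_for_engine(request: dict) -> dict:
    if request.get("mm_data"):  # assumed field; multimodal payload present
        # Heavy path: run in the shared worker so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(_executor, _preprocess, request)
    # Text-only path: lightweight dict creation, cheaper to run inline
    # than to pay thread handoff and queue behind other requests.
    return _preprocess(request)
```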
Hi @scyyh11, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the hooks will run automatically on `git commit`.
Force-pushed from 6d12527 to 876e55a
4% is actually pretty significant. I suggest rerunning the benchmark a few more times to reduce the variance and be more confident in the result.
Force-pushed from 8a7c128 to d3abd89
Multimodal (`Qwen/Qwen2.5-VL-7B-Instruct`), rerun several times (avg ± std)
| Metric | main (avg ± std) | this PR (avg ± std) | diff |
|---|---|---|---|
| Throughput (req/s) | 4.77 ± 0.04 | 4.56 ± 0.16 | -4.4% |
| Output tok/s | 611.01 ± 6.35 | 583.28 ± 20.09 | -4.5% |
| Mean TTFT (ms) | 17264.45 ± 476 | 18565.44 ± 1622 | +7.5% |
| P99 TTFT (ms) | 28976.39 ± 385 | 30999.60 ± 1459 | +7.0% |
| Mean TPOT (ms) | 144.32 ± 1.00 | 150.37 ± 2.17 | +4.2% |
I agree that there is a gap, and it probably comes from thread-pool context-switching overhead. What are your implementation suggestions? Should I make this configurable?

Yes, like the other PR, we should make this opt-in.
`MockModelConfig` was missing the `renderer_num_workers` attribute introduced by the thread pool changes, causing 23 test failures in CI when `BaseRenderer.__init__` tried to read it. Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>
Head branch was pushed to by a user without write access
Both CI failures are known flaky tests on upstream main, tracked in open issues. Neither is related to our changes. Our PR does not touch speculative decoding, EAGLE, data parallelism, engine abort logic, KV connectors, or scheduler output handling.

Hi @DarkLight1337, the CI failed but not due to our changes. Could you manually merge?

unblock Language Models Test (Extended Pooling) & Multi-Modal Models (Extended Pooling)
…ock event loop (vllm-project#34789) Signed-off-by: Bvicii <yizhanhuang2002@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…ock event loop (vllm-project#34789) Signed-off-by: Bvicii <yizhanhuang2002@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Purpose
Fix event loop blocking caused by multimodal request preprocessing (base64 decoding, image transforms, HF processor operations) and chat template rendering. Under high concurrency, these synchronous CPU-bound operations block the asyncio event loop, causing the `/health`, `/v1/models`, and `/metrics` endpoints to become unresponsive (P95 latency >200ms, with spikes over 1s).

Changes:
- Add a shared `ThreadPoolExecutor` on `BaseRenderer` (size controlled by `--preprocessing-thread-pool-workers`, default 1)
- Wrap the blocking preprocessing paths in `HfRenderer`, `MistralRenderer`, `DeepseekV32Renderer`, and `Grok2Renderer` with the shared executor via `make_async` (see the sketch below)
- Merge `MistralRenderer`'s separate `ThreadPoolExecutor` into the shared one
- Route `clear_mm_cache` through the shared executor to avoid races with concurrent `process_inputs` on the `mm_processor_cache`
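A simplified sketch of the `make_async` wrapping pattern referenced above. This is a generic reimplementation for illustration; vLLM's actual helper may differ in details, and the wrapped lambda is a stand-in body.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor
from functools import partial

def make_async(func, executor: Executor | None = None):
    """Return an awaitable wrapper that runs `func` in `executor`."""
    async def _wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(executor, partial(func, *args, **kwargs))
    return _wrapper

# All renderers share one single-worker pool, so access to the
# non-thread-safe tokenizer stays serialized.
shared_pool = ThreadPoolExecutor(max_workers=1)
apply_chat_template_async = make_async(
    lambda messages: "\n".join(m["content"] for m in messages),  # stand-in body
    executor=shared_pool,
)
```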
Test Plan

Benchmarked on 1x NVIDIA A100-SXM4-80GB using `vllm bench serve` with `--request-rate 20 --num-prompts 200` and a custom high-concurrency benchmark with PaddleOCR-VL-1.5.

Tests performed:
- `vllm bench serve` with Llama-3.1-8B-Instruct (text-only, `--request-rate 20 --num-prompts 200`)
- `vllm bench serve` with Qwen2.5-VL-7B-Instruct (multimodal, `--request-rate 20 --num-prompts 200`)
- `--preprocessing-thread-pool-workers` comparison (1 vs 2 vs 4) with PaddleOCR-VL-1.5
1. Text-Only (`meta-llama/Llama-3.1-8B-Instruct`)

`vllm bench serve --request-rate 20 --num-prompts 200`

No regression. All metrics within noise.
2. Multimodal (`Qwen/Qwen2.5-VL-7B-Instruct`)

`vllm bench serve --request-rate 20 --num-prompts 200 --backend openai-chat --dataset-name random-mm`

No regression. All metrics within noise.
3. PaddleOCR-VL-1.5 High Concurrency (500 prompts, 300 concurrency)
Custom benchmark with real OmniDocBench document images.
Server args: `--max-num-batched-tokens 131072 --no-enable-prefix-caching --mm-processor-cache-gb 0 --gpu-memory-utilization 0.5`

Event loop stays fully responsive under high multimodal concurrency. The `/health` endpoint drops from 222 ms to <1 ms median.

4. `--preprocessing-thread-pool-workers` Comparison (PaddleOCR-VL-1.5, 500 prompts, 300 concurrency)

All worker counts perform identically. This is consistent with #34789 (comment).
The key improvement comes from offloading preprocessing off the event loop (so `/health` stays responsive), not from parallelizing it. The default of `workers=1` is sufficient.
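As a side note, one generic way to verify this kind of claim is to measure event-loop scheduling lag directly. The probe below is an illustrative stand-in for polling `/health`, not the benchmark used in this PR.

```python
import asyncio
import time

async def loop_lag_probe(interval: float = 0.01, duration: float = 5.0) -> float:
    """Return the worst observed event-loop scheduling lag, in seconds."""
    worst = 0.0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        t0 = time.monotonic()
        await asyncio.sleep(interval)
        # Any excess over `interval` is time the loop spent blocked.
        worst = max(worst, time.monotonic() - t0 - interval)
    return worst
```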
Summary

Offloading blocking preprocessing to the shared thread pool keeps liveness endpoints (e.g. `/health`) responsive under high multimodal concurrency, with no regression for text-only serving.