[BUGFIX] Make HF input preprocessing async to prevent frontend event loop blocking #33337
yzhu802 wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: Yufeng Zhu <yzhu802@gatech.edu>
Code Review
This pull request effectively addresses the issue of the frontend event loop being blocked by moving the synchronous process_inputs call to a ThreadPoolExecutor. The implementation is clear, and the use of a lock to maintain thread safety for the InputProcessor is a practical solution for this bugfix. My review includes one suggestion to enhance the shutdown process for the newly introduced thread pool, ensuring a more graceful termination.
```python
if executor := getattr(self, "_input_processor_executor", None):
    executor.shutdown(wait=False)
```
Calling shutdown(wait=False) will not cancel pending tasks in the executor. This can delay the process exit if there are long-running preprocessing tasks in the queue.
In Python 3.9+, ThreadPoolExecutor.shutdown() accepts a cancel_futures=True argument, which will cancel any pending tasks that have not started. This allows for a more graceful and faster shutdown.
This suggestion adds a version check to use this feature where available, making the shutdown process more robust.
Suggested change:

```python
if executor := getattr(self, "_input_processor_executor", None):
    import sys
    if sys.version_info >= (3, 9):
        executor.shutdown(wait=False, cancel_futures=True)
    else:
        executor.shutdown(wait=False)
```
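To illustrate the behavior the suggestion relies on, here is a minimal standalone demo (not vLLM code) showing that `shutdown(cancel_futures=True)` cancels queued tasks that have not started, while the task already running is allowed to finish:

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
futures = [executor.submit(time.sleep, 0.2) for _ in range(5)]

# Give the first task time to start; the remaining four are still queued.
time.sleep(0.05)
executor.shutdown(wait=True, cancel_futures=True)

states = ["cancelled" if f.cancelled() else "done" for f in futures]
print(states)  # the running task completes, the queued ones are cancelled
```

With `wait=False` instead, the call returns immediately but any task already executing still runs to completion, which is why long preprocessing jobs can delay process exit.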
I tried something similar to this a while ago and found a modest improvement in some cases: #17831. But eventually we decided that API server scale-out is easier to do and more effective.
@DarkLight1337 , Thanks for pointing out #17831! I fully agree that api-server scale-out is the most effective way to improve overall throughput and handle CPU-intensive tasks. My PR is actually intended to complement that approach rather than replace it. While multi-processing increases the total processing capacity, ensuring the event loop remains non-blocking is still critical for system stability, especially under bursty traffic. Here is why I believe this patch is necessary:
In short, while scale-out raises the ceiling, this async fix raises the floor for stability. Given that the changes are contained and use a coarse-grained lock for safety, I think it offers a valuable robustness improvement for edge cases.
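The general pattern this PR applies can be sketched as follows. This is an illustrative sketch, not the actual vLLM code: `process_inputs` and `handle_request` are hypothetical stand-ins for the real preprocessing and request-handling paths. The blocking call is offloaded to a thread pool via `run_in_executor`, so the asyncio event loop stays responsive while preprocessing runs:

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)
_lock = threading.Lock()  # serialize access to a non-thread-safe processor

def process_inputs(request: str) -> dict:
    # Stand-in for the CPU-intensive, blocking HF preprocessing call.
    with _lock:
        return {"tokens": request.upper()}

async def handle_request(request: str) -> dict:
    loop = asyncio.get_running_loop()
    # The event loop is free to serve other coroutines while this runs.
    return await loop.run_in_executor(_executor, process_inputs, request)

result = asyncio.run(handle_request("hello"))
print(result)  # {'tokens': 'HELLO'}
```

The key property is that `await loop.run_in_executor(...)` suspends only the calling coroutine; other connections, heartbeats, and forwarding tasks keep running on the loop.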
Previously, @DarkLight1337 @njhill and I have had some attempts comparing running the HF processor in MT vs MP - if we were to add MT for the async processor we also need to consider the fact that it will not work with the multimodal feature cache at the moment IIUC. Have you tried to compare this PR to the main branch with
Thanks @yzhu802! I agree it's probably good to do something like this. I don't think the lock is needed though and I don't see why > 1 thread is needed. We should be able to just use
Just to have a clear understanding, were these comparisons relative to a baseline with |
Signed-off-by: Yufeng Zhu <yzhu802@gatech.edu>
Hi @yzhu802, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Thanks for the valuable feedback! I have updated the PR to address the complexity concerns and conducted additional benchmarks with `--api-server-count 2` as requested.

1. Code Simplification

@njhill, you are absolutely right. Since the goal is simply to offload the CPU-intensive preprocessing from the asyncio event loop, rather than achieving parallel execution, I have:

Note on Thread Safety & Cache: Using `max_workers=1` effectively serializes access to the HF processor/cache, mimicking the original serial behavior but without blocking the main event loop. This should mitigate the concerns regarding multimodal feature cache thread-safety mentioned by @ywang96.

2. Benchmark with api-server-count=2

I ran the stress test again comparing main vs. this PR with `--api-server-count 2`. While there is minor overhead in throughput and TTFT due to thread context switching (as expected with the introduction of the thread pool), the critical improvement is the elimination of request failures caused by a frozen event loop (0.00% error rate vs. 3.73% error rate).

3. Context on Downstream Orchestration (vllm-omni)

There is an architectural reason why keeping the event loop non-blocking is critical for us. In vllm-omni, the frontend process acts as a generic orchestrator. It must continuously forward intermediate results from Stage 0 to Stage 1. If the event loop blocks during HF preprocessing, forwarding halts, causing pipeline stalls across the entire multi-stage setup. This fix ensures the frontend remains responsive to coordinate the pipeline, which is a requirement for advanced multimodal orchestration.

I have uploaded the reproduction scripts to https://github.com/yzhu802/vllm-stress-test for reference.
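The serialization claim above can be demonstrated with a small standalone script (not vLLM code): with `max_workers=1`, tasks submitted to the executor never overlap, so a non-thread-safe processor or cache sees strictly serial access even without an explicit lock:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

active = 0
max_active = 0
guard = threading.Lock()  # only protects the counters in this demo

def preprocess(i: int) -> int:
    global active, max_active
    with guard:
        active += 1
        max_active = max(max_active, active)
    # ... CPU-intensive preprocessing would happen here ...
    with guard:
        active -= 1
    return i

with ThreadPoolExecutor(max_workers=1) as executor:
    results = list(executor.map(preprocess, range(8)))

print(max_active)  # 1 - at most one preprocessing call runs at a time
print(results)     # submission order preserved
```

A single worker gives the thread-safety guarantees of serial execution while still keeping the submitting event loop unblocked.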
Signed-off-by: Yufeng Zhu <yzhu802@gatech.edu>
Given the overhead, I think we should make this configurable and have it set to off by default in the main repo.
Thanks @yzhu802, yes based on those results this change appears to actually degrade overall performance in all cases? To clarify - was

Do you think you could repeat the same test with

Also in these runs, what was the attained QPS in each case? If much lower than the load driver QPS then we're kind of testing an overloaded server situation where we will want to do some load-shedding / rate-limiting anyhow (there's other in-flight work for that...).

Other observations:
Thanks @njhill and @DarkLight1337 — these are all very fair points! 🙂

Experiment Clarification

Yes, the comparison is fair: both runs use the same end-to-end stress test, with the only difference being

For the second experiment, the key parameters are:

All other settings are kept identical across runs. Full scripts are available here: https://github.com/yzhu802/vllm-stress-test

On Performance vs. Saturation

You're absolutely right that these experiments operate in a saturated / overloaded regime. In a typical inference deployment, this is where load-shedding or rate-limiting should eventually kick in, and I agree this PR is not meant to address that. Similarly, comparing a single background thread (with context-switch overhead) against the main event loop is not intended to demonstrate a throughput win. Achieving real speedups would require finer-grained locking or true parallelism, which is out of scope here. (Note: in the degenerate case of `api-server-count = 1`, this change can also incidentally improve TTFT and throughput when a considerable percentage of requests are videos, but that is not the primary motivation.)

Intent of this PR: liveness, not throughput

The core goal of this PR is liveness. In the current main branch, synchronous HF preprocessing can block the asyncio event loop entirely under load. When that happens, the frontend process becomes unresponsive: it cannot accept new connections, maintain heartbeats, or forward intermediate results. This leads to failure modes that are independent of model inference speed. With this change, the event loop remains alive even when the system is saturated. While there is a small and expected performance overhead, the frontend avoids pathological failures such as connection timeouts caused by a frozen event loop.

This property is particularly important for orchestration-heavy setups like vllm-omni (an official downstream project under the vLLM org), where the frontend event loop acts as a coordinator between multiple stages and must continue forwarding results even when preprocessing large multimodal inputs.

Configurability

Given the architectural nature of this requirement and the fact that it introduces a small overhead for standard users, I fully agree this should be configurable and off by default. My proposal: introduce a flag (e.g. `--enable-async-preprocessor`, default = False).

Future Work

Longer-term, I'm happy to propose an RFC exploring finer-grained locking or making the multimodal cache thread-safe to enable true parallel preprocessing. I intentionally kept this PR minimal and localized to reduce review and correctness risk.

Happy to update the PR accordingly if this direction sounds reasonable.
This pull request has merge conflicts that must be resolved before it can be merged.
Closing as superseded by #34789 |
Purpose
Fix the issue where the frontend process's event loop becomes blocked under medium-to-high concurrency with multimodal requests, due to synchronous preprocessing in the Hugging Face (HF) processor. This improvement helps enhance the performance of the vLLM framework in multimodal scenarios.
In addition, this change is critical for concurrent testing in the downstream framework vllm-omni. In multi-stage orchestration, the frontend process is responsible for forwarding the computation results of each stage to the next one. If the event loop is blocked, forwarding is delayed, leading to severe pipeline stalls.
Test Plan
The service was deployed on 2× A800 GPUs (tp=2) using Kubernetes, and end-to-end stress tests were conducted under production-like conditions. To eliminate network latency jitter, the client and server communicated via ClusterIP within the cluster.
The test data distribution consisted of 40% text-only, 30% video, and 30% image requests. Samples were drawn with replacement from a large dataset, making cache hits unlikely.
Tests were conducted at QPS = 5, 10, 20, and 50, with the client sending requests at intervals following a Poisson distribution. Each test lasted 10 seconds. The server pod was restarted between tests to eliminate caching effects.
(Test files are not included in this PR. If needed, I can add them to the PR or share them through other means.)
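The Poisson arrival pattern described above corresponds to exponentially distributed inter-arrival times with mean `1/QPS`. A minimal sketch of such a load schedule (values are illustrative; this is not the actual test script, which lives in the linked repository):

```python
import random

random.seed(0)
qps = 20.0        # target request rate
duration_s = 10.0  # length of each test run

# Poisson arrivals: gaps between requests are Exponential(rate=qps).
send_times, t = [], 0.0
while True:
    t += random.expovariate(qps)
    if t >= duration_s:
        break
    send_times.append(t)

# Expected request count is roughly qps * duration_s (about 200 here).
print(len(send_times))
```

A load driver would then sleep until each timestamp in `send_times` and fire a request, independent of whether earlier requests have completed, which is what makes the server saturate at high QPS.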
Test Result
Under moderate concurrency (QPS = 20), this simple fix resulted in a 25% reduction in average TTFT and a 37% reduction in p99 TTFT.
Under high concurrency (QPS = 50), it eliminated ServerTimeoutError issues caused by frontend process blocking. With a timeout threshold of 10 seconds, 11.83% of requests triggered this network error in the original implementation.
| QPS : 20.0 | before bugfix | after bugfix |
|---|---|---|
| Throughput (tok/s) | 139.98 | 152.79 |
| Avg TTFT (ms) | 3455.25 | 2611.99 |
| P99 TTFT (ms) | 7226.91 | 4604.19 |
| Error Rate | 0.00% | 0.00% |

| QPS : 50.0 | before bugfix | after bugfix |
|---|---|---|
| Throughput (tok/s) | 156.83 | 150.32 |
| Avg TTFT (ms) | 10621.45 | 10669.77 |
| P99 TTFT (ms) | 18253.06 | 18711.43 |
| Error Rate | 11.83% | 0.00% |
This is a minimal-change solution and is not performance-optimal. To ensure thread safety under multi-threaded execution and to keep all changes contained within a single file for easier review, I intentionally chose a relatively coarse lock granularity.
If you are interested in further improving performance in this scenario, I would be happy to propose an RFC to address it.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.