[Frontend] Add server load limit with --max-server-load parameter #22805
scratch-ml wants to merge 8 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a valuable server load limiting feature to prevent server overload. The implementation is mostly correct and well-tested for single-threaded scenarios. However, I've identified a critical race condition in the load checking logic that could allow the server to exceed the configured maximum load under concurrent requests. Additionally, there's a potential for an AttributeError if the server load metric is not yet initialized. My review comment details these issues and suggests a path to resolution using asyncio.Lock to ensure atomicity.
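A minimal sketch of the `asyncio.Lock` fix the review suggests. The attribute names (`server_load_metrics`, `max_server_load`) are taken from the PR description; this is illustrative, not the PR's actual diff. Serializing the check-and-increment prevents two coroutines from both passing the limit check before either bumps the counter, and `getattr` guards against the uninitialized-metric `AttributeError`:

```python
import asyncio

# Module-level lock guarding the shared load counter (sketch, not PR code).
_load_lock = asyncio.Lock()


async def try_acquire_slot(state) -> bool:
    """Atomically check the load limit and claim a request slot."""
    async with _load_lock:
        # getattr avoids an AttributeError when the metric is not yet set.
        current = getattr(state, "server_load_metrics", 0)
        if state.max_server_load is not None and current >= state.max_server_load:
            return False
        state.server_load_metrics = current + 1
        return True
```

The caller would release the slot (decrement under the same lock) when the request finishes, including on exceptions.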
- Add `max_server_load` parameter to `FrontendArgs` for setting a concurrent request limit
- Initialize `max_server_load` state in the `init_app_state` function
- Add load-checking logic to the `load_aware_call` decorator
- Return HTTP 503 with a detailed error message when the server is overloaded
- Only effective when `--enable-server-load-tracking` is enabled
- Add comprehensive tests for the new functionality

This feature prevents server overload in production deployments by allowing administrators to set a maximum number of concurrent requests. When the limit is exceeded, new requests receive HTTP 503 responses with clear error messages.

Signed-off-by: scratch-ml <limingliang0527@gmail.com>
Could you elaborate on how this could happen? From my understanding, vLLM queues the excess requests and sends them into the engine only when the engine can take the next batch, so the engine core should not crash from having too many requests at once.
@DarkLight1337 In our multimodal services, we have observed that requests containing multiple images can cause the vLLM inference engine (version 0.8.5) to crash. We implemented a --max-image-num parameter to control the maximum number of images per inference request. When the batch size is 1, a single request with up to 50 images can be processed normally; however, when the batch size exceeds 1, the "Engine Dead" issue becomes reproducible. The trace log at the time of failure is as follows:

Therefore, in multimodal scenarios, the length of individual requests needs to be controlled, and the number of concurrent requests is just as crucial. Fundamentally, HTTP services need the ability to enforce concurrency limits. In most production environments, the sequence length distribution fluctuates minimally, making concurrency volume the critical factor affecting service quality. Consequently, when concurrency is high, actively rejecting requests protects the quality of service for in-flight requests on high-load instances: rejected requests can be promptly rescheduled rather than sitting in a waiting or pending state.

The above are my humble thoughts. If there are any inaccuracies, please feel free to point them out. Thank you for reading.
If OOM occurs during inference, it signals a bug inside vLLM. Which model were you using, and how did you serve it? Memory profiling has improved since v0.8.5, so if you upgrade vLLM this problem should no longer occur.
The model is a closed-source model, and the command we use to start the service is

Setting aside the engine crash issue, I still believe this new

Thank you for your help~
I think it's better to add a layer on top of vLLM that queries
I appreciate the suggestion to add a layer querying the

That said, I lean towards active rejection at the entry point instead. Checking

We also value input from @robertgshaw2-redhat and @njhill — your insights would certainly strengthen this approach.
njhill
left a comment
Thanks @scratch-ml, I think this is a nice lightweight approach for basic load-shedding.
Ultimately it would be good to reject inference requests based on an estimated queue waiting time, which would depend on the current contents of the running and waiting queues ... i.e. taking into account input / expected output token counts of running/queued reqs, per-token latency etc.
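The idea above could be sketched as a back-of-the-envelope estimate. All field names here are assumptions for illustration, not vLLM scheduler APIs:

```python
def estimated_wait_s(running, waiting, per_token_latency_s):
    """Rough queue-wait estimate: tokens still to be generated ahead of a
    new request, multiplied by the observed per-token latency."""
    # Tokens the engine still has to produce for in-flight requests.
    pending = sum(r.expected_output_tokens - r.generated_tokens for r in running)
    # Plus the full expected output of everything still waiting.
    pending += sum(r.expected_output_tokens for r in waiting)
    return pending * per_token_latency_s
```

A load-shedding policy could then reject a request whenever this estimate exceeds a configured SLO, rather than relying on a raw request count.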
chaunceyjiang
left a comment
Compared to #21352, this PR is indeed more lightweight. One question: does it work together with --api-server-count?
@chaunceyjiang good question ... no, it wouldn't work precisely with that, but if the value were large you could approximate the behaviour by dividing the max load value by the API server count.
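The division suggested above, sketched with hypothetical numbers (a global cap of 128 across 4 API server processes):

```python
# Approximate a global concurrency cap by splitting the budget evenly
# across API server processes (values here are hypothetical).
global_max_load = 128    # desired global limit
api_server_count = 4     # value passed to --api-server-count
per_server_max_load = global_max_load // api_server_count
print(per_server_max_load)  # 32
```

This is only an approximation: if load is unevenly distributed across processes, one process can reject requests while another still has headroom.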
Signed-off-by: scratch-ml <limingliang0527@gmail.com>
af06c3c to
1f1cb85
Compare
Signed-off-by: scratch-ml <limingliang0527@gmail.com>
1f1cb85 to
1dd742f
Compare
Signed-off-by: scratch-ml <limingliang0527@gmail.com>
Signed-off-by: scratch-ml <limingliang0527@gmail.com>
NickLucche
left a comment
Thanks a lot for contributing @scratch-ml !
My view on this kind of load-based decision is that the policy should sit one level "in front" of vLLM, with traditional proxies.
In particular, I think this change is mostly useful in a single-server scenario; it loses much of its value in a distributed setup, where a more structured load balancer is to be used anyway (e.g. the llm-d load-informed scheduler, to name one).
@NickLucche Thank you for your suggestion. I appreciate your perspective that implementing request rejection at the upper-layer proxy is the more elegant approach. That said, I believe this mechanism can still serve as a valuable fallback for scenarios without sophisticated scheduler infrastructure — for example, where users rely on a simple round-robin strategy with retry mechanisms. To some extent, an API server also acts as a request proxy, and it makes sense for proxies at different levels to possess varying degrees of rate-limiting capability. Thank you again for everyone's input. Appreciate the discussion!
Hello @njhill @NickLucche @chaunceyjiang, I wanted to check in on this PR. If there's a consensus that this feature isn't needed, I'm ready to close it. Alternatively, if you believe it has value, I'll continue refining it with your feedback until it meets all necessary standards. Appreciate your input and time.
I agree with @NickLucche’s point. In addition, I think that compared to #21352, the use cases for this PR are quite limited. Whether in distributed scenarios or with multiple API servers, I actually lean more toward an implementation based on a waiting queue.
Purpose
This PR adds server load limiting functionality to vLLM's OpenAI API server to prevent server overload in production environments.
Problem: Production vLLM deployments can become overwhelmed with too many concurrent requests, leading to poor performance, resource exhaustion, or server crashes. Currently, there's no built-in mechanism to limit concurrent requests and gracefully handle overload situations.
Solution: Add a new `--max-server-load` parameter that works with the existing `--enable-server-load-tracking` feature to gracefully reject requests when the server reaches its capacity limit.

Changes Made:

- Add a `max_server_load: Optional[int] = None` parameter to the `FrontendArgs` class in `cli_args.py`
- Initialize `max_server_load` state in the `init_app_state` function in `api_server.py`
- Extend the `load_aware_call` decorator in `utils.py` with load-checking logic
- Add `test_server_load_limit.py`

Benefits:
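A hedged sketch of how such a decorator could work. The real PR returns an HTTP 503 `JSONResponse`; here a plain exception stands in so the sketch has no FastAPI dependency, and the `app.state` attribute names are assumptions based on the description above:

```python
from functools import wraps


class ServerOverloadedError(Exception):
    """Stand-in for the HTTP 503 response described above."""


def load_aware_call(handler):
    @wraps(handler)
    async def wrapper(raw_request, *args, **kwargs):
        state = raw_request.app.state
        # The limit only applies when load tracking is enabled.
        if not getattr(state, "enable_server_load_tracking", False):
            return await handler(raw_request, *args, **kwargs)
        max_load = getattr(state, "max_server_load", None)
        if max_load is not None and state.server_load_metrics >= max_load:
            raise ServerOverloadedError(
                f"server load {state.server_load_metrics} >= limit {max_load}")
        state.server_load_metrics += 1
        try:
            return await handler(raw_request, *args, **kwargs)
        finally:
            # Decrement even on exceptions so the counter stays accurate.
            state.server_load_metrics -= 1
    return wrapper
```

Note that the plain check-then-increment shown here is exactly the pattern the code review flags as racy under concurrency; the reviewed fix is to wrap it in an `asyncio.Lock`.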
Test Plan
Unit Tests
Test Result
All unit test cases passed successfully, validating the server load limiting functionality. Load counter management works correctly during both normal operations and exception scenarios, ensuring accurate tracking of concurrent requests.
(Optional) Documentation Update
No documentation files require updates as this is an internal server feature. The functionality is self-documenting through:
- `--help` output with descriptive text
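For reference, a hypothetical invocation combining the proposed flag with the existing tracking flag might look like this (the model name is a placeholder, and `--max-server-load` is the option this PR proposes, not yet a released flag):

```shell
# Hypothetical: cap the API server at 100 concurrent requests.
# --enable-server-load-tracking must be on for the limit to take effect.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-server-load-tracking \
    --max-server-load 100
```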