[Core] Add max-waiting-queue-length parameter to reject requests when queue is full #21271
chudyandrej wants to merge 3 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request introduces a valuable feature for managing server load by limiting the waiting queue length. My review focuses on improving the correctness and robustness of the error handling. I've identified some unreachable code in the exception handling logic and a bug where streaming responses would not receive the correct HTTP 503 error when the queue is full. Addressing these points will make the implementation more robust and prevent unexpected behavior in production.
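The review's two points (unreachable exception handlers and streaming responses not getting a 503) both come down to handler ordering: the specific queue-full error must be caught before any of its base classes. A minimal sketch of that ordering, using a hypothetical `SchedulerWaitingQueueFullError` modeled on the PR's custom exception (the status-mapping function is illustrative, not vLLM's actual code):

```python
class SchedulerWaitingQueueFullError(ValueError):
    """Hypothetical error raised when the waiting queue is at capacity."""


def map_to_status(exc: Exception) -> int:
    # Catch the specific subclass *before* its base class; if the
    # ValueError branch came first, the 503 branch below it would be
    # unreachable code -- exactly the issue flagged in the review.
    try:
        raise exc
    except SchedulerWaitingQueueFullError:
        return 503  # Service Unavailable: waiting queue is full
    except ValueError:
        return 400  # generic bad request
    except Exception:
        return 500  # fallback


print(map_to_status(SchedulerWaitingQueueFullError()))  # 503
print(map_to_status(ValueError()))                      # 400
```

For streaming endpoints the same check must happen before the first chunk is sent, since the HTTP status code cannot be changed once the response body has started.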
Force-pushed from da2a499 to 7aae8e9
Signed-off-by: Andrej Chudý <achudy03@gmail.com> Signed-off-by: Andrej Chudý <achudy03@gmail.com>
…tring Signed-off-by: Andrej Chudý <achudy03@gmail.com>
Force-pushed from 7aae8e9 to ab4d940
Hello, and thank you for your PR. V0 is in the process of being deprecated. I think this is a useful feature, so I would be happy to review it in the V1 code path.
About two months ago, I submitted an RFC: #18826. |
Force-pushed from da8731c to 7bf949c
Signed-off-by: Andrej Chudý <achudy03@gmail.com>
Force-pushed from 7bf949c to 891d9ba
@chaunceyjiang Thanks for your comment; I totally missed this complexity. That's indeed a building block that is currently missing. Do you have a quick workaround in mind that could unblock this PR? Or do you believe that cross-process error reporting needs to be implemented first?
I can imagine a counter on the serving layer. Something like
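The comment above is truncated, but a serving-layer counter as a workaround might look roughly like the following sketch. Everything here is hypothetical (the `WaitingQueueLimiter` class and its methods are not vLLM APIs); the idea is just to track in-flight requests in the API server process and reject new ones past a limit, without needing cross-process error reporting:

```python
import asyncio


class WaitingQueueLimiter:
    """Hypothetical serving-layer counter: tracks in-flight requests
    and rejects new ones once a configured limit is reached."""

    def __init__(self, max_waiting: int) -> None:
        self.max_waiting = max_waiting
        self._in_flight = 0
        self._lock = asyncio.Lock()

    async def try_acquire(self) -> bool:
        async with self._lock:
            if self._in_flight >= self.max_waiting:
                return False  # caller should respond with HTTP 503
            self._in_flight += 1
            return True

    async def release(self) -> None:
        async with self._lock:
            self._in_flight -= 1


async def demo() -> list:
    limiter = WaitingQueueLimiter(max_waiting=2)
    results = [await limiter.try_acquire() for _ in range(3)]
    await limiter.release()  # one request finishes, freeing a slot
    results.append(await limiter.try_acquire())
    return results


print(asyncio.run(demo()))  # [True, True, False, True]
```

The trade-off versus the scheduler-side check in this PR is that the counter approximates queue depth at the serving layer rather than measuring the scheduler's actual waiting queue.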
Hi @chudyandrej, I've submitted a PR (#21352) that fully implements error propagation: custom errors can now be passed from P1 to P0. If you don't mind, I'd like to add you to the co-authors list. Then we can shift our focus to reviewing #21352. What do you think?
Sounds good. Okay, so let's close this one. |
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.

Purpose
Implements a `--max-waiting-queue-length` parameter to allow vLLM to reject new requests when the waiting queue reaches a specified limit, providing better load management for production environments.

Addresses: #2901 #3168 #4190
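The rejection behavior described above can be sketched as follows. This is a minimal illustration of the mechanism, not vLLM's actual `Scheduler` class; the error name mirrors the PR's custom exception, and the method signature is simplified:

```python
from collections import deque


class SchedulerWaitingQueueFullError(RuntimeError):
    """Hypothetical exception mirroring the PR's custom error."""


class Scheduler:
    """Simplified sketch: reject new sequence groups when the waiting
    queue has reached max_waiting_queue_length (None = unlimited)."""

    def __init__(self, max_waiting_queue_length=None):
        self.max_waiting_queue_length = max_waiting_queue_length
        self.waiting = deque()

    def add_seq_group(self, seq_group) -> None:
        if (self.max_waiting_queue_length is not None
                and len(self.waiting) >= self.max_waiting_queue_length):
            raise SchedulerWaitingQueueFullError(
                f"waiting queue is full "
                f"(limit={self.max_waiting_queue_length})")
        self.waiting.append(seq_group)


sched = Scheduler(max_waiting_queue_length=1)
sched.add_seq_group("req-1")      # accepted
try:
    sched.add_seq_group("req-2")  # rejected: queue is at the limit
except SchedulerWaitingQueueFullError as e:
    print("rejected:", e)
```

The serving layer would translate this exception into an HTTP 503 so clients can back off and retry.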
Key changes:
- Add a `max_waiting_queue_length` field to `SchedulerConfig` with CLI argument support
- Enforce the limit in `Scheduler.add_seq_group()` with a custom `SchedulerWaitingQueueFullError`
- Handle the error in the serving layers (`serving_chat.py`, `serving_completion.py`, `serving_engine.py`)
- Propagate the error through `AsyncLLMEngine`

Benefits:
Test Plan