[Core][Feat] Add max-waiting-queue-length parameter to reject requests when waiting queue is full #27064
chaunceyjiang wants to merge 33 commits into vllm-project:main
/cc @robertgshaw2-redhat @njhill @hmellor PTAL.
Nice feature, we needed this and implemented it! However, we found a bug when using LoRA adapters: removing the request from the running queue fails when the request is aborted. Fix (vllm/v1/metrics/stats.py):

```python
def finish_request(self, req_state: 'RequestState'):
    if req_state.lora_name is None:
        return
    lora_stats = self.lora_name_to_stats[req_state.lora_name]
    lora_stats.waiting_requests.discard(req_state.request_id)
    lora_stats.running_requests.discard(req_state.request_id)
```

The issue is that …
…s when waiting queue is full Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
I’ve implemented another PR based on your suggestions. The new implementation avoids the input/output processing. Could you take another look?
Purpose
Feature implementation #18826
CLOSE #18826
CLOSE #21352
Test Plan
Test Result
TODO
Note
Implements a hard cap on the scheduler waiting queue and surfaces rejections to clients.
- `SchedulerConfig.max_waiting_queue_length` with CLI `--max-waiting-queue-length`; plumbs through `EngineArgs` into engine config
- … `REJECTED` event, and returns outputs with finish_reason `rejected`
- `FinishReason.REJECTED` (`"rejected"`) and `EngineCoreEventType.REJECTED`
- … `rejected` to `GenerationError` with `ServiceUnavailableError` (HTTP 503) and preserves error type/status in responses

Written by Cursor Bugbot for commit c67d75018f29288602ca75f8438bf0f0e0d02aa1.
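To make the behaviour in the summary concrete, here is a minimal, self-contained sketch of a capped waiting queue that rejects new requests once it is full and maps the rejection to HTTP 503. It is not the PR's code: `ToyScheduler`, `add_request`, and `http_status_for` are hypothetical names, and treating `None` as "no limit" is an assumption. Only the parameter name `max_waiting_queue_length`, the `"rejected"` finish reason, and the 503 status are taken from the summary.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FinishReason(Enum):
    # Only REJECTED ("rejected") is taken from the PR summary; STOP is filler.
    STOP = "stop"
    REJECTED = "rejected"


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler, not vLLM's actual class."""

    # Assumption: None means the waiting queue is unbounded.
    max_waiting_queue_length: Optional[int] = None
    waiting: deque = field(default_factory=deque)

    def add_request(self, request_id: str) -> Optional[FinishReason]:
        """Queue the request, or return REJECTED if the waiting queue is full."""
        limit = self.max_waiting_queue_length
        if limit is not None and len(self.waiting) >= limit:
            return FinishReason.REJECTED
        self.waiting.append(request_id)
        return None


def http_status_for(finish_reason: Optional[FinishReason]) -> int:
    """Map a rejection to 503 Service Unavailable; anything else to 200."""
    return 503 if finish_reason is FinishReason.REJECTED else 200


scheduler = ToyScheduler(max_waiting_queue_length=2)
results = [scheduler.add_request(f"req-{i}") for i in range(3)]
statuses = [http_status_for(r) for r in results]
print(statuses)  # [200, 200, 503]
```

The third request is turned away because two requests are already waiting. Rejecting at admission time keeps the queue depth bounded, so clients get a fast 503 and can retry or fail over instead of waiting behind an arbitrarily long backlog.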