[V1][Scheduler] Reject impossible waiting requests that exceed KV capacity #39828
Bortlesboat wants to merge 5 commits into vllm-project:main
Code Review
This pull request introduces logic to identify and reject requests that exceed the total KV cache capacity of the model. It adds a check in the scheduler to determine if a request can fit in an empty cache; if not, the request is finished with an error status and a descriptive stop reason. The changes include refactoring the KVCacheManager to support this check, updating the Scheduler to handle rejected requests via a new pending_outputs buffer, and adding a comprehensive test case to verify this behavior. I have no feedback to provide.
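The "can this request fit in an empty cache" check summarized above can be illustrated with a minimal sketch. This is not the PR's code or vLLM's `KVCacheManager` API: the function names, the single block pool, and the uniform block size are all assumptions made for illustration, and real configurations (for example SWA or mamba state) account for memory differently.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size KV blocks needed to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def fits_in_empty_cache(num_prompt_tokens: int, block_size: int, total_blocks: int) -> bool:
    """Hypothetical check: could the prompt fit if the whole cache were free?

    Assumes one pool of `total_blocks` blocks, each holding `block_size` tokens.
    """
    return blocks_needed(num_prompt_tokens, block_size) <= total_blocks


if __name__ == "__main__":
    # A cache of 4 blocks x 16 tokens holds at most 64 tokens.
    print(fits_in_empty_cache(64, block_size=16, total_blocks=4))
    print(fits_in_empty_cache(65, block_size=16, total_blocks=4))
```

A request that fails this check can never be scheduled no matter how many running requests finish, which is what distinguishes it from a request that is only temporarily short of free blocks.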
LGTM in general. Left a minor comment on nits.
One thing to note is that the root cause is still that the engine should auto-fit max_model_len within the kv cache space, while our current kv cache usage estimation is inaccurate for SWA and mamba state.
Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
Thanks @Bortlesboat, but I think we might not want this change. The check that is done is known up-front and so should be used to reject the request before it reaches the scheduler (making sure the effective max_model_len is correct and will always fit in an empty kvcache). And that should be the case now that #40946 and #41069 are merged. So it doesn't make sense to add complexity to the scheduler for this imo.
@Bortlesboat of course please let us know if you can reproduce the issue on the latest main branch.
Closing, agreed. Read through #40946 and #41069: the SWA admission-gate fix and
Summary
scheduler_reserve_full_isl=True and scheduler_reserve_full_isl=False
Root cause
A waiting request whose prompt could never fit inside the engine's full KV cache stayed at the head of the waiting queue. The scheduler broke out of the waiting loop without rejecting that request, so later schedulable requests never got a chance to run.
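The head-of-line blocking described above can be reproduced with a toy FIFO scheduler. This is only a sketch under stated assumptions: `Request`, `schedule_step`, and the `reject_impossible` flag are hypothetical names, not vLLM's scheduler or the PR's implementation, and KV memory is modeled as a single count of fixed-size blocks.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int


def blocks_for(num_tokens: int, block_size: int) -> int:
    """Ceiling division: blocks needed to hold num_tokens."""
    return -(-num_tokens // block_size)


def schedule_step(waiting, free_blocks, total_blocks, block_size, reject_impossible):
    """Toy waiting loop (hypothetical). Returns (scheduled_ids, rejected_ids)."""
    scheduled, rejected = [], []
    while waiting:
        req = waiting[0]
        needed = blocks_for(req.num_prompt_tokens, block_size)
        if reject_impossible and needed > total_blocks:
            # The request cannot fit even in an empty cache: drop it and move on.
            rejected.append(waiting.popleft().request_id)
            continue
        if needed > free_blocks:
            # Not enough free blocks right now: stop, leaving the head in place.
            break
        free_blocks -= needed
        scheduled.append(waiting.popleft().request_id)
    return scheduled, rejected


def demo(reject_impossible: bool):
    # "big" needs 63 blocks but the cache only has 4; "small" needs 1.
    waiting = deque([Request("big", 1000), Request("small", 16)])
    return schedule_step(waiting, free_blocks=4, total_blocks=4,
                         block_size=16, reject_impossible=reject_impossible)


if __name__ == "__main__":
    print(demo(reject_impossible=False))
    print(demo(reject_impossible=True))
```

Without the rejection branch, the loop breaks on "big" at every step and "small" is never reached. With it, "big" is removed from the queue and "small" is scheduled in the same step.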
Fixes #39734.
Why this is not duplicate work
#39734 on April 14, 2026
39734 and for KV cache capacity scheduler deadlock
Testing
PYTHONPATH=$PWD .venv/bin/python -m pytest tests/v1/core/test_scheduler.py -k "test_schedule_rejects_waiting_request_exceeding_kv_capacity or test_schedule or test_stop_via_update_from_output" -v
git diff --check
AI assistance
This change was prepared with AI assistance, then reviewed and validated locally before submission.