
[V1][Scheduler] Reject impossible waiting requests that exceed KV capacity#39828

Closed
Bortlesboat wants to merge 5 commits into vllm-project:main from Bortlesboat:codex/vllm-scheduler-kv-capacity-39734

Conversation

@Bortlesboat
Contributor

Summary

  • reject waiting requests that cannot fit even in an empty KV cache
  • continue scheduling later waiting requests instead of leaving the waiting queue stuck behind an impossible request
  • emit an explicit scheduler error output for the rejected request
  • add regression coverage for both scheduler_reserve_full_isl=True and scheduler_reserve_full_isl=False

Root cause

A waiting request whose prompt could never fit inside the engine's full KV cache stayed at the head of the waiting queue. The scheduler broke out of the waiting loop without rejecting that request, so later schedulable requests never got a chance to run.

Fixes #39734.

Why this is not duplicate work

  • checked the current discussion on issue #39734 on April 14, 2026
  • searched open PRs for 39734 and for KV cache capacity scheduler deadlock
  • did not find an existing open fix covering this waiting-queue rejection path

Testing

  • PYTHONPATH=$PWD .venv/bin/python -m pytest tests/v1/core/test_scheduler.py -k "test_schedule_rejects_waiting_request_exceeding_kv_capacity or test_schedule or test_stop_via_update_from_output" -v
  • git diff --check

AI assistance

This change was prepared with AI assistance, then reviewed and validated locally before submission.

Signed-off-by: Andrew <andre@Andrews.localdomain>
@mergify mergify Bot added the v1 label Apr 14, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces logic to identify and reject requests that exceed the total KV cache capacity of the model. It adds a check in the scheduler to determine if a request can fit in an empty cache; if not, the request is finished with an error status and a descriptive stop reason. The changes include refactoring the KVCacheManager to support this check, updating the Scheduler to handle rejected requests via a new pending_outputs buffer, and adding a comprehensive test case to verify this behavior. I have no feedback to provide.

@Bortlesboat Bortlesboat changed the title [codex] Reject impossible waiting requests that exceed KV capacity [V1][Scheduler] Reject impossible waiting requests that exceed KV capacity Apr 20, 2026
@Bortlesboat Bortlesboat marked this pull request as ready for review April 20, 2026 04:45

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@ivanium ivanium left a comment


LGTM in general. Left a minor comment on nits.

One thing to note: the underlying root cause remains that the engine should auto-fit max_model_len within the KV cache space, but our current KV cache usage estimation is inaccurate for SWA and Mamba state.

Comment thread on vllm/v1/core/sched/scheduler.py (Outdated)
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 26, 2026
Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
@njhill
Member

njhill commented Apr 30, 2026

Thanks @Bortlesboat, but I think we might not want this change. The check that is done is known up-front and so should be used to reject the request before it reaches the scheduler (making sure the effective max_model_len is correct and will always fit in an empty kvcache).

And that should be the case now that #40946 and #41069 are merged.

So it doesn't make sense to add complexity to the scheduler for this imo.
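The alternative outlined here, validating at the engine front door rather than in the scheduler, might look like this minimal sketch; `validate_request` and its parameter names are hypothetical, not vLLM's API:

```python
def validate_request(num_prompt_tokens: int,
                     max_model_len: int,
                     kv_capacity_tokens: int) -> None:
    """Reject an impossible request before it reaches the scheduler.

    If the effective max_model_len is clamped to what an empty KV cache
    can hold, this single up-front check makes a scheduler-side guard
    unnecessary.
    """
    effective_max_len = min(max_model_len, kv_capacity_tokens)
    if num_prompt_tokens > effective_max_len:
        raise ValueError(
            f"prompt length {num_prompt_tokens} exceeds effective "
            f"max_model_len {effective_max_len}"
        )
```

The design point is that the condition depends only on static quantities (prompt length and total capacity), so it never needs to be re-evaluated inside the scheduling loop.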

@njhill
Member

njhill commented Apr 30, 2026

@Bortlesboat of course please let us know if you can reproduce the issue on the latest main branch.

@Bortlesboat
Contributor Author

Closing — agreed. Read through #40946 and #41069: the SWA admission-gate fix and num_gpu_blocks_override accounting handle this at the right layer (engine validation, not scheduler). The scheduler-side guard here would have masked the real divergence between startup pool sizing and runtime admission that #40946 fixes. Thanks for the pointer.


Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Scheduler deadlocks when request exceeds KV cache capacity but is within max_model_len

4 participants