[KVConnector] Prevent async loads from exploding the KV Cache #29877
orozery wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces important resource management controls for asynchronous KV cache loading in the scheduler. By accounting for async requests in both max_num_running_reqs and token_budget, it effectively prevents them from overwhelming the GPU KV cache. The implementation is mostly correct and adds a new counter for async loading requests, which is correctly updated during the scheduling lifecycle.
I have one suggestion to improve the token budget logic to ensure full utilization. Overall, this is a valuable improvement for stability and performance.
Can you explain what you were observing that triggered this issue?

I'm testing with the OffloadingConnector, though any connector that uses async load can hit this.
Previously, async loaded requests were unbounded, both in the number of requests and the number of tokens. This commit accounts for async requests in both `max_num_running_reqs` and `token_budget`, preventing async loaded requests from unnecessarily filling up the entire GPU KV cache. Signed-off-by: Or Ozeri <oro@il.ibm.com>
@njhill - can you review this one?
I feel like the current implementation of this PR is not sufficient; that is why I turned it into a draft.
With the recent support for resuming preempted requests from CPU, the case described here seems less severe.
Previously, async loaded requests were unbounded, both in the number of requests and the number of tokens.
This PR accounts for async requests in both `max_num_running_reqs` and `token_budget`. This prevents cases where async loaded requests would unnecessarily fill up the entire GPU KV cache.
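To illustrate the accounting described above, here is a minimal, hypothetical sketch of the idea: async-loading requests are tracked by a dedicated counter and their tokens charged against the token budget, so admission checks see them alongside normally running requests. The class and method names below are illustrative, not vLLM's actual scheduler API.

```python
from dataclasses import dataclass


@dataclass
class SchedulerState:
    """Hypothetical scheduler bookkeeping (not vLLM's real data structures)."""
    max_num_running_reqs: int
    token_budget: int
    num_running_reqs: int = 0
    num_async_loading_reqs: int = 0  # requests whose KV cache is loading asynchronously
    tokens_in_flight: int = 0        # tokens currently charged against the budget

    def can_admit(self, num_new_tokens: int) -> bool:
        # Count async-loading requests against the running-request cap,
        # so they alone can never exceed max_num_running_reqs.
        total_reqs = self.num_running_reqs + self.num_async_loading_reqs
        if total_reqs >= self.max_num_running_reqs:
            return False
        # Count their tokens against the token budget, so async loads
        # cannot fill the entire GPU KV cache.
        return self.tokens_in_flight + num_new_tokens <= self.token_budget

    def start_async_load(self, num_tokens: int) -> None:
        # Called when the connector kicks off an async KV load.
        self.num_async_loading_reqs += 1
        self.tokens_in_flight += num_tokens

    def finish_async_load(self) -> None:
        # The request transitions from async-loading to running;
        # its tokens remain charged against the budget.
        self.num_async_loading_reqs -= 1
        self.num_running_reqs += 1
```

With this accounting, a burst of async loads is rejected once either limit is hit, instead of being admitted without bound as before.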