[KVConnector] Prevent async loads from exploding the KV Cache#29877

Closed
orozery wants to merge 1 commit into vllm-project:main from orozery:scheduler-limit-async-loads
Conversation

Collaborator

@orozery orozery commented Dec 2, 2025

Previously, async-loaded requests were unbounded, both in the number of requests and the number of tokens.
This PR counts async requests against both max_num_running_reqs and token_budget.
This prevents cases where async-loaded requests unnecessarily fill up the entire GPU KV cache.
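The accounting described above can be sketched as follows. This is an illustrative model, not vLLM's actual scheduler code: the names (`num_async_loading_reqs`, `async_loading_tokens`, and the method names) are hypothetical, but they show the idea of counting in-flight async loads against both the request cap and the token budget before admitting new loads.

```python
# Hypothetical sketch of the PR's accounting: async-loading requests
# count toward both the running-request cap and the token budget.
class SchedulerSketch:
    def __init__(self, max_num_running_reqs: int, token_budget: int):
        self.max_num_running_reqs = max_num_running_reqs
        self.token_budget = token_budget
        self.num_running_reqs = 0
        self.num_async_loading_reqs = 0   # new counter introduced by the PR
        self.async_loading_tokens = 0

    def can_admit_async_load(self, num_tokens: int) -> bool:
        # In-flight async loads count toward max_num_running_reqs...
        total_reqs = self.num_running_reqs + self.num_async_loading_reqs
        if total_reqs >= self.max_num_running_reqs:
            return False
        # ...and their tokens count toward token_budget, so loads alone
        # can never claim the entire KV cache.
        return self.async_loading_tokens + num_tokens <= self.token_budget

    def start_async_load(self, num_tokens: int) -> None:
        assert self.can_admit_async_load(num_tokens)
        self.num_async_loading_reqs += 1
        self.async_loading_tokens += num_tokens

    def finish_async_load(self, num_tokens: int) -> None:
        # The request transitions from loading to running.
        self.num_async_loading_reqs -= 1
        self.async_loading_tokens -= num_tokens
        self.num_running_reqs += 1
```

With this bookkeeping, a schedule step that encounters many waiting requests with external tokens stops admitting new async loads once either limit is hit, instead of allocating blocks until the KVCacheManager fails.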

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces important resource management controls for asynchronous KV cache loading in the scheduler. By accounting for async requests in both max_num_running_reqs and token_budget, it effectively prevents them from overwhelming the GPU KV cache. The implementation is mostly correct and adds a new counter for async loading requests, which is correctly updated during the scheduling lifecycle.

I have one suggestion to improve the token budget logic to ensure full utilization. Overall, this is a valuable improvement for stability and performance.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@orozery orozery force-pushed the scheduler-limit-async-loads branch 2 times, most recently from c661a16 to 606e9fe on December 2, 2025 13:07
@robertgshaw2-redhat
Collaborator

Can you explain what you were observing that triggered this issue?

@orozery
Collaborator Author

orozery commented Dec 2, 2025

Can you explain what you were observing that triggered this issue?

I'm testing with the OffloadingConnector, though any connector that uses async load can hit this.
When running a workload with many concurrent requests that yield external tokens, the scheduler tries to allocate GPU blocks to accommodate all of those external tokens.
The scheduler imposes no limit on this.
Say you have 10,000 requests in self.waiting, all of them yielding external tokens.
In a single scheduler.schedule() step, GPU blocks will be allocated for all of these requests until the KVCacheManager fails to allocate.
So you may end up allocating blocks for thousands of requests.
But you cannot decode them all at the same time, since there is not enough GPU KV cache.
The result is preemptions, and even worse, requests that finished async loading (and transitioned to the running state) get preempted while new requests are still being loaded asynchronously.
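The failure mode described above can be made concrete with a toy calculation. All numbers here are made up for illustration, and the admission loops are a simplification of the real scheduler, but they show why unbounded admission over-commits the cache while a running-request cap does not.

```python
# Toy illustration of unbounded vs. bounded admission of async loads.
# All numbers are illustrative, not taken from vLLM.
GPU_BLOCKS = 1000        # total KV cache blocks available
BLOCKS_PER_REQ = 4       # blocks needed for one request's external tokens
MAX_RUNNING = 64         # requests that can actually run concurrently

waiting = range(10_000)  # many waiting requests, all with external tokens

# Unbounded admission: allocate blocks for every waiting request until
# the KV cache is exhausted (the KVCacheManager allocation fails).
free_blocks = GPU_BLOCKS
admitted_unbounded = 0
for _ in waiting:
    if free_blocks < BLOCKS_PER_REQ:
        break
    free_blocks -= BLOCKS_PER_REQ
    admitted_unbounded += 1

# Bounded admission (the PR's approach): also stop at the running cap,
# so loading requests cannot crowd out requests that are already running.
free_blocks = GPU_BLOCKS
admitted_bounded = 0
for _ in waiting:
    if admitted_bounded >= MAX_RUNNING or free_blocks < BLOCKS_PER_REQ:
        break
    free_blocks -= BLOCKS_PER_REQ
    admitted_bounded += 1

print(admitted_unbounded)  # 250: the cache fills up with loading requests
print(admitted_bounded)    # 64: admission stops at the running cap
```

In the unbounded case, 250 requests hold all 1000 blocks even though only 64 can run at once, so decoding the first 64 forces preemptions of the other 186 loaded-but-idle requests.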

@orozery orozery force-pushed the scheduler-limit-async-loads branch from 606e9fe to c0ba619 on December 2, 2025 16:16
Previously, async-loaded requests were unbounded,
both in the number of requests and the number of tokens.
This commit counts async requests against both max_num_running_reqs
and token_budget.
This prevents cases where async-loaded requests unnecessarily fill up the entire
GPU KV cache.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the scheduler-limit-async-loads branch from c0ba619 to f2e3c57 on December 2, 2025 19:37
@orozery orozery marked this pull request as draft December 2, 2025 20:04
@robertgshaw2-redhat robertgshaw2-redhat changed the title Scheduler: Prevent async loads from exploding the KV Cache [KVConnector]Prevent async loads from exploding the KV Cache Dec 4, 2025
@robertgshaw2-redhat robertgshaw2-redhat changed the title [KVConnector]Prevent async loads from exploding the KV Cache [KVConnector] Prevent async loads from exploding the KV Cache Dec 4, 2025
@robertgshaw2-redhat
Collaborator

@njhill - can you review this one?

@orozery
Collaborator Author

orozery commented Dec 4, 2025

I feel that the current implementation of this PR is not sufficient, which is why I turned it into a draft.
I'm still thinking it over.
I'd be happy to brainstorm over it offline, but in any case I will update here.

@orozery
Collaborator Author

orozery commented Jan 11, 2026

With the recent support for resuming preempted requests from CPU, the case described here seems less severe.
There are possible heuristic optimizations to balance KV cache usage between async loads and running requests.
But I think we can defer this until we have a real, justified use case.

@orozery orozery closed this Jan 11, 2026