[KVConnector] Prevent async loads from exploding the KV Cache #29877
orozery wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces important resource management controls for asynchronous KV cache loading in the scheduler. By accounting for async requests in both max_num_running_reqs and token_budget, it effectively prevents them from overwhelming the GPU KV cache. The implementation is mostly correct and adds a new counter for async loading requests, which is correctly updated during the scheduling lifecycle.
I have one suggestion to improve the token budget logic to ensure full utilization. Overall, this is a valuable improvement for stability and performance.
Can you explain what you were observing that triggered this issue?

I'm testing with the OffloadingConnector, though any connector that uses async load can hit this.
Previously, async loaded requests were unbounded, both in the number of requests and the number of tokens. This commit accounts for async requests in both `max_num_running_reqs` and `token_budget`, preventing async loaded requests from unnecessarily filling up the entire GPU KV cache. Signed-off-by: Or Ozeri <oro@il.ibm.com>
@njhill - can you review this one?
I feel like the current implementation of this PR is not sufficient; that is why I turned it into a draft.
With the recent support for resuming preempted requests from CPU, the case described here seems less severe.
Previously, async loaded requests were unbounded, both in the number of requests and the number of tokens.
This PR accounts for async requests in both `max_num_running_reqs` and `token_budget`. This prevents cases where async loaded requests would unnecessarily fill up the entire GPU KV cache.
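To illustrate the accounting described above, here is a minimal, hypothetical sketch of the idea: async-loading requests are tracked by a dedicated counter and their tokens charged against the token budget, so admission checks see them alongside normally running requests. The class and method names below are illustrative, not vLLM's actual scheduler API.

```python
from dataclasses import dataclass


@dataclass
class SchedulerState:
    """Hypothetical scheduler bookkeeping (not vLLM's real data structures)."""
    max_num_running_reqs: int
    token_budget: int
    num_running_reqs: int = 0
    num_async_loading_reqs: int = 0  # requests whose KV cache is loading asynchronously
    tokens_in_flight: int = 0        # tokens currently charged against the budget

    def can_admit(self, num_new_tokens: int) -> bool:
        # Count async-loading requests against the running-request cap,
        # so they alone can never exceed max_num_running_reqs.
        total_reqs = self.num_running_reqs + self.num_async_loading_reqs
        if total_reqs >= self.max_num_running_reqs:
            return False
        # Count their tokens against the token budget, so async loads
        # cannot fill the entire GPU KV cache.
        return self.tokens_in_flight + num_new_tokens <= self.token_budget

    def start_async_load(self, num_tokens: int) -> None:
        # Called when the connector kicks off an async KV load.
        self.num_async_loading_reqs += 1
        self.tokens_in_flight += num_tokens

    def finish_async_load(self) -> None:
        # The request transitions from async-loading to running;
        # its tokens remain charged against the budget.
        self.num_async_loading_reqs -= 1
        self.num_running_reqs += 1
```

With this accounting, a burst of async loads is rejected once either limit is hit, instead of being admitted without bound as before.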