[KVConnector] Allow connector to protect GPU blocks from eviction #33353
orozery wants to merge 1 commit into vllm-project:main
Conversation
This commit introduces a new connector API allowing the scheduler-side connector to increase GPU block ref-counts, in order to prevent those blocks from being evicted. In particular, this is necessary for async offloading of sliding-window KV data, which is otherwise freed automatically by the KV cache manager as the window progresses. Signed-off-by: Or Ozeri <oro@il.ibm.com>
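The ref-count-based protection described above can be sketched as follows. This is a minimal illustration with hypothetical names (Block, BlockPool, lock, unlock), not the actual vLLM implementation: the connector pins blocks by bumping their ref-counts, and the eviction path only considers blocks with no outstanding references.

```python
# Hypothetical sketch of ref-count-based block protection; names and
# structure are assumptions for illustration, not vLLM's real classes.
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0


class BlockPool:
    def __init__(self, num_blocks: int) -> None:
        self.blocks = {i: Block(i) for i in range(num_blocks)}

    def lock(self, block_ids: list[int]) -> None:
        # Connector-requested lock: bump the ref-count so the block stays
        # alive even after the KV cache manager would otherwise free it
        # (e.g. as a sliding window progresses).
        for bid in block_ids:
            self.blocks[bid].ref_cnt += 1

    def unlock(self, block_ids: list[int]) -> None:
        # Drop the extra reference once async offloading completes.
        for bid in block_ids:
            assert self.blocks[bid].ref_cnt > 0
            self.blocks[bid].ref_cnt -= 1

    def evictable(self) -> list[int]:
        # Only blocks with zero references are eviction candidates.
        return [b.block_id for b in self.blocks.values() if b.ref_cnt == 0]
```

Under this sketch, a block locked for in-flight offloading never appears in the evictable set until the connector unlocks it.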
Code Review
This pull request introduces a valuable new API for KV connectors to lock and unlock GPU blocks, preventing them from being evicted. This is a crucial feature for scenarios like asynchronous offloading. The changes are well-structured, with clear API definitions in base.py and a solid implementation in the scheduler. The addition of has_work() to the scheduler interface is a logical extension to support this new functionality. The tests are comprehensive and cover the new locking/unlocking mechanism thoroughly. I have one suggestion to improve the robustness of the MultiConnector implementation.
def report_to_scheduler(self) -> KVConnectorSchedulerOutput | None:
    block_ids_to_lock: list[int] | None = None
    block_ids_to_unlock: list[int] | None = None
    for c in self._connectors:
        output = c.report_to_scheduler()
        if output is None:
            continue

        if output.block_ids_to_lock:
            if not block_ids_to_lock:
                block_ids_to_lock = output.block_ids_to_lock
            else:
                block_ids_to_lock.extend(output.block_ids_to_lock)

        if output.block_ids_to_unlock:
            if not block_ids_to_unlock:
                block_ids_to_unlock = output.block_ids_to_unlock
            else:
                block_ids_to_unlock.extend(output.block_ids_to_unlock)

    if not block_ids_to_lock and not block_ids_to_unlock:
        return None

    return KVConnectorSchedulerOutput(
        block_ids_to_lock=block_ids_to_lock,
        block_ids_to_unlock=block_ids_to_unlock,
    )
The current implementation of report_to_scheduler can introduce unintended side effects. The line if not block_ids_to_lock: block_ids_to_lock = output.block_ids_to_lock assigns the list from a sub-connector's output by reference. If a subsequent sub-connector also returns blocks to lock, block_ids_to_lock.extend(...) will mutate the list inside the first sub-connector's output object. This can lead to unexpected behavior if a sub-connector reuses its output object across calls.
A safer approach is to initialize local lists and always use extend to aggregate the block IDs. This avoids modifying child connector outputs and makes the code more robust and easier to reason about.
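The aliasing hazard can be shown with plain Python (this is an illustration with stand-in classes, not vLLM code): adopting a child's list by assignment and then calling extend() silently mutates the child's own output.

```python
# Stand-in output type mirroring only the field used in this example.
class ChildOutput:
    def __init__(self, ids):
        self.block_ids_to_lock = ids


first = ChildOutput([1, 2])
second = ChildOutput([3])

# Buggy aggregation: the first non-empty list is adopted by reference.
agg = None
for out in (first, second):
    if out.block_ids_to_lock:
        if not agg:
            agg = out.block_ids_to_lock  # aliases first.block_ids_to_lock
        else:
            agg.extend(out.block_ids_to_lock)  # mutates first's list too!

# The first child's output was silently modified:
assert first.block_ids_to_lock == [1, 2, 3]

# Safe aggregation: always extend a locally owned list.
first_safe = ChildOutput([1, 2])
agg_safe: list[int] = []
for out in (first_safe, second):
    agg_safe.extend(out.block_ids_to_lock)

# Child outputs are untouched.
assert first_safe.block_ids_to_lock == [1, 2]
```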
def report_to_scheduler(self) -> KVConnectorSchedulerOutput | None:
    all_block_ids_to_lock: list[int] = []
    all_block_ids_to_unlock: list[int] = []
    for c in self._connectors:
        output = c.report_to_scheduler()
        if output is None:
            continue
        if output.block_ids_to_lock:
            all_block_ids_to_lock.extend(output.block_ids_to_lock)
        if output.block_ids_to_unlock:
            all_block_ids_to_unlock.extend(output.block_ids_to_unlock)
    if not all_block_ids_to_lock and not all_block_ids_to_unlock:
        return None
    return KVConnectorSchedulerOutput(
        block_ids_to_lock=all_block_ids_to_lock or None,
        block_ids_to_unlock=all_block_ids_to_unlock or None,
    )
This pull request has merge conflicts that must be resolved before it can be merged.
Moving on to another direction (for now): #34805