[WIP][kv_offload] Decouple store policy and request lifecycle from the scheduler#40625
hickeyma wants to merge 5 commits into vllm-project:main
Code Review
This pull request refactors the KV offloading architecture by consolidating core abstractions into a new base module and introducing a policy-driven approach for storage decisions via the OffloadPolicy interface. Key feedback identifies a critical race condition in the scheduler's request teardown logic where the removal of in-flight transfer checks could lead to premature GPU memory freeing. Additionally, the StoreOnComputePolicy incorrectly advances the store watermark for blocks filtered by the manager, which may prevent eligible blocks from being offloaded in subsequent steps.
-        request_being_stored = req_id in self._reqs_being_stored
-        return request_being_stored, None
+        return self.manager.request_finished(req_id), None
The logic for checking in-flight transfers during request teardown has been removed. The previous implementation checked req_id in self._reqs_being_stored to ensure that GPU blocks are not freed while an asynchronous store operation is still reading from them. By delegating this entirely to self.manager.request_finished(req_id), and given that the current OffloadingManager implementations (like CPUOffloadingManager) do not track the scheduler's in-flight transfers, this creates a critical race condition where GPU memory could be freed and reused while a transfer is still in progress. The scheduler must maintain its own check for transfers it has initiated to prevent use-after-free or data corruption.
-        return self.manager.request_finished(req_id), None
+        return self.manager.request_finished(req_id) or req_id in self._reqs_being_stored, None
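To make the suggested change concrete, here is a minimal, self-contained sketch of the teardown check. It is not the PR's code: `StubManager` and `TeardownSchedulerSketch` are hypothetical stand-ins, `_reqs_being_stored` is modeled as a plain set, and the sketch assumes a `True` return value means "delay freeing the GPU blocks".

```python
from typing import Optional


class StubManager:
    """Hypothetical stand-in for a manager that tracks no in-flight transfers."""

    def request_finished(self, req_id: str) -> bool:
        # Assumption: True means "delay freeing GPU blocks".
        # This stub never asks for a delay, mirroring the concern in the review.
        return False


class TeardownSchedulerSketch:
    """Hypothetical scheduler that keeps its own in-flight store bookkeeping."""

    def __init__(self, manager: StubManager) -> None:
        self.manager = manager
        # Assumption: request IDs with an asynchronous store still in progress.
        self._reqs_being_stored: set[str] = set()

    def request_finished(self, req_id: str) -> tuple[bool, Optional[dict]]:
        # Delay the free if either the manager or the scheduler's own
        # bookkeeping says the blocks may still be read by a transfer.
        delay_free = (
            self.manager.request_finished(req_id)
            or req_id in self._reqs_being_stored
        )
        return delay_free, None


sched = TeardownSchedulerSketch(StubManager())
sched._reqs_being_stored.add("req-a")  # pretend a store for req-a is in flight
in_flight_result = sched.request_finished("req-a")
idle_result = sched.request_finished("req-b")
```

With this shape, a manager that knows nothing about scheduler-initiated transfers cannot cause the blocks of `req-a` to be freed early, while `req-b` is released immediately.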
)
continue

self._next_stored_block_idx[req_id][0] = num_blocks
The store watermark is advanced even if no blocks were actually submitted for storage (e.g., if the manager filtered them out via FilterReusedOffloadingManager). This means that blocks skipped due to a reuse threshold will never be reconsidered for offloading in subsequent steps, even after they have been reused enough times to pass the filter. The watermark should only be advanced for blocks that the manager has accepted for storage, or the policy should track skipped blocks to retry them in future steps.
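One way to read the second option (tracking skipped blocks) is sketched below. This is an illustration under assumptions, not the PR's implementation: `RetryingStorePolicySketch`, its `blocks_to_store` method, and the `accepts` callback (standing in for the manager's reuse filter) are all hypothetical, and blocks are identified by plain integer indices.

```python
from collections import defaultdict
from typing import Callable


class RetryingStorePolicySketch:
    """Hypothetical policy that remembers blocks the filter rejected."""

    def __init__(self, accepts: Callable[[int], bool]) -> None:
        # Assumption: stand-in for the manager's filtering decision.
        self._accepts = accepts
        # Per-request watermark: first block index not yet considered.
        self._next_block_idx: dict[str, int] = defaultdict(int)
        # Per-request indices that were considered but rejected so far.
        self._skipped: dict[str, set[int]] = defaultdict(set)

    def blocks_to_store(self, req_id: str, num_blocks: int) -> list[int]:
        start = self._next_block_idx[req_id]
        # Reconsider previously skipped blocks together with the new ones.
        candidates = sorted(self._skipped[req_id] | set(range(start, num_blocks)))
        accepted = [idx for idx in candidates if self._accepts(idx)]
        # Rejected blocks stay pending instead of being lost behind the watermark.
        self._skipped[req_id] = set(candidates) - set(accepted)
        self._next_block_idx[req_id] = max(start, num_blocks)
        return accepted


# Hypothetical reuse counts; the filter accepts a block once it has 2 reuses.
reuse_counts = {0: 2, 1: 0, 2: 2}
policy = RetryingStorePolicySketch(lambda idx: reuse_counts[idx] >= 2)

first = policy.blocks_to_store("req-a", 3)
reuse_counts[1] = 2  # block 1 has now been reused enough to pass the filter
second = policy.blocks_to_store("req-a", 3)
third = policy.blocks_to_store("req-a", 3)
```

Here block 1 is rejected on the first step, retried and accepted on the second, and nothing is resubmitted on the third. The cost is one extra set per request, which would need to be dropped at request teardown.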
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Address code review from gemini-code-assist: add the manager imports to the relevant methods to avoid a potential circular dependency with `vllm.v1.kv_offload.cpu.manager`. vllm-project#40538 (comment)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
…eduler

This commit helps separate three concerns that were previously conflated in `OffloadingConnectorScheduler`: transfer lifecycle, store policy, and request teardown. It achieves this by:

- Adding a `request_finished` lifecycle hook to `OffloadingManager` so the scheduler can ask the manager whether GPU blocks are safe to free, rather than maintaining that knowledge itself via an inline dict check.
- Extracting the hardcoded store-on-compute logic into a pluggable interface, `OffloadPolicy`. `StoreOnComputePolicy` demonstrates the ability to add future policies (preemption-only, spillover), which can be injected at construction with no scheduler changes.
- Moving the per-request store watermark out of the general-purpose `RequestKVState` struct and into the policy that owns it.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
2775f46 to fec97b6
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
This pull request has merge conflicts that must be resolved before it can be
Closing as superseded by #42050
Purpose
This commit helps separate three concerns that were previously conflated in `OffloadingConnectorScheduler`: transfer lifecycle, store policy, and request teardown. It achieves this by:

- Adding `request_finished()` to `OffloadingManager` so the scheduler can ask the manager whether GPU blocks are safe to free, rather than maintaining that knowledge itself via an inline dict check.
- Extracting the hardcoded store-on-compute logic into a pluggable interface, `OffloadPolicy`. `StoreOnComputePolicy` demonstrates the ability to add future policies (preemption-only, spillover), which can be injected at construction with no scheduler changes.
- Moving the per-request store watermark out of the general-purpose `RequestKVState` struct and into the policy that owns it.
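The three points above can be pictured with a rough sketch of a pluggable policy. Only the class names `OffloadPolicy` and `StoreOnComputePolicy` come from this PR; the method names, signatures, `NeverStorePolicy`, and `PolicyDrivenScheduler` are hypothetical, and the sketch assumes every block the policy returns is accepted for storage.

```python
from abc import ABC, abstractmethod


class OffloadPolicy(ABC):
    """Sketch of a policy interface; method names are assumptions."""

    @abstractmethod
    def blocks_to_store(self, req_id: str, num_computed_blocks: int) -> range:
        """Return the block indices to offload for this request on this step."""

    def request_finished(self, req_id: str) -> None:
        """Drop any per-request state; the default keeps none."""


class StoreOnComputePolicy(OffloadPolicy):
    """Store each block once it is computed, tracking a per-request watermark."""

    def __init__(self) -> None:
        # The watermark lives in the policy, not in shared request state.
        self._next_stored_block_idx: dict[str, int] = {}

    def blocks_to_store(self, req_id: str, num_computed_blocks: int) -> range:
        start = self._next_stored_block_idx.get(req_id, 0)
        end = max(start, num_computed_blocks)
        # Assumption: all returned blocks are accepted, so advancing is safe here.
        self._next_stored_block_idx[req_id] = end
        return range(start, end)

    def request_finished(self, req_id: str) -> None:
        self._next_stored_block_idx.pop(req_id, None)


class NeverStorePolicy(OffloadPolicy):
    """Hypothetical alternative policy, used to show injection at construction."""

    def blocks_to_store(self, req_id: str, num_computed_blocks: int) -> range:
        return range(0)


class PolicyDrivenScheduler:
    """Hypothetical scheduler that only delegates to the injected policy."""

    def __init__(self, policy: OffloadPolicy) -> None:
        self.policy = policy

    def step(self, req_id: str, num_computed_blocks: int) -> list[int]:
        return list(self.policy.blocks_to_store(req_id, num_computed_blocks))


eager = PolicyDrivenScheduler(StoreOnComputePolicy())
step_one = eager.step("req-a", 2)
step_two = eager.step("req-a", 3)
never = PolicyDrivenScheduler(NeverStorePolicy()).step("req-a", 3)
```

Swapping `StoreOnComputePolicy` for another policy changes what gets stored without touching `PolicyDrivenScheduler`, which is the decoupling the description is aiming for.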
Partial #33689
Tasks:
Note:
This is based on PR #40538. This PR should only be merged after #40538.
#40538 merged
Test Plan
Test Result