Skip to content

[WIP][kv_offload] Decouple store policy and request lifecycle from the scheduler#40625

Closed
hickeyma wants to merge 5 commits intovllm-project:mainfrom
hickeyma:add-sched-side-policy-extr
Closed

[WIP][kv_offload] Decouple store policy and request lifecycle from the scheduler#40625
hickeyma wants to merge 5 commits intovllm-project:mainfrom
hickeyma:add-sched-side-policy-extr

Conversation

@hickeyma
Copy link
Copy Markdown
Contributor

@hickeyma hickeyma commented Apr 22, 2026

Purpose

This commit helps separate three concerns that were previously conflated in OffloadingConnectorScheduler: transfer lifecycle, store policy, and request teardown.

It achieves this by:

  • Adding a request_finished() to OffloadingManager so the scheduler can ask the manager whether GPU blocks are safe to free, rather than maintaining that knowledge itself via an inline dict check.
  • Extracting the hardcoded store-on-compute logic into a pluggable interface OffloadPolicy.
    StoreOnComputePolicy deomonstrates the ability to add future policies
    (preemption-only, spillover) which can be injected at construction with no scheduler changes.
  • Moving the per-request store watermark out of the general-purpose RequestKVState struct and
    into the policy that owns it.

Partial #33689

Tasks:

  • Add request_finished method to OffloadingManager.
  • Create OffloadPolicy(ABC).get_blocks_to_store. Move existing _get_reqs_to_store to StoreOnComputePolicy(OffloadPolicy). Rename RequestOffloadState to RequestKVState, and move next_stored_block_idx to StoreOnComputePolicy.

Note: This is based on PR #40538. This PR should only be merged after #40538. #40538 merged

Test Plan

Test Result

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the KV offloading architecture by consolidating core abstractions into a new base module and introducing a policy-driven approach for storage decisions via the OffloadPolicy interface. Key feedback identifies a critical race condition in the scheduler's request teardown logic where the removal of in-flight transfer checks could lead to premature GPU memory freeing. Additionally, the StoreOnComputePolicy incorrectly advances the store watermark for blocks filtered by the manager, which may prevent eligible blocks from being offloaded in subsequent steps.


request_being_stored = req_id in self._reqs_being_stored
return request_being_stored, None
return self.manager.request_finished(req_id), None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The logic for checking in-flight transfers during request teardown has been removed. The previous implementation checked req_id in self._reqs_being_stored to ensure that GPU blocks are not freed while an asynchronous store operation is still reading from them. By delegating this entirely to self.manager.request_finished(req_id), and given that the current OffloadingManager implementations (like CPUOffloadingManager) do not track the scheduler's in-flight transfers, this creates a critical race condition where GPU memory could be freed and reused while a transfer is still in progress. The scheduler must maintain its own check for transfers it has initiated to prevent use-after-free or data corruption.

Suggested change
return self.manager.request_finished(req_id), None
return self.manager.request_finished(req_id) or req_id in self._reqs_being_stored, None

)
continue

self._next_stored_block_idx[req_id][0] = num_blocks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The store watermark is advanced even if no blocks were actually submitted for storage (e.g., if the manager filtered them out via FilterReusedOffloadingManager). This means that blocks skipped due to a reuse threshold will never be reconsidered for offloading in subsequent steps, even after they have been reused enough times to pass the filter. The watermark should only be advanced for blocks that the manager has accepted for storage, or the policy should track skipped blocks to retry them in future steps.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Code review from gemini-code-assist to avoid potential circular
dependency with `vllm.v1.kv_offload.cpu.manager` to add the manager
imports to the relevant methods.

vllm-project#40538 (comment)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
…eduler

This commit helps separate three concerns that were previously conflated in
`OffloadingConnectorScheduler`: transfer lifecycle, store policy, and request teardown.
It achievees this by:

- Adding a request_finished lifecycle hook to OffloadingManager so the scheduler
can ask the manager whether GPU blocks are safe to free, rather than maintaining
that knowledge itself via an inline dict check.

- Extracting the hardcoded store-on-compute logic into a pluggable interface `OffloadPolicy`.
`StoreOnComputePolicy` deomonstrates the ability to add future policies
(preemption-only, spillover) which can be injected at construction with no scheduler changes.
- Moving the per-request store watermark out of the general-purpose `RequestKVState` struct and
into the policy that owns it.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
@hickeyma hickeyma force-pushed the add-sched-side-policy-extr branch from 2775f46 to fec97b6 Compare April 30, 2026 05:35
@hickeyma hickeyma marked this pull request as draft April 30, 2026 05:35
@hickeyma hickeyma changed the title [kv_offload] Decouple store policy and request lifecycle from the scheduler [WIP][kv_offload] Decouple store policy and request lifecycle from the scheduler Apr 30, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hickeyma.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@hickeyma
Copy link
Copy Markdown
Contributor Author

hickeyma commented May 8, 2026

Closing as superseded by #42050

@hickeyma hickeyma closed this May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant