Skip to content

[kv_offload]: Add request_finished method to OffloadingManager and decouple store policy#42050

Open
hickeyma wants to merge 5 commits intovllm-project:mainfrom
hickeyma:decouple-store-policy
Open

[kv_offload]: Add request_finished method to OffloadingManager and decouple store policy#42050
hickeyma wants to merge 5 commits intovllm-project:mainfrom
hickeyma:decouple-store-policy

Conversation

@hickeyma
Copy link
Copy Markdown
Contributor

@hickeyma hickeyma commented May 8, 2026

Purpose

Adds a request_finished method to OffloadingManager so implementations can react when a request ends, e.g. to flush a deferred transfer for the last partial block. The method is a no-op default to keep existing subclasses compatible. FilterReusedOffloadingManager delegates it to its backing manager and the connector scheduler calls it alongside the new policy class.

The block-selection logic embedded in _build_store_jobs is pulled out into a new OffloadPolicy class. The existing behaviour becomes StoreOnComputePolicy, which owns the per-request, per-group progress index that previously lived as next_stored_block_idx on RequestGroupState. The renamed RequestKVState (was RequestOffloadState) now only tracks KV state (offload keys, block IDs, in-flight jobs).

The scheduler state types are moved to a new state.py so that policy.py can import them without creating a circular dependency back through scheduler.py.

Partial #33689

Tasks:

  • Add request_finished method to OffloadingManager.
  • Create OffloadPolicy(ABC).get_blocks_to_store. Move existing _get_reqs_to_store to StoreOnComputePolicy(OffloadPolicy). Rename RequestOffloadState to RequestKVState, and move next_stored_block_idx to StoreOnComputePolicy.

Test Plan

VLLM_LOG_STATS_INTERVAL=0.01 vllm bench throughput --model Qwen/Qwen3-14B --kv-offloading-size 10 --disable-hybrid-kv-cache-manager --num-prompts 1000 --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'

Test Result

[...]
Throughput: 3.43 requests/s, 3952.48 total tokens/s, 439.16 output tokens/s
Total num prompt tokens:  1024000
Total num output tokens:  128000

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@hickeyma
Copy link
Copy Markdown
Contributor Author

hickeyma commented May 8, 2026

Supersedes #40625

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an OffloadPolicy abstraction to decouple KV block offloading logic from the scheduler and refactors state management into a new state.py file. It also adds a request_finished hook to both the OffloadingManager and OffloadPolicy for cleanup and deferred transfers. Review feedback suggests optimizing dictionary access in the hot path to reduce allocations, using try...finally blocks for robust cleanup to prevent memory leaks, and addressing inconsistencies in the request_finished API documentation regarding partial block flushing.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/offloading/policy.py Outdated
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/offloading/policy.py Outdated
Comment thread vllm/v1/kv_offload/base.py
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 8, 2026

Hi @hickeyma, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…couple store policy

Adds a `request_finished` method to `OffloadingManager` so implementations can
react when a request ends, e.g. to flush a deferred transfer for the last partial
block. The method is a no-op default to keep existing subclasses compatible.
`FilterReusedOffloadingManager` delegates it to its backing manager, and the
connector scheduler calls it alongside the new policy hook.
The block-selection logic embedded in `_build_store_jobs` is pulled out
into a new `OffloadPolicy` class. The existing behaviour becomes
`StoreOnComputePolicy`, which owns the per-request, per-group progress
index that previously lived as `next_stored_block_idx` on `RequestGroupState`.
The renamed `RequestKVState` (was `RequestOffloadState`) now only tracks
KV state — offload keys, block IDs, in-flight jobs.

The scheduler state types are moved to a new `state.py` so that
`policy.py` can import them without creating a circular dependency back
through `scheduler.py`.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
@hickeyma hickeyma force-pushed the decouple-store-policy branch from 05a0b68 to 53571c2 Compare May 8, 2026 10:37
hickeyma and others added 4 commits May 8, 2026 11:59
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comment:

- vllm-project#42050 (comment)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant