[kv_offload]: Add request_finished method to OffloadingManager and decouple store policy #42050
hickeyma wants to merge 5 commits into vllm-project:main
Supersedes #40625
Code Review
This pull request introduces an OffloadPolicy abstraction to decouple KV block offloading logic from the scheduler and refactors state management into a new state.py file. It also adds a request_finished hook to both the OffloadingManager and OffloadPolicy for cleanup and deferred transfers. Review feedback suggests optimizing dictionary access in the hot path to reduce allocations, using try...finally blocks for robust cleanup to prevent memory leaks, and addressing inconsistencies in the request_finished API documentation regarding partial block flushing.
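The `try...finally` suggestion in the review can be illustrated with a minimal sketch. All class and attribute names below (`SchedulerSketch`, `request_states`, `FailingManager`, `PolicySketch`) are hypothetical stand-ins, not the code under review; the only point is that per-request state is released even when one of the `request_finished` hooks raises.

```python
class PolicySketch:
    """Hypothetical policy that keeps per-request progress."""

    def __init__(self):
        self.progress = {}

    def request_finished(self, req_id):
        # Drop any progress recorded for the finished request.
        self.progress.pop(req_id, None)


class FailingManager:
    """Hypothetical manager whose hook fails, to exercise the cleanup path."""

    def request_finished(self, req_id):
        raise RuntimeError("flush failed")


class SchedulerSketch:
    """Hypothetical scheduler owning a per-request state table."""

    def __init__(self, manager, policy):
        self.manager = manager
        self.policy = policy
        self.request_states = {}

    def request_finished(self, req_id):
        # Nested try/finally: each later cleanup step still runs if an
        # earlier hook raises, so no per-request entry is leaked.
        try:
            self.manager.request_finished(req_id)
        finally:
            try:
                self.policy.request_finished(req_id)
            finally:
                self.request_states.pop(req_id, None)
```

The exception from the manager still propagates to the caller; the nesting only guarantees that the policy hook and the state removal run first.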
Hi @hickeyma, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
…couple store policy

Adds a `request_finished` method to `OffloadingManager` so implementations can react when a request ends, e.g. to flush a deferred transfer for the last partial block. The method is a no-op default to keep existing subclasses compatible. `FilterReusedOffloadingManager` delegates it to its backing manager, and the connector scheduler calls it alongside the new policy hook.

The block-selection logic embedded in `_build_store_jobs` is pulled out into a new `OffloadPolicy` class. The existing behaviour becomes `StoreOnComputePolicy`, which owns the per-request, per-group progress index that previously lived as `next_stored_block_idx` on `RequestGroupState`. The renamed `RequestKVState` (was `RequestOffloadState`) now only tracks KV state (offload keys, block IDs, in-flight jobs).

The scheduler state types are moved to a new `state.py` so that `policy.py` can import them without creating a circular dependency back through `scheduler.py`.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
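As a rough illustration of the no-op default and the delegation described in this commit message, the following sketch uses a deliberately simplified base class. `ManagerSketch`, its single abstract `lookup` method, `LegacyManager`, `DeferredFlushManager`, and `DelegatingWrapper` are all hypothetical; the real `OffloadingManager` interface is larger and may differ in signatures.

```python
from abc import ABC, abstractmethod


class ManagerSketch(ABC):
    """Simplified, hypothetical stand-in for an offloading manager base class."""

    @abstractmethod
    def lookup(self, key: str) -> bool:
        ...

    def request_finished(self, req_id: str) -> None:
        # No-op by default, so subclasses written before the hook existed
        # keep working without changes.
        return None


class LegacyManager(ManagerSketch):
    """Existing subclass that never heard of request_finished."""

    def lookup(self, key: str) -> bool:
        return False


class DeferredFlushManager(ManagerSketch):
    """Overrides the hook to flush work deferred until the request ends."""

    def __init__(self):
        self.pending = {}   # req_id -> deferred partial-block payload
        self.flushed = []   # record of flushed (req_id, payload) pairs

    def lookup(self, key: str) -> bool:
        return False

    def request_finished(self, req_id: str) -> None:
        payload = self.pending.pop(req_id, None)
        if payload is not None:
            self.flushed.append((req_id, payload))


class DelegatingWrapper(ManagerSketch):
    """Wrapper that forwards the hook to its backing manager."""

    def __init__(self, backing: ManagerSketch):
        self.backing = backing

    def lookup(self, key: str) -> bool:
        return self.backing.lookup(key)

    def request_finished(self, req_id: str) -> None:
        self.backing.request_finished(req_id)
```

Keeping the default as a concrete no-op rather than an abstract method is what lets `LegacyManager` remain instantiable unchanged.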
05a0b68 to 53571c2
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comment: - vllm-project#42050 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Purpose
Adds a `request_finished` method to `OffloadingManager` so implementations can react when a request ends, e.g. to flush a deferred transfer for the last partial block. The method is a no-op default to keep existing subclasses compatible. `FilterReusedOffloadingManager` delegates it to its backing manager and the connector scheduler calls it alongside the new policy class.

The block-selection logic embedded in `_build_store_jobs` is pulled out into a new `OffloadPolicy` class. The existing behaviour becomes `StoreOnComputePolicy`, which owns the per-request, per-group progress index that previously lived as `next_stored_block_idx` on `RequestGroupState`. The renamed `RequestKVState` (was `RequestOffloadState`) now only tracks KV state (offload keys, block IDs, in-flight jobs).

The scheduler state types are moved to a new `state.py` so that `policy.py` can import them without creating a circular dependency back through `scheduler.py`.

Partial #33689
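The idea of a store-on-compute policy that owns a per-request, per-group progress index can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `PolicySketchBase`, `StoreOnComputeSketch`, the `select_blocks` method, and its parameters are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class PolicySketchBase(ABC):
    """Hypothetical policy interface: decides which blocks to store next."""

    @abstractmethod
    def select_blocks(self, req_id: str, group_idx: int, num_full_blocks: int) -> range:
        ...

    def request_finished(self, req_id: str) -> None:
        # Default: nothing to clean up.
        return None


class StoreOnComputeSketch(PolicySketchBase):
    """Stores every newly computed full block exactly once."""

    def __init__(self):
        # (req_id, group_idx) -> index of the next block not yet stored.
        # The policy owns this index; the request state does not.
        self._next_idx: dict[tuple[str, int], int] = {}

    def select_blocks(self, req_id: str, group_idx: int, num_full_blocks: int) -> range:
        key = (req_id, group_idx)
        start = self._next_idx.get(key, 0)   # single dict lookup
        end = max(start, num_full_blocks)    # never move backwards
        self._next_idx[key] = end
        return range(start, end)

    def request_finished(self, req_id: str) -> None:
        # Remove progress for every group of the finished request.
        for key in [k for k in self._next_idx if k[0] == req_id]:
            del self._next_idx[key]
```

A brief usage example: with three full blocks computed, the first call selects blocks 0 to 2; once five blocks are full, the next call selects only blocks 3 and 4.

```python
policy = StoreOnComputeSketch()
first = list(policy.select_blocks("req-1", 0, 3))
second = list(policy.select_blocks("req-1", 0, 5))
policy.request_finished("req-1")
```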
Tasks:
Test Plan
VLLM_LOG_STATS_INTERVAL=0.01 vllm bench throughput --model Qwen/Qwen3-14B --kv-offloading-size 10 --disable-hybrid-kv-cache-manager --num-prompts 1000 --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'
Test Result