[kv_offload+HMA][0/N]: Support block-level preemption handling #34805
orozery merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request refactors the handle_preemptions API to use KVConnectorMetadata, which is a good change for flexibility. The implementation looks mostly correct, propagating the change through the connector hierarchy and updating tests. However, I found a redundant call to handle_preemptions in gpu_model_runner.py, which should be removed to avoid potential side effects and keep the logic clean. Please see my detailed comment.
```python
if has_kv_transfer_group():
    kv_connector_metadata = scheduler_output.kv_connector_metadata
    assert kv_connector_metadata is not None
    get_kv_transfer_group().handle_preemptions(kv_connector_metadata)
```
The handle_preemptions method is called here, but it's also called within ActiveKVConnector.pre_forward, which is invoked later in this execute_model function (via set_forward_context or kv_connector_no_forward). This results in handle_preemptions being called twice in each step.
While the current implementations appear to be idempotent, this redundancy can be confusing and might lead to bugs if a future connector's handle_preemptions is not idempotent. To centralize the logic, this call should be removed, relying on the one inside ActiveKVConnector.pre_forward.
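To illustrate the idempotency point, here is a minimal sketch (class and method names are hypothetical, not vLLM's actual types) of a handler written so that a redundant second call in the same step is a harmless no-op:

```python
class SketchConnector:
    """Hypothetical connector illustrating an idempotent handle_preemptions.

    The first call in a step drains the pending set; a redundant second
    call (e.g. one from execute_model and one from pre_forward) finds
    nothing left to process.
    """

    def __init__(self) -> None:
        self._pending_preemptions: set[str] = set()
        self.aborted: list[str] = []

    def note_preempted(self, req_id: str) -> None:
        self._pending_preemptions.add(req_id)

    def handle_preemptions(self, metadata: dict) -> None:
        # Drain-and-process: swap out the pending set atomically so that
        # calling this twice per step processes each request exactly once.
        pending, self._pending_preemptions = self._pending_preemptions, set()
        for req_id in sorted(pending):
            self.aborted.append(req_id)

conn = SketchConnector()
conn.note_preempted("req-1")
conn.handle_preemptions({})
conn.handle_preemptions({})  # redundant call: no-op
print(conn.aborted)  # -> ['req-1']
```

A connector whose handler is not written this way would double-process preemptions if both call sites fired, which is exactly the risk the comment raises.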
AFAIK pre_forward is only called in model runner v2.
@orozery Does the code make this obvious or enforced? Is it possible that it could be called twice?
All connector functions have 2 call locations, one for each model runner.
For a specific run, only one model runner will be used (either v1 or v2), so it's not possible for these functions to be called twice.
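The point being made can be shown with a toy sketch (runner and connector names are hypothetical): each connector hook has one call site per runner, but a given process instantiates exactly one runner, so each hook still fires once per step.

```python
class Connector:
    """Toy connector that counts how often its hook is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def handle_preemptions(self, metadata: dict) -> None:
        self.calls += 1

class RunnerV1:
    def __init__(self, conn: Connector) -> None:
        self.conn = conn

    def execute_model(self, metadata: dict) -> None:
        self.conn.handle_preemptions(metadata)  # v1 call site

class RunnerV2:
    def __init__(self, conn: Connector) -> None:
        self.conn = conn

    def execute_model(self, metadata: dict) -> None:
        self.conn.handle_preemptions(metadata)  # v2 call site (via pre_forward)

conn = Connector()
runner = RunnerV1(conn)  # a deployment picks exactly one runner, never both
runner.execute_model({})
print(conn.calls)  # -> 1
```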
```python
)
if has_kv_transfer_group():
    kv_connector_metadata = scheduler_output.kv_connector_metadata
    assert kv_connector_metadata is not None
```
Should this assert be here? Previously, it checked that the IDs were available.
Right, previously handle_preemptions only got a set of preempted request IDs, whereas now it gets KVConnectorMetadata.
has_kv_transfer_group() guarantees that kv_connector_metadata is not None (since build_connector_metadata is called on each step), so that assert is fine.
BTW the same assert also exists in the forward pass (at KVConnectorModelRunnerMixin._get_kv_connector_output).
vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py
```diff
     self._register_handlers(kv_caches, attn_backends)

-    def handle_preemptions(self, preempted_req_ids: set[str]):
+    def handle_preemptions(self, kv_connector_metadata: OffloadingConnectorMetadata):
```
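A minimal sketch of what the signature change means for an implementation (the class and field names below are stand-ins, not the exact vLLM types; it assumes the per-step metadata object carries the preempted request IDs):

```python
from dataclasses import dataclass, field

@dataclass
class OffloadingConnectorMetadataSketch:
    # Hypothetical stand-in for OffloadingConnectorMetadata; assumes the
    # preempted request IDs now ride inside the per-step metadata object.
    preempted_req_ids: set[str] = field(default_factory=set)

class OffloadingConnectorSketch:
    def __init__(self) -> None:
        self.cancelled: list[str] = []

    # Old signature (pre-change), for comparison:
    #   def handle_preemptions(self, preempted_req_ids: set[str]) -> None: ...

    # New signature: the metadata object is passed instead of a bare set,
    # so future changes can attach arbitrary event payloads to it.
    def handle_preemptions(
        self, metadata: OffloadingConnectorMetadataSketch
    ) -> None:
        for req_id in sorted(metadata.preempted_req_ids):
            self.cancelled.append(req_id)

conn = OffloadingConnectorSketch()
conn.handle_preemptions(OffloadingConnectorMetadataSketch({"req-a", "req-b"}))
print(conn.cancelled)  # -> ['req-a', 'req-b']
```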
Should it be documented that this is a breaking API change?
This is a fresh API that I recently introduced. It's not used by any in-tree connector, and most likely not used at all.
I don't see any benefit from documenting it.
Force-pushed 43254e1 to dc73aae
This pull request has merge conflicts that must be resolved before it can be merged.
This commit changes the handle_preemptions connector API function to support handling of arbitrary events via KVConnectorMetadata. Specifically, this will allow handling of sliding-window layer blocks which can be evicted from the GPU KV cache while still being saved by a connector.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Force-pushed dc73aae to cb1180e
NickLucche left a comment
@orozery I am not super thrilled about breaking interface but iiuc:
- this callback is quite niche to offloading and also quite new
- the info we're now passing is a "super-set" of what we were passing previously
It's not exactly a super-set, but it could be a super-set, since …
…project#34805) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
…project#34805) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…project#34805) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…project#34805) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
…nge (#190)

## Summary
- vLLM [#34805](vllm-project/vllm#34805) changed `handle_preemptions` signature from `set[str]` to `KVConnectorMetadata`
- Add runtime type check in `__init__.py` to support both old and new vLLM versions
- Carry `preempted_req_ids` through `PegaConnectorMetadata` for new vLLM path

## Test plan
- [x] Deploy with new vLLM (v0.17.2rc1+) — verify preemption no longer crashes with `TypeError`
- [x] Deploy with old vLLM — verify preemption still works via `set[str]` path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
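The dual-version approach described in that summary can be sketched roughly as follows. This is a guess at how such a shim might look, not the actual downstream code; `PegaConnectorMetadataSketch` is a hypothetical stand-in for the `PegaConnectorMetadata` named above, and the `isinstance` dispatch is an assumption:

```python
class PegaConnectorMetadataSketch:
    """Hypothetical metadata carrier mirroring the PR's PegaConnectorMetadata."""

    def __init__(self, preempted_req_ids: set[str]) -> None:
        self.preempted_req_ids = preempted_req_ids

def extract_preempted_ids(arg) -> set[str]:
    # Old vLLM passes a bare set[str]; newer vLLM (v0.17.2rc1+, per the
    # summary) passes a KVConnectorMetadata-like object carrying the IDs.
    # Dispatching on the runtime type keeps one handler working with both.
    if isinstance(arg, set):
        return arg
    return set(arg.preempted_req_ids)

# Works for both call conventions:
print(extract_preempted_ids({"r1"}))                               # -> {'r1'}
print(extract_preempted_ids(PegaConnectorMetadataSketch({"r2"})))  # -> {'r2'}
```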
This PR changes the `handle_preemptions` connector API function to support handling of arbitrary events via `KVConnectorMetadata`. Specifically, this will allow handling of sliding-window layer blocks which can be evicted from the GPU KV cache while still being saved by a connector.