
[kv_offload+HMA][0/N]: Support block-level preemption handling#34805

Merged
orozery merged 3 commits into vllm-project:main from orozery:kv-connector-preemptions
Mar 18, 2026
Conversation

@orozery (Collaborator) commented Feb 18, 2026

This PR changes the handle_preemptions connector API function to support handling of arbitrary events via KVConnectorMetadata.
Specifically, this will allow handling of sliding-window layer blocks which can be evicted from the GPU KV cache while still being saved by a connector.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors the handle_preemptions API to use KVConnectorMetadata, which is a good change for flexibility. The implementation looks mostly correct, propagating the change through the connector hierarchy and updating tests. However, I found a redundant call to handle_preemptions in gpu_model_runner.py, which should be removed to avoid potential side effects and keep the logic clean. Please see my detailed comment.

Comment on lines +3334 to +3337
if has_kv_transfer_group():
kv_connector_metadata = scheduler_output.kv_connector_metadata
assert kv_connector_metadata is not None
get_kv_transfer_group().handle_preemptions(kv_connector_metadata)
@gemini-code-assist bot (Contributor), severity: high

The handle_preemptions method is called here, but it's also called within ActiveKVConnector.pre_forward, which is invoked later in this execute_model function (via set_forward_context or kv_connector_no_forward). This results in handle_preemptions being called twice in each step.

While the current implementations appear to be idempotent, this redundancy can be confusing and might lead to bugs if a future connector's handle_preemptions is not idempotent. To centralize the logic, this call should be removed, relying on the one inside ActiveKVConnector.pre_forward.

@orozery (Collaborator, Author) commented Feb 18, 2026


AFAIK pre_forward is only called in model runner v2.

Contributor:

@orozery Does the code make this obvious or enforced? Is it possible that it could be called twice?

@orozery (Collaborator, Author):

All connector functions have two call locations, one for each model runner.
For a specific run, only one model runner is used (either v1 or v2), so it's not possible for the functions to be called twice.

@hickeyma (Contributor) left a comment

This looks good, thanks @orozery. Some comments inline to address.

)
if has_kv_transfer_group():
kv_connector_metadata = scheduler_output.kv_connector_metadata
assert kv_connector_metadata is not None
Contributor:

Should this assert be here? Previously, the code checked whether the IDs were available.

@orozery (Collaborator, Author):

Right, previously handle_preemptions only got a set of preempted request IDs, whereas now it gets KVConnectorMetadata.
has_kv_transfer_group() guarantees that kv_connector_metadata is not None (since build_connector_metadata is called on each step), so the assert is fine.
BTW the same assert also exists in the forward pass (at KVConnectorModelRunnerMixin._get_kv_connector_output).

self._register_handlers(kv_caches, attn_backends)

-    def handle_preemptions(self, preempted_req_ids: set[str]):
+    def handle_preemptions(self, kv_connector_metadata: OffloadingConnectorMetadata):
Contributor:

Should it be documented that this is a breaking API change?

@orozery (Collaborator, Author):

This is a fresh API that I recently introduced. It's not used by any in-tree connector, and most likely not used at all.
I don't see any benefit in documenting it.

@orozery force-pushed the kv-connector-preemptions branch from 43254e1 to dc73aae on February 22, 2026 at 07:42
@hickeyma (Contributor) left a comment

LGTM, thanks @orozery

@orozery changed the title from "[KVConnector]: Support block-level preemption handling" to "[kv_offload+HMA][0/N]: Support block-level preemption handling" on Mar 10, 2026
@mergify bot commented Mar 10, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 10, 2026
This commit changes the handle_preemptions connector API function to
support handling of arbitrary events via KVConnectorMetadata.
Specifically, this will allow handling of sliding-window layer blocks which
can be evicted from the GPU KV cache while still being saved by a connector.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery force-pushed the kv-connector-preemptions branch from dc73aae to cb1180e on March 10, 2026 at 11:27
@mergify mergify bot removed the needs-rebase label Mar 10, 2026
@NickLucche (Collaborator) left a comment

@orozery I am not super thrilled about breaking the interface, but IIUC:

  • this callback is quite niche to offloading and also quite new
  • the info we're now passing is a "super-set" of what we were passing previously

@orozery (Collaborator, Author) commented Mar 17, 2026

  • the info we're now passing is a "super-set" of what we were passing previously

It's not exactly a super-set, but it could be one, since KVConnectorMetadata is built from SchedulerOutput, which also includes the previously passed preempted_req_ids.
So in any case, any implementation using the old API can surely be converted to the new API.
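A minimal sketch of such a conversion, assuming the connector copies the scheduler's preempted request IDs into its per-step metadata (class names and the `inflight` bookkeeping are illustrative, not vLLM's actual code):

```python
from dataclasses import dataclass, field


@dataclass
class KVConnectorMetadata:
    # Illustrative: built from SchedulerOutput, which also carries
    # the preempted request IDs that the old API passed directly.
    preempted_req_ids: set[str] = field(default_factory=set)


class OffloadingConnector:
    def __init__(self) -> None:
        # Hypothetical per-request state: req_id -> number of in-flight blocks.
        self.inflight: dict[str, int] = {"req-1": 3, "req-2": 5}

    # Old-API body, now driven by the metadata object instead of a bare set.
    def handle_preemptions(self, metadata: KVConnectorMetadata) -> None:
        for req_id in metadata.preempted_req_ids:
            self.inflight.pop(req_id, None)  # drop state for preempted requests


conn = OffloadingConnector()
conn.handle_preemptions(KVConnectorMetadata(preempted_req_ids={"req-1"}))
print(sorted(conn.inflight))  # ['req-2']
```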

@orozery added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 17, 2026
@orozery orozery merged commit fcf0687 into vllm-project:main Mar 18, 2026
62 checks passed
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…project#34805)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Similar commits referencing this pull request, each carrying the same sign-offs, were later pushed by fxdawnn (Mar 19), SouthWest7, khairulkabir1661, and Monishver11 (Mar 27), JiantaoXu (Mar 28), vrdn-23 (Mar 30), and EricccYang (Apr 1, 2026).
xiaguan added a commit to novitalabs/pegaflow that referenced this pull request Apr 3, 2026
…nge (#190)

## Summary

- vLLM [#34805](vllm-project/vllm#34805) changed
`handle_preemptions` signature from `set[str]` to `KVConnectorMetadata`
- Add runtime type check in `__init__.py` to support both old and new
vLLM versions
- Carry `preempted_req_ids` through `PegaConnectorMetadata` for new vLLM
path

## Test plan

- [x] Deploy with new vLLM (v0.17.2rc1+) — verify preemption no longer
crashes with `TypeError`
- [x] Deploy with old vLLM — verify preemption still works via
`set[str]` path
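The dual-path dispatch described above could look roughly like this (everything except the `handle_preemptions` name and the `preempted_req_ids` field is a placeholder, not pegaflow's actual code):

```python
class FakeMetadata:
    """Placeholder for the new KVConnectorMetadata-style argument."""

    def __init__(self, preempted_req_ids: set[str]) -> None:
        self.preempted_req_ids = preempted_req_ids


class CompatConnector:
    def __init__(self) -> None:
        self.preempted: set[str] = set()

    def handle_preemptions(self, arg) -> None:
        # Runtime type check so the same connector works against both
        # old vLLM (bare set[str]) and new vLLM (metadata object).
        if isinstance(arg, set):
            req_ids = arg                    # old signature: set[str]
        else:
            req_ids = arg.preempted_req_ids  # new signature: metadata field
        self.preempted.update(req_ids)


conn = CompatConnector()
conn.handle_preemptions({"a"})                # old vLLM path
conn.handle_preemptions(FakeMetadata({"b"}))  # new vLLM path
print(sorted(conn.preempted))  # ['a', 'b']
```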

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1
