
[KVConnector] OffloadingConnector: Fix bug in handling of preemptions#29870

Merged
njhill merged 1 commit into vllm-project:main from orozery:offloading-connector-preemptions
Jan 11, 2026

Conversation

@orozery
Collaborator

@orozery orozery commented Dec 2, 2025

This PR fixes the OffloadingConnector to fail any in-flight stores for a preempted request, and to allow storing the request's re-computed blocks once it is rescheduled.



Note

Fixes offload preemption handling and adds blocking wait support across KV transfer.

  • Add handle_preemptions() to KV connector base; invoke from gpu_model_runner before blocks are overwritten
  • Scheduler side: in OffloadingConnectorScheduler.build_connector_meta(), mark in-flight stores complete for preempted_req_ids
  • Worker side: OffloadingConnector now submits deferred store jobs and wait()s on their job IDs for preempted requests
  • Extend OffloadingHandler/OffloadingWorker with wait(job_ids); CPU↔GPU handler tracks CUDA events per job and implements blocking wait()
  • Tests: refactor mocks to track job/spec state and flushed jobs; add test_request_preemption; minor runner/test harness updates

Written by Cursor Bugbot for commit 5aa68e4.
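The per-job completion tracking in the summary above can be sketched roughly as follows. This is an illustrative stand-in, not vLLM's actual `OffloadingHandler`: a `threading.Event` takes the place of the CUDA event the real handler tracks per job, and the method names are assumptions for the sketch.

```python
import threading


class SketchOffloadingHandler:
    """Illustrative sketch (not vLLM's actual class): one completion
    event is tracked per submitted store job, and wait(job_ids) blocks
    until every listed job has finished."""

    def __init__(self) -> None:
        self._events: dict[int, threading.Event] = {}
        self._next_job_id = 0

    def submit_store(self) -> int:
        # Register a new GPU->CPU store job and return its ID.
        job_id = self._next_job_id
        self._next_job_id += 1
        self._events[job_id] = threading.Event()
        return job_id

    def complete(self, job_id: int) -> None:
        # Called when the async copy for this job finishes
        # (the real handler observes a CUDA event instead).
        self._events[job_id].set()

    def wait(self, job_ids: list[int]) -> None:
        # Blocking wait used on preemption: do not let the request's
        # GPU blocks be reused until their offload has landed on CPU.
        for job_id in job_ids:
            self._events.pop(job_id).wait()
```

The key design point mirrored here is that `wait()` also drops the per-job state, so a flushed job cannot be waited on twice.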

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for handling preempted requests in the OffloadingConnector. The changes ensure that when a request is preempted, any ongoing or pending KV cache store operations for that request are correctly failed. This is achieved by clearing the relevant state for the preempted request and explicitly calling complete_store with success=False. The logic appears sound and correctly addresses the bug, preventing potential issues with stale or incomplete offloaded KV cache data. The changes are well-contained and follow the existing code structure.

@robertgshaw2-redhat
Collaborator

Do we need to cherry pick any of this back to RHAIIS releases?

@orozery
Collaborator Author

orozery commented Dec 2, 2025

Do we need to cherry pick any of this back to RHAIIS releases?

I think it's a good idea, though I don't think it's a common bug to hit.
A much more important bugfix is #28951.
#27743 is also critical, as it speeds up CPU offloading by an order of magnitude, which in turn helps reduce pressure on GPU KV cache capacity.

@robertgshaw2-redhat
Collaborator

@njhill can you review this one?

@robertgshaw2-redhat changed the title from "OffloadingConnector: Fix bug in handling of preemptions" to "[KVConnector] OffloadingConnector: Fix bug in handling of preemptions" on Dec 4, 2025
Member

@njhill njhill left a comment


Thanks @orozery, looks good to me. Is there a reasonable way this could be covered in a test?

Slightly-related: Do we load from the connector when resuming/recomputing preempted requests?

@orozery
Collaborator Author

orozery commented Dec 4, 2025

Thanks @orozery, looks good to me. Is there a reasonable way this could be covered in a test?

Sure, I will add one next week.

Slightly-related: Do we load from the connector when resuming/recomputing preempted requests?

Unfortunately, our scheduler-side connector is only informed that blocks were offloaded GPU->CPU when the request is done (via KVConnectorOutput.finished_sending).
I am planning to add KVConnectorWorkerMeta to KVConnectorOutput, which should enable us to handle this.

@orozery force-pushed the offloading-connector-preemptions branch from 0a01740 to be397c1 on December 10, 2025 15:07
@mergify bot added the v1 label Dec 10, 2025
@orozery force-pushed the offloading-connector-preemptions branch from be397c1 to ee01057 on December 10, 2025 15:24
@orozery
Collaborator Author

orozery commented Dec 10, 2025

@njhill I changed this PR to flush pending writes (GPU->CPU) for preempted requests instead of discarding them.
This now allows the requests to be re-loaded from CPU when they are re-scheduled, instead of re-computed.
I also added a unit test.

Member

@njhill njhill left a comment


Thanks @orozery, sorry for the delay, looks great to me.

@@ -461,6 +472,11 @@ def register_cross_layers_kv_cache(
self._register_handlers(kv_caches, attn_backends)

def start_load_kv(self, metadata: OffloadingConnectorMetadata):
for req_id in metadata.reqs_to_flush or ():
Member


@orozery could we also add resumed requests to the metadata, and trigger the waiting here based on those instead?

Suggested change:
-    for req_id in metadata.reqs_to_flush or ():
+    for req_id in metadata.resumed_reqs or ():

This should be preferable perf-wise I think?

Collaborator Author


We need to flush requests at the moment they are preempted, which is what we do here: setting reqs_to_flush to the list of preempted requests.
Resumed requests are handled transparently by the scheduler, which calls get_num_new_matched_tokens and starts an async load (CPU->GPU) of the resumed requests.
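The scheduler-side half of this split can be sketched as follows. The class and method names here (SketchScheduler, record_store) and the minimal metadata shape are assumptions for illustration, not vLLM's actual OffloadingConnectorScheduler, which does considerably more.

```python
from dataclasses import dataclass, field


@dataclass
class SketchConnectorMetadata:
    # Requests whose pending GPU->CPU stores must be flushed now,
    # because their GPU blocks are about to be reused.
    reqs_to_flush: set[str] = field(default_factory=set)


class SketchScheduler:
    """Illustrative only: tracks in-flight store jobs per request and,
    on preemption, tells the worker which requests to flush."""

    def __init__(self) -> None:
        self._inflight_stores: dict[str, list[int]] = {}

    def record_store(self, req_id: str, job_id: int) -> None:
        self._inflight_stores.setdefault(req_id, []).append(job_id)

    def build_connector_meta(
        self, preempted_req_ids: set[str]
    ) -> SketchConnectorMetadata:
        meta = SketchConnectorMetadata()
        for req_id in preempted_req_ids:
            # Only requests with in-flight stores need flushing; the
            # worker will block on their job IDs before the blocks are
            # overwritten.
            if req_id in self._inflight_stores:
                meta.reqs_to_flush.add(req_id)
        return meta
```

Resumed requests need no entry in this metadata under this design: the normal get_num_new_matched_tokens path triggers the CPU->GPU load.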

@orozery force-pushed the offloading-connector-preemptions branch from 8521295 to 8e52694 on December 13, 2025 18:30
@mergify

mergify bot commented Dec 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 15, 2025
@orozery force-pushed the offloading-connector-preemptions branch from 8e52694 to f4e491f on December 15, 2025 04:46
@mergify bot removed the needs-rebase label Dec 15, 2025
@orozery force-pushed the offloading-connector-preemptions branch 2 times, most recently from 8af44a3 to 0ff0cdc on December 29, 2025 17:45
@orozery
Collaborator Author

orozery commented Dec 29, 2025

Thanks @orozery, sorry for the delay, looks great to me.

@njhill I got a report from a user suggesting that the fix is not sufficient.
I dug into the execute_model function in gpu_model_runner.py, and from what I understand, blocks that belong to preempted requests can get overwritten in _prepare_inputs, before we get to flush them to CPU (our flush was called in start_load_kv).
To overcome this, I introduced a new handle_preemptions worker-side connector API that is called at the beginning of execute_model, before _prepare_inputs comes into play.
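The ordering constraint can be sketched as follows. These are illustrative toy classes, not the real gpu_model_runner or connector; the point is only that the flush hook must run before anything that may reuse the preempted requests' GPU blocks.

```python
class SketchConnector:
    """Illustrative worker-side connector: records when preempted
    requests' pending stores are flushed."""

    def __init__(self, trace: list) -> None:
        self.trace = trace

    def handle_preemptions(self, req_ids: set) -> None:
        # In the real connector this blocks on the pending GPU->CPU
        # store jobs of the preempted requests.
        self.trace.append(("flush", tuple(sorted(req_ids))))


class SketchModelRunner:
    """Illustrative runner: handle_preemptions must precede
    _prepare_inputs, which may reassign GPU KV blocks."""

    def __init__(self, connector: SketchConnector) -> None:
        self.connector = connector
        self.trace = connector.trace

    def _prepare_inputs(self) -> None:
        # May overwrite GPU blocks previously owned by preempted
        # requests.
        self.trace.append(("overwrite_blocks",))

    def execute_model(self, preempted_req_ids: set) -> None:
        # New hook from this PR: flush before any blocks can be reused.
        self.connector.handle_preemptions(preempted_req_ids)
        self._prepare_inputs()
```

Had the flush stayed in start_load_kv (which runs after input preparation in this sketch's framing), the "overwrite_blocks" step could race ahead of the flush, which is exactly the corruption the user report pointed at.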

@orozery force-pushed the offloading-connector-preemptions branch from 0ff0cdc to 2d3b60c on December 30, 2025 19:56
@orozery
Collaborator Author

orozery commented Jan 1, 2026

Should be good now. The user confirmed that preempted requests are successfully re-loaded from CPU (assuming #31583 as well).

Member

@njhill njhill left a comment


Thanks @orozery, looks great!

Just had one tiny suggestion, since you'll need to rebase anyhow.

Sorry for taking so long to get back to these.

Would be good for @ApostaC to check it too but don't need to hold it up for that.

@orozery force-pushed the offloading-connector-preemptions branch from 2d3b60c to 77bab86 on January 10, 2026 16:32
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 10, 2026
@njhill njhill enabled auto-merge (squash) January 10, 2026 19:30
@njhill njhill added this to the v0.14.0 milestone Jan 10, 2026
@mergify

mergify bot commented Jan 10, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 10, 2026
@njhill njhill disabled auto-merge January 11, 2026 03:26
@njhill njhill enabled auto-merge (squash) January 11, 2026 03:26
@mergify mergify bot removed the needs-rebase label Jan 11, 2026
This commit fixes the OffloadingConnector to flush preempted requests to the offloading backend.
Without flushing, the GPU KV data may be overwritten before the offloading completes,
which could corrupt the offloaded KV data.
Additionally, this fix allows re-scheduled preempted requests to load KV data back from the
offloading backend.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
auto-merge was automatically disabled January 11, 2026 04:39

Head branch was pushed to by a user without write access

@orozery force-pushed the offloading-connector-preemptions branch from 3f960ad to 5aa68e4 on January 11, 2026 04:39
@njhill njhill enabled auto-merge (squash) January 11, 2026 06:34
@njhill njhill merged commit 4c16ba6 into vllm-project:main Jan 11, 2026
55 checks passed
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1
