[KVConnector] OffloadingConnector: Fix bug in handling of preemptions (#29870)
Conversation
Code Review
This pull request introduces a fix for handling preempted requests in the OffloadingConnector. The changes ensure that when a request is preempted, any ongoing or pending KV cache store operations for that request are correctly failed. This is achieved by clearing the relevant state for the preempted request and explicitly calling complete_store with success=False. The logic appears sound and correctly addresses the bug, preventing potential issues with stale or incomplete offloaded KV cache data. The changes are well-contained and follow the existing code structure.
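As a rough illustration of that flow (the class, attribute, and method names below are assumptions made for the sketch, not vLLM's actual API): on preemption, every pending store for the request is popped from the tracked state and explicitly failed via complete_store(success=False).

```python
# Illustrative sketch only: OffloadingConnectorScheduler here is a toy
# stand-in, and _pending_stores / complete_store are assumed names.
class OffloadingConnectorScheduler:
    def __init__(self) -> None:
        # req_id -> block hashes with store operations still in flight
        self._pending_stores: dict[str, list[int]] = {}
        # records (block_hash, success) for each completed store
        self.completed: list[tuple[int, bool]] = []

    def handle_preemption(self, req_id: str) -> None:
        # Clear the preempted request's state and fail any ongoing or
        # pending store, so no stale/incomplete data is recorded.
        for block_hash in self._pending_stores.pop(req_id, []):
            self.complete_store(block_hash, success=False)

    def complete_store(self, block_hash: int, success: bool) -> None:
        self.completed.append((block_hash, success))
```

Failing (rather than silently dropping) the stores matters because downstream bookkeeping that waits on store completion would otherwise hang or treat the offloaded blocks as valid.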
Do we need to cherry pick any of this back to RHAIIS releases?
I think it's a good idea, though I don't think it's a common bug to hit.
@njhill can you review this one?
Sure, I will add one next week.
Unfortunately our scheduler-side connector only gets informed that blocks were offloaded GPU->CPU when the request is done (using …
@njhill I changed this PR to flush pending writes (GPU->CPU) for preempted requests instead of discarding them.
@@ -461,6 +472,11 @@ def register_cross_layers_kv_cache(
        self._register_handlers(kv_caches, attn_backends)

    def start_load_kv(self, metadata: OffloadingConnectorMetadata):
        for req_id in metadata.reqs_to_flush or ():
@orozery could we also add resumed requests to the metadata, and trigger the waiting here based on those instead?
Suggested change:
-        for req_id in metadata.reqs_to_flush or ():
+        for req_id in metadata.resumed_reqs or ():
This should be preferable perf-wise I think?
We need to flush requests at the moment they are preempted.
This is what we do here, setting reqs_to_flush to the list of preempted requests.
Resumed requests are handled transparently by the scheduler calling get_num_new_matched_tokens and starting an async load (cpu->gpu) of the resumed requests.
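A hedged sketch of the split just described: preempted requests go into the metadata's flush list, while resumed requests need no entry there because the scheduler re-triggers the CPU->GPU load via get_num_new_matched_tokens. OffloadingConnectorMetadata and reqs_to_flush appear in the diff above, but the field definition and helper below are simplified assumptions, not the real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class OffloadingConnectorMetadata:
    # Requests whose pending GPU->CPU writes must complete before their
    # GPU blocks may be reused (i.e. requests preempted this step).
    reqs_to_flush: set[str] = field(default_factory=set)


def build_connector_meta(
    preempted_req_ids: set[str],
) -> OffloadingConnectorMetadata:
    # Only preempted requests are flushed here; resumed requests are
    # covered by the scheduler's normal async CPU->GPU load path.
    return OffloadingConnectorMetadata(reqs_to_flush=set(preempted_req_ids))
```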
This pull request has merge conflicts that must be resolved before it can be merged.
@njhill I got a report from a user suggesting that the fix is not sufficient.
Should be good now. User confirmed preempted requests are successfully re-loaded from CPU (assuming #31583 as well).
This pull request has merge conflicts that must be resolved before it can be merged.
This commit fixes the OffloadingConnector to flush preempted requests to the offloading backend. Without flushing, the GPU KV data may be overwritten before the offloading completes, which could yield KV data corruption. Additionally, this fix allows re-scheduled preempted requests to load KV data back from the offloading backend.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
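A minimal, hypothetical sketch of the ordering the commit message describes (Runner and MockConnector are illustrative names, not vLLM's actual classes): pending GPU->CPU offloads for a preempted request must be flushed before its GPU blocks are reused, otherwise the offload can read already-overwritten data.

```python
class MockConnector:
    """Records flush calls; stands in for the offloading connector."""

    def __init__(self) -> None:
        self.flushed: list[str] = []

    def flush(self, req_id: str) -> None:
        # In the real connector this blocks until pending GPU->CPU
        # offloads for req_id have completed.
        self.flushed.append(req_id)


class Runner:
    def __init__(self, connector: MockConnector) -> None:
        self.connector = connector
        self.freed: list[str] = []

    def preempt(self, req_id: str) -> None:
        # Flush in-flight offloads first...
        self.connector.flush(req_id)
        # ...only then is it safe to reuse the request's GPU blocks.
        # Reversing these two steps is exactly the corruption the fix avoids.
        self.freed.append(req_id)
```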
This PR fixes the OffloadingConnector to fail any stores for a preempted request, and additionally allows storing the request's re-computed blocks.
Note
Ensures safe handling of preempted requests and completes/flushes in-flight KV offload transfers.
- Adds handle_preemptions() to the KV connector base; the OffloadingConnector worker now waits on pending store job IDs for preempted requests, and the scheduler completes stores for preempted IDs in build_connector_meta()
- Invokes handle_preemptions() from gpu_model_runner before blocks are overwritten
- Extends OffloadingHandler/OffloadingWorker with wait(job_ids); the CPU↔GPU handler tracks CUDA events per job and implements a blocking wait()
- Adds test_request_preemption validating flushed block indexes and re-load/store behavior; minor test harness updates

Written by Cursor Bugbot for commit 3f960add3709a8ba8bbf258c9039e58ce19eb698. This will update automatically on new commits.
Note
Fixes offload preemption handling and adds blocking wait support across KV transfer.
- Adds handle_preemptions() to the KV connector base; invoked from gpu_model_runner before blocks are overwritten
- In OffloadingConnectorScheduler.build_connector_meta(), marks in-flight stores complete for preempted_req_ids
- The worker wait()s on the corresponding job IDs for preempted requests
- Extends OffloadingHandler/OffloadingWorker with wait(job_ids); the CPU↔GPU handler tracks CUDA events per job and implements a blocking wait()
- Adds test_request_preemption; minor runner/test harness updates

Written by Cursor Bugbot for commit 5aa68e4. This will update automatically on new commits.
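The blocking wait(job_ids) mentioned in the summaries can be sketched as follows. This is a hedged stand-in: per the summaries the real handler tracks CUDA events per job, while here threading.Event substitutes so the sketch runs without a GPU, and all class and method names are illustrative.

```python
import threading


class OffloadingWorker:
    """Toy worker: one completion event per submitted offload job."""

    def __init__(self) -> None:
        # job_id -> completion event (the real code tracks CUDA events)
        self._job_events: dict[int, threading.Event] = {}

    def submit(self, job_id: int) -> threading.Event:
        # Register a job and hand back its completion event; the producer
        # (or a CUDA callback, in the real system) signals it when done.
        ev = threading.Event()
        self._job_events[job_id] = ev
        return ev

    def wait(self, job_ids: list[int]) -> None:
        # Block until every listed offload job has completed, so the
        # preempted request's GPU blocks are safe to overwrite afterwards.
        for job_id in job_ids:
            ev = self._job_events.pop(job_id, None)
            if ev is not None:
                ev.wait()
```

A CUDA-events version would call event.synchronize() per job instead of Event.wait(); the control flow is the same.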
handle_preemptions()to KV connector base; invoke fromgpu_model_runnerbefore blocks are overwrittenOffloadingConnectorScheduler.build_connector_meta(), mark in-flight stores complete forpreempted_req_idswait()s on their job IDs for preempted requestsOffloadingHandler/OffloadingWorkerwithwait(job_ids); CPU↔GPU handler tracks CUDA events per job and implements blockingwait()test_request_preemption; minor runner/test harness updatesWritten by Cursor Bugbot for commit 5aa68e4. This will update automatically on new commits. Configure here.