[KVConnector] OffloadingConnector: Fix bug in handling of preemptions (#29870)
Conversation
Code Review
This pull request introduces a fix for handling preempted requests in the OffloadingConnector. The changes ensure that when a request is preempted, any ongoing or pending KV cache store operations for that request are correctly failed. This is achieved by clearing the relevant state for the preempted request and explicitly calling complete_store with success=False. The logic appears sound and correctly addresses the bug, preventing potential issues with stale or incomplete offloaded KV cache data. The changes are well-contained and follow the existing code structure.
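As a rough illustration of that flow (the class, attribute, and method names below are assumptions made for the sketch, not vLLM's actual API): on preemption, every pending store for the request is popped from the tracked state and explicitly failed via complete_store(success=False).

```python
# Illustrative sketch only: OffloadingConnectorScheduler here is a toy
# stand-in, and _pending_stores / complete_store are assumed names.
class OffloadingConnectorScheduler:
    def __init__(self) -> None:
        # req_id -> block hashes with store operations still in flight
        self._pending_stores: dict[str, list[int]] = {}
        # records (block_hash, success) for each completed store
        self.completed: list[tuple[int, bool]] = []

    def handle_preemption(self, req_id: str) -> None:
        # Clear the preempted request's state and fail any ongoing or
        # pending store, so no stale/incomplete data is recorded.
        for block_hash in self._pending_stores.pop(req_id, []):
            self.complete_store(block_hash, success=False)

    def complete_store(self, block_hash: int, success: bool) -> None:
        self.completed.append((block_hash, success))
```

Failing (rather than silently dropping) the stores matters because downstream bookkeeping that waits on store completion would otherwise hang or treat the offloaded blocks as valid.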
Do we need to cherry pick any of this back to RHAIIS releases?
I think it's a good idea, though I don't think it's a common bug to hit.
@njhill can you review this one?
Sure, I will add one next week.
Unfortunately our scheduler-side connector only gets informed that blocks were offloaded GPU->CPU when the request is done (using …
@njhill I changed this PR to flush pending writes (GPU->CPU) for preempted requests instead of discarding them.
@@ -461,6 +472,11 @@ def register_cross_layers_kv_cache(
        self._register_handlers(kv_caches, attn_backends)

    def start_load_kv(self, metadata: OffloadingConnectorMetadata):
        for req_id in metadata.reqs_to_flush or ():
@orozery could we also add resumed requests to the metadata, and trigger the waiting here based on those instead?
Suggested change:
-        for req_id in metadata.reqs_to_flush or ():
+        for req_id in metadata.resumed_reqs or ():
This should be preferable perf-wise I think?
We need to flush requests at the moment they are preempted.
This is what we do here, setting reqs_to_flush to the list of preempted requests.
Resumed requests are handled transparently by the scheduler calling get_num_new_matched_tokens and starting an async load (cpu->gpu) of the resumed requests.
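A hedged sketch of the split just described: preempted requests go into the metadata's flush list, while resumed requests need no entry there because the scheduler re-triggers the CPU->GPU load via get_num_new_matched_tokens. OffloadingConnectorMetadata and reqs_to_flush appear in the diff above, but the field definition and helper below are simplified assumptions, not the real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class OffloadingConnectorMetadata:
    # Requests whose pending GPU->CPU writes must complete before their
    # GPU blocks may be reused (i.e. requests preempted this step).
    reqs_to_flush: set[str] = field(default_factory=set)


def build_connector_meta(
    preempted_req_ids: set[str],
) -> OffloadingConnectorMetadata:
    # Only preempted requests are flushed here; resumed requests are
    # covered by the scheduler's normal async CPU->GPU load path.
    return OffloadingConnectorMetadata(reqs_to_flush=set(preempted_req_ids))
```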
This pull request has merge conflicts that must be resolved before it can be merged.
@njhill I got a report from a user suggesting that the fix is not sufficient.
Should be good now. User confirmed preempted requests are successfully re-loaded from CPU (assuming #31583 as well).
This pull request has merge conflicts that must be resolved before it can be merged.
This commit fixes the OffloadingConnector to flush preempted requests to the offloading backend. Without flushing, the GPU KV data may be overwritten before the offloading completes, which could yield KV data corruption. Additionally, this fix allows re-scheduled preempted requests to load KV data back from the offloading backend.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
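A minimal, hypothetical sketch of the ordering the commit message describes (Runner and MockConnector are illustrative names, not vLLM's actual classes): pending GPU->CPU offloads for a preempted request must be flushed before its GPU blocks are reused, otherwise the offload can read already-overwritten data.

```python
class MockConnector:
    """Records flush calls; stands in for the offloading connector."""

    def __init__(self) -> None:
        self.flushed: list[str] = []

    def flush(self, req_id: str) -> None:
        # In the real connector this blocks until pending GPU->CPU
        # offloads for req_id have completed.
        self.flushed.append(req_id)


class Runner:
    def __init__(self, connector: MockConnector) -> None:
        self.connector = connector
        self.freed: list[str] = []

    def preempt(self, req_id: str) -> None:
        # Flush in-flight offloads first...
        self.connector.flush(req_id)
        # ...only then is it safe to reuse the request's GPU blocks.
        # Reversing these two steps is exactly the corruption the fix avoids.
        self.freed.append(req_id)
```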
This PR fixes the OffloadingConnector to fail any stores for a preempted request, and additionally allows storing the request's re-computed blocks.
Note
Ensures safe handling of preempted requests and completes/flushes in-flight KV offload transfers.
- Adds handle_preemptions() to the KV connector base; the OffloadingConnector worker now waits on pending store job IDs for preempted requests, and the scheduler completes stores for preempted IDs in build_connector_meta()
- Invokes handle_preemptions() from gpu_model_runner before blocks are overwritten
- Extends OffloadingHandler/OffloadingWorker with wait(job_ids); the CPU↔GPU handler tracks CUDA events per job and implements a blocking wait()
- Adds test_request_preemption validating flushed block indexes and re-load/store behavior; minor test harness updates

Written by Cursor Bugbot for commit 3f960add3709a8ba8bbf258c9039e58ce19eb698. This will update automatically on new commits.
Note
Fixes offload preemption handling and adds blocking wait support across KV transfer.
- Adds handle_preemptions() to the KV connector base; invoked from gpu_model_runner before blocks are overwritten
- In OffloadingConnectorScheduler.build_connector_meta(), marks in-flight stores complete for preempted_req_ids
- The worker wait()s on the corresponding job IDs for preempted requests
- Extends OffloadingHandler/OffloadingWorker with wait(job_ids); the CPU↔GPU handler tracks CUDA events per job and implements a blocking wait()
- Adds test_request_preemption; minor runner/test harness updates

Written by Cursor Bugbot for commit 5aa68e4. This will update automatically on new commits.
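The blocking wait(job_ids) mentioned in the summaries can be sketched as follows. This is a hedged stand-in: per the summaries the real handler tracks CUDA events per job, while here threading.Event substitutes so the sketch runs without a GPU, and all class and method names are illustrative.

```python
import threading


class OffloadingWorker:
    """Toy worker: one completion event per submitted offload job."""

    def __init__(self) -> None:
        # job_id -> completion event (the real code tracks CUDA events)
        self._job_events: dict[int, threading.Event] = {}

    def submit(self, job_id: int) -> threading.Event:
        # Register a job and hand back its completion event; the producer
        # (or a CUDA callback, in the real system) signals it when done.
        ev = threading.Event()
        self._job_events[job_id] = ev
        return ev

    def wait(self, job_ids: list[int]) -> None:
        # Block until every listed offload job has completed, so the
        # preempted request's GPU blocks are safe to overwrite afterwards.
        for job_id in job_ids:
            ev = self._job_events.pop(job_id, None)
            if ev is not None:
                ev.wait()
```

A CUDA-events version would call event.synchronize() per job instead of Event.wait(); the control flow is the same.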
handle_preemptions()to KV connector base; invoke fromgpu_model_runnerbefore blocks are overwrittenOffloadingConnectorScheduler.build_connector_meta(), mark in-flight stores complete forpreempted_req_idswait()s on their job IDs for preempted requestsOffloadingHandler/OffloadingWorkerwithwait(job_ids); CPU↔GPU handler tracks CUDA events per job and implements blockingwait()test_request_preemption; minor runner/test harness updatesWritten by Cursor Bugbot for commit 5aa68e4. This will update automatically on new commits. Configure here.