OffloadingConnector: Prevent redundant loads#29087
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable optimization to the OffloadingConnector by preventing redundant CPU-to-GPU transfers of KV cache blocks. By tracking blocks that are currently being loaded, it avoids issuing duplicate load requests for concurrent requests hitting the same blocks. This is particularly beneficial when GPU prefix caching is enabled, as it reduces unnecessary data duplication and GPU memory waste. The extension of the OffloadingManager API to allow delaying request lookups is a necessary change to support this new behavior.
The overall implementation is solid, but I've identified a critical bug that could lead to a TypeError when handling completed load operations. I've included a specific comment with a code suggestion to address this issue. Once that is fixed, this PR should be in good shape.
ee4598d to
5b080b3
Compare
|
cc @NickLucche |
NickLucche
left a comment
There was a problem hiding this comment.
Thanks for the fix @orozery . Do you think we could add a simple unit test for such cases?
This looks fine to me, but I am also not too familiar with this connector.
Sorry for the delay. Currently busy with higher priorities. |
njhill
left a comment
There was a problem hiding this comment.
Thanks @orozery, nice optimization. Agree with @NickLucche that some kind of test would be good.
vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py
Outdated
Show resolved
Hide resolved
The reason I have not yet added a test is I'm waiting on #29870 which adapts the existing unit test towards what I need to test here. |
5b080b3 to
75825f8
Compare
|
I've added a test. |
75825f8 to
4c3056f
Compare
vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py
Outdated
Show resolved
Hide resolved
|
This pull request has merge conflicts that must be resolved before it can be |
4c3056f to
f3450fd
Compare
f3450fd to
3092946
Compare
When handling concurrent requests hitting the same CPU blocks, multiple concurrent CPU->GPU transfers will be issued, one per each request. If the GPU prefix cache is enabled, this will create an unnecessary duplication of KV data in the GPU prefix cache. This commit changes the OffloadingConnector to detect such cases, and delay loading requests which have some of their blocks already being loaded by other requests. This results in reducing the unnecessary load and waste of GPU space otherwise caused by issuing this redundant loads. This commit also extends the OffloadingManager API to allow for delaying requests lookups. Signed-off-by: Or Ozeri <oro@il.ibm.com>
3092946 to
472a60b
Compare
Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Or Ozeri <oro@il.ibm.com>
When handling concurrent requests hitting the same CPU blocks, multiple concurrent CPU->GPU transfers will be issued, one per each request. If the GPU prefix cache is enabled, this will create an unnecessary duplication of KV data in the GPU prefix cache.
This PR changes the OffloadingConnector to detect such cases, and delay loading requests which have some of their blocks already being loaded by other requests.
This results in reducing the unnecessary load and waste of GPU space otherwise caused by issuing this redundant loads.
This PR also extends the OffloadingManager API to allow for delaying requests lookups.
Note
Cursor Bugbot is generating a summary for commit 75825f82eff921c6c40a44247f3939d99f4e4f72. Configure here.
Note
Reduces duplicate CPU→GPU transfers when multiple requests share a prefix and GPU prefix caching is enabled.
OffloadingConnectornow tracksblock_hashes being loaded and returnsNonefromget_num_new_matched_tokensto defer scheduling if a lookup is pending or overlapping blocks are already loading; clears tracking on load completionOffloadingManager.lookupAPI changed to returnint | None; ARC/LRU managers updated accordingly (still return anint)FCFSRequestQueue.prepend_requestsupdatedWritten by Cursor Bugbot for commit 75825f82eff921c6c40a44247f3939d99f4e4f72. This will update automatically on new commits. Configure here.
Note
Reduces redundant CPU→GPU transfers when multiple requests share a prefix under GPU prefix caching by deferring overlapping loads.
OffloadingConnector: tracksblock_hashes being loaded;get_num_new_matched_tokensmay returnNoneto defer scheduling if lookup pending or overlap detected; updates tracking on load prepare/completeOffloadingManager.lookupnow returnsint | None; ARC/LRU managers updated to new signature (still return counts)FCFSRequestQueue.prepend_requestsno longer reverses inputWritten by Cursor Bugbot for commit 4c3056f19fed137b08bbbcb4c74fb45d34948d06. This will update automatically on new commits. Configure here.
Note
Reduces duplicate CPU→GPU loads when multiple requests share a prefix under GPU prefix caching by deferring overlapping loads.
OffloadingConnector:get_num_new_matched_tokensreturnsint | None; delays when manager lookup is pending or hit blocks are already loading; tracks_blocks_being_loadedand clears on load completionOffloadingManager.lookupnow returnsint | None; ARC/LRU managers updated to new signature (still return counts)Written by Cursor Bugbot for commit f3450fdb8ccfb7e7ebf8ce468f7f15ef76959ed4. This will update automatically on new commits. Configure here.
Note
Reduces redundant CPU→GPU transfers when multiple requests share a prefix under GPU prefix caching by deferring overlapping loads.
OffloadingConnector:get_num_new_matched_tokensreturnsint | None; tracks_blocks_being_loadedto avoid duplicate loads; updates tracking on prepare/complete loadOffloadingManager.lookupnow returnsint | Noneto allow delayed scheduling; ARC/LRU managers updated (still return counts)Written by Cursor Bugbot for commit 3092946a2b6991f22aed28d1c7ef49eb4d744c7c. This will update automatically on new commits. Configure here.
Note
Reduces redundant transfers under GPU prefix caching by deferring overlapping loads and allowing lookup deferral.
OffloadingConnector: track_blocks_being_loaded;get_num_new_matched_tokensmay returnNonewhen manager lookup is pending or hit blocks are already loading; update tracking on prepare/complete loadOffloadingManager.lookupto returnint | None; update ARC/LRU managers to new signatureWritten by Cursor Bugbot for commit 3092946a2b6991f22aed28d1c7ef49eb4d744c7c. This will update automatically on new commits. Configure here.
Note
Reduces duplicate CPU→GPU transfers when multiple requests share a prefix (with GPU prefix caching enabled) by deferring overlapping loads.
OffloadingConnector:get_num_new_matched_tokensreturnsint | None; tracks_blocks_being_loadedto avoid redundant loads; updates tracking on prepare/complete loadOffloadingManager.lookupnow returnsint | Noneto signal deferred lookups; ARC/LRU managers updated to new signature (still return counts)EOSto kick off offloading; new test ensures concurrent lookups trigger only one load and the second request uses the GPU prefix cacheWritten by Cursor Bugbot for commit 3092946a2b6991f22aed28d1c7ef49eb4d744c7c. This will update automatically on new commits. Configure here.
Note
Reduces redundant CPU→GPU transfers when multiple requests share a prefix under GPU prefix caching by deferring overlapping loads.
OffloadingConnector:get_num_new_matched_tokensnow returnsint | None; tracks_blocks_being_loadedto avoid duplicate loads; defers when manager lookup is pending or hit blocks are already loading; clears tracking on load completionOffloadingManager.lookupsignature changed toint | None; ARC/LRU managers updated to new signature (still return counts)EOSto kick off offloading and adjust expectationsWritten by Cursor Bugbot for commit 3092946a2b6991f22aed28d1c7ef49eb4d744c7c. This will update automatically on new commits. Configure here.
Note
Cursor Bugbot is generating a summary for commit 472a60b. Configure here.
Note
Reduces duplicate CPU→GPU transfers when concurrent requests share a prefix under GPU prefix caching.
OffloadingConnector: track_blocks_being_loaded;get_num_new_matched_tokensnow returnsint | Noneand defers when manager lookup is pending or overlapping blocks are already loading; clear tracking on load completionOffloadingManager.lookuptoint | None; update ARC/LRU managers to new signature (behavior unchanged)EOSto kick off offloadingWritten by Cursor Bugbot for commit 472a60b. This will update automatically on new commits. Configure here.
Note
Cursor Bugbot is generating a summary for commit 472a60b. Configure here.
Note
Reduces redundant CPU→GPU loads under GPU prefix caching by deferring overlapping loads and allowing lookup deferral.
OffloadingConnector: track_blocks_being_loaded;get_num_new_matched_tokensmay returnNoneif manager lookup is pending or matched blocks are already loading; clear tracking on load completionOffloadingManager.lookupreturnsint | None; ARC/LRU managers updated to new signature (behavior unchanged)EOSto kick off offloadingWritten by Cursor Bugbot for commit 472a60b. This will update automatically on new commits. Configure here.