OffloadingConnector: Prevent redundant loads#29087

Merged
njhill merged 3 commits into vllm-project:main from orozery:offloading-prevent-redundant-loads on Jan 21, 2026

Conversation

@orozery
Collaborator

@orozery orozery commented Nov 20, 2025

When handling concurrent requests that hit the same CPU blocks, multiple concurrent CPU->GPU transfers are issued, one per request. If the GPU prefix cache is enabled, this creates unnecessary duplication of KV data in the GPU prefix cache.
This PR changes the OffloadingConnector to detect such cases and delay loading requests whose blocks are already being loaded by other requests.
This reduces the unnecessary load, and the waste of GPU space, otherwise caused by issuing these redundant loads.
This PR also extends the OffloadingManager API to allow for delaying request lookups.


Note

Reduces redundant CPU→GPU loads under GPU prefix caching by deferring overlapping loads and allowing lookup deferral.

  • Scheduler OffloadingConnector: track _blocks_being_loaded; get_num_new_matched_tokens may return None if manager lookup is pending or matched blocks are already loading; clear tracking on load completion
  • API: OffloadingManager.lookup returns int | None; ARC/LRU managers updated to new signature (behavior unchanged)
  • Tests: add concurrent-prefix case ensuring a single load; minor test harness tweak to step once after EOS to kick off offloading

Written by Cursor Bugbot for commit 472a60b. This will update automatically on new commits. Configure here.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization to the OffloadingConnector by preventing redundant CPU-to-GPU transfers of KV cache blocks. By tracking blocks that are currently being loaded, it avoids issuing duplicate load requests for concurrent requests hitting the same blocks. This is particularly beneficial when GPU prefix caching is enabled, as it reduces unnecessary data duplication and GPU memory waste. The extension of the OffloadingManager API to allow delaying request lookups is a necessary change to support this new behavior.

The overall implementation is solid, but I've identified a critical bug that could lead to a TypeError when handling completed load operations. I've included a specific comment with a code suggestion to address this issue. Once that is fixed, this PR should be in good shape.

@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from ee4598d to 5b080b3 on November 20, 2025 12:08
@LucasWilkinson
Collaborator

cc @NickLucche

Collaborator

@NickLucche NickLucche left a comment


Thanks for the fix @orozery . Do you think we could add a simple unit test for such cases?

This looks fine to me, but I am also not too familiar with this connector.

@orozery
Collaborator Author

orozery commented Dec 4, 2025

Thanks for the fix @orozery . Do you think we could add a simple unit test for such cases?

Sorry for the delay. Currently busy with higher priorities.
Will add a test later and ping you.
Thanks!

@njhill njhill self-requested a review January 10, 2026 18:05
Member

@njhill njhill left a comment


Thanks @orozery, nice optimization. Agree with @NickLucche that some kind of test would be good.

@orozery
Collaborator Author

orozery commented Jan 10, 2026

Agree with @NickLucche that some kind of test would be good.

The reason I have not yet added a test is I'm waiting on #29870 which adapts the existing unit test towards what I need to test here.

@orozery
Collaborator Author

orozery commented Jan 12, 2026

I've added a test.
I discovered a scheduler bug on the way, opened #32173 to fix it.

@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from 75825f8 to 4c3056f on January 12, 2026 12:41
@mergify

mergify bot commented Jan 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 12, 2026
@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from 4c3056f to f3450fd on January 12, 2026 19:03
@mergify mergify bot removed the needs-rebase label Jan 12, 2026
@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from f3450fd to 3092946 on January 12, 2026 19:25
When handling concurrent requests hitting the same CPU blocks,
multiple concurrent CPU->GPU transfers will be issued, one per request.
If the GPU prefix cache is enabled, this will create an unnecessary duplication
of KV data in the GPU prefix cache.
This commit changes the OffloadingConnector to detect such cases, and
delay loading requests whose blocks are already being loaded
by other requests.
This results in reducing the unnecessary load and waste of GPU space
otherwise caused by issuing these redundant loads.
This commit also extends the OffloadingManager API to allow for
delaying request lookups.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from 3092946 to 472a60b on January 12, 2026 19:59
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 13, 2026
@njhill njhill enabled auto-merge (squash) January 13, 2026 05:02
@njhill njhill merged commit 7013e9a into vllm-project:main Jan 21, 2026
51 checks passed
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1
