OffloadingConnector: Prevent redundant loads#29087

Merged
njhill merged 3 commits into vllm-project:main from orozery:offloading-prevent-redundant-loads on Jan 21, 2026

Conversation

@orozery
Collaborator

@orozery orozery commented Nov 20, 2025

When handling concurrent requests that hit the same CPU blocks, multiple concurrent CPU->GPU transfers are issued, one per request. If the GPU prefix cache is enabled, this creates unnecessary duplication of KV data in the GPU prefix cache.
This PR changes the OffloadingConnector to detect such cases and delay loading requests whose blocks are already being loaded by other requests.
This reduces the unnecessary load, and the waste of GPU space, otherwise caused by issuing these redundant loads.
This PR also extends the OffloadingManager API to allow for delaying request lookups.


Note

Reduces redundant CPU→GPU loads under GPU prefix caching by deferring overlapping loads and allowing lookup deferral.

  • Scheduler OffloadingConnector: track _blocks_being_loaded; get_num_new_matched_tokens may return None if manager lookup is pending or matched blocks are already loading; clear tracking on load completion
  • API: OffloadingManager.lookup returns int | None; ARC/LRU managers updated to new signature (behavior unchanged)
  • Tests: add concurrent-prefix case ensuring a single load; minor test harness tweak to step once after EOS to kick off offloading

Written by Cursor Bugbot for commit 472a60b. This will update automatically on new commits. Configure here.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization to the OffloadingConnector by preventing redundant CPU-to-GPU transfers of KV cache blocks. By tracking blocks that are currently being loaded, it avoids issuing duplicate load requests for concurrent requests hitting the same blocks. This is particularly beneficial when GPU prefix caching is enabled, as it reduces unnecessary data duplication and GPU memory waste. The extension of the OffloadingManager API to allow delaying request lookups is a necessary change to support this new behavior.

The overall implementation is solid, but I've identified a critical bug that could lead to a TypeError when handling completed load operations. I've included a specific comment with a code suggestion to address this issue. Once that is fixed, this PR should be in good shape.

@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from ee4598d to 5b080b3 on November 20, 2025 12:08
@LucasWilkinson
Collaborator

cc @NickLucche

Collaborator

@NickLucche NickLucche left a comment


Thanks for the fix @orozery . Do you think we could add a simple unit test for such cases?

This looks fine to me, but I am also not too familiar with this connector.

@orozery
Collaborator Author

orozery commented Dec 4, 2025

Thanks for the fix @orozery . Do you think we could add a simple unit test for such cases?

Sorry for the delay. Currently busy with higher priorities.
Will add a test later and ping you.
Thanks!

@njhill njhill self-requested a review January 10, 2026 18:05
Member

@njhill njhill left a comment


Thanks @orozery, nice optimization. Agree with @NickLucche that some kind of test would be good.

@orozery
Collaborator Author

orozery commented Jan 10, 2026

Agree with @NickLucche that some kind of test would be good.

The reason I have not yet added a test is I'm waiting on #29870 which adapts the existing unit test towards what I need to test here.

@orozery
Collaborator Author

orozery commented Jan 12, 2026

I've added a test.
I discovered a scheduler bug on the way, opened #32173 to fix it.

@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from 75825f8 to 4c3056f on January 12, 2026 12:41
@mergify

mergify bot commented Jan 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 12, 2026
@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from 4c3056f to f3450fd on January 12, 2026 19:03
@mergify mergify bot removed the needs-rebase label Jan 12, 2026
@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from f3450fd to 3092946 on January 12, 2026 19:25
When handling concurrent requests hitting the same CPU blocks,
multiple concurrent CPU->GPU transfers will be issued, one per request.
If the GPU prefix cache is enabled, this will create an unnecessary duplication
of KV data in the GPU prefix cache.
This commit changes the OffloadingConnector to detect such cases, and
delay loading requests whose blocks are already being loaded
by other requests.
This results in reducing the unnecessary load and waste of GPU space
otherwise caused by issuing these redundant loads.
This commit also extends the OffloadingManager API to allow for
delaying request lookups.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the offloading-prevent-redundant-loads branch from 3092946 to 472a60b on January 12, 2026 19:59
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 13, 2026
@njhill njhill enabled auto-merge (squash) January 13, 2026 05:02
@njhill njhill merged commit 7013e9a into vllm-project:main Jan 21, 2026
51 checks passed
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1
