[BugFix] scheduler: Delay freeing blocks of aborted async loads#32255
markmc merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request correctly fixes an issue where KV cache blocks for aborted requests waiting for remote KVs were not freed properly. The changes introduce a delay in freeing blocks until the remote KV transfer completes, which is the right approach. The new tests adequately cover this bug fix scenario. However, I've identified a potential critical issue in _update_from_kv_xfer_finished where a request might be freed twice if its ID appears in both finished_recving and finished_sending lists, potentially causing a crash. I've provided a suggestion to refactor this part to make it more robust.
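The double-free concern above can be sketched as follows. This is a hypothetical illustration only — the function and list names mirror the review comment, not the actual vLLM code — showing one way to deduplicate request IDs so a request appearing in both `finished_recving` and `finished_sending` is freed at most once:

```python
def free_finished_requests(finished_recving, finished_sending, free_fn):
    """Hypothetical sketch: free each request at most once, even if its
    ID appears in both the finished-receiving and finished-sending lists."""
    freed = set()
    for req_id in list(finished_recving) + list(finished_sending):
        if req_id in freed:
            continue  # already freed; skipping avoids a double-free crash
        freed.add(req_id)
        free_fn(req_id)
    return freed
```

Whether this guard is actually needed depends on whether the same request can finish sending and receiving in the same step, which is questioned below.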
I don't think there's a use case where the same request will be concurrently finished sending and recving.
This pull request has merge conflicts that must be resolved before it can be merged.
markmc left a comment:
Looks reasonable to me, but I think all of this logic is becoming extremely brittle and it is very difficult now to reassure yourself that you're not missing some corner case
The interplay of async vs sync scheduling, async vs sync loads, delayed KV block frees, recompute vs abort on KV load failure, streaming input requests, etc. etc. is getting a bit much
```python
scheduler.update_from_output(scheduler_output, model_runner_output)

# assert request is deleted
assert request.request_id not in scheduler.requests
```
I wondered whether it would be useful to check the blocks were freed, as the NIXL tests do:
```python
req_to_blocks = scheduler.kv_cache_manager.coordinator.single_type_managers[0].req_to_blocks
assert req0_id not in req_to_blocks
```
but in either case we're verifying that `_free_blocks()` was called, so I guess not
(just noting for reference)
It would be nice to hear a bit more about how exactly this bug manifested ... some sort of crash or memory corruption when the offloading connector attempted to write to freed blocks? Adding logs showing this to the PR description would be helpful ...
There was another bug (or more accurately bugs) I was trying to catch here: #29781
…ests

This commit fixes the scheduler to delay the freeing of KV cache blocks of requests that are waiting for remote KVs to be loaded.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Ok, I guess even when it happens, you need the blocks to be reallocated to another request, and even then the only symptom might be garbage output? (Thanks for the explanation)
Right, you will get corrupted KV data.
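To make that failure mode concrete, here is a toy sketch (hypothetical names, not vLLM code) of how a block freed before the async transfer completes can be reallocated to another request and then clobbered by the late remote write:

```python
# Toy illustration of premature-free corruption (hypothetical, not vLLM code).
block_pool = {0: None}        # block_id -> owning req_id (None = free)
block_data = {0: "free"}      # block_id -> contents

# Request A's blocks were freed by an abort while its remote load was in
# flight; block 0 is then reallocated to request B, which writes its KV data.
block_pool[0] = "req_b"
block_data[0] = "req_b_kv"

def late_remote_write(block_id, payload):
    # The async load for the aborted request finishes after reallocation
    # and writes into a block it no longer owns.
    block_data[block_id] = payload

late_remote_write(0, "req_a_kv")
# req_b now silently reads req_a's KV data from its own block.
```

Nothing crashes here, which matches the observation above that the only symptom may be garbage output.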
…nc loads (vllm-project#32255)

Fixes a not-yet-reported case where it was possible for blocks to be freed by an abort before an async transfer completed, resulting in corrupted KV data.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
This PR fixes the scheduler to delay freeing the KV cache blocks of aborted requests that are still waiting for remote KVs to be loaded.
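The shape of the fix can be sketched roughly as follows. This is a minimal illustrative model with hypothetical class and attribute names, not the actual scheduler code: an abort on a request with an in-flight async KV load defers the free, and the blocks are actually released when the transfer-finished notification arrives.

```python
# Hypothetical sketch of delayed block freeing (not vLLM internals).
class ToyScheduler:
    def __init__(self):
        self.waiting_on_remote_kv = set()  # req_ids with in-flight async loads
        self.pending_free = set()          # aborted, blocks not yet freed
        self.allocated_blocks = {}         # req_id -> list of block ids

    def abort(self, req_id):
        if req_id in self.waiting_on_remote_kv:
            # Delay: the connector may still write into these blocks.
            self.pending_free.add(req_id)
        else:
            self._free_blocks(req_id)

    def on_kv_load_finished(self, req_id):
        self.waiting_on_remote_kv.discard(req_id)
        if req_id in self.pending_free:
            self.pending_free.discard(req_id)
            self._free_blocks(req_id)

    def _free_blocks(self, req_id):
        self.allocated_blocks.pop(req_id, None)
```

In this sketch an abort with no transfer in flight frees immediately, so the common path is unchanged; only the abort-during-async-load corner is deferred.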