[BugFix] scheduler: Delay freeing blocks of aborted async loads #32255

Merged
markmc merged 1 commit into vllm-project:main from orozery:sched-delay-free-async-loads
Feb 4, 2026

Collaborator

@orozery orozery commented Jan 13, 2026

This PR fixes the scheduler to delay the freeing of KV cache blocks of aborted requests that are waiting for remote KVs to be loaded.
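
The idea of deferring a free until an in-flight load completes can be sketched roughly as follows. This is a toy model, not vLLM's scheduler: `ToyScheduler`, `abort`, `on_load_finished`, and every field name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model of delayed block freeing; all names are hypothetical."""

    # request id -> block ids currently owned by that request
    blocks: dict[str, list[int]] = field(default_factory=dict)
    # requests whose remote KV load is still in flight
    loading: set[str] = field(default_factory=set)
    # aborted requests whose blocks are freed once their load finishes
    pending_free: set[str] = field(default_factory=set)
    # block ids available for allocation
    free_pool: list[int] = field(default_factory=list)

    def abort(self, req_id: str) -> None:
        if req_id in self.loading:
            # A transfer may still write into these blocks: defer the free.
            self.pending_free.add(req_id)
        else:
            self._free_blocks(req_id)

    def on_load_finished(self, req_id: str) -> None:
        self.loading.discard(req_id)
        if req_id in self.pending_free:
            self.pending_free.remove(req_id)
            self._free_blocks(req_id)

    def _free_blocks(self, req_id: str) -> None:
        self.free_pool.extend(self.blocks.pop(req_id, []))


sched = ToyScheduler(blocks={"req0": [0, 1]}, loading={"req0"})
sched.abort("req0")
blocks_after_abort = list(sched.free_pool)  # snapshot taken while the load is in flight
sched.on_load_finished("req0")
blocks_after_load = list(sched.free_pool)   # snapshot taken after the load completes
```

In this sketch the abort only records the request in `pending_free`; the blocks return to the pool when the load completion is reported.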

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes an issue where KV cache blocks for aborted requests waiting for remote KVs were not freed properly. The changes introduce a delay in freeing blocks until the remote KV transfer completes, which is the right approach. The new tests adequately cover this bug fix scenario. However, I've identified a potential critical issue in _update_from_kv_xfer_finished where a request might be freed twice if its ID appears in both finished_recving and finished_sending lists, potentially causing a crash. I've provided a suggestion to refactor this part to make it more robust.
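
For illustration only, one way to make a free idempotent when the same ID shows up in both lists is to deduplicate the IDs and tolerate a missing entry. The sketch below is not the suggestion referred to above (which is not shown in this thread) and is not vLLM code: `ids_to_free`, `free_finished`, and the `blocks` mapping are hypothetical, and the toy treats both lists identically, which the real scheduler may not.

```python
def ids_to_free(finished_recving: set[str], finished_sending: set[str]) -> set[str]:
    """Return each finished request id exactly once, even if it is in both sets."""
    return finished_recving | finished_sending


def free_finished(
    blocks: dict[str, list[int]],
    finished_recving: set[str],
    finished_sending: set[str],
) -> list[int]:
    """Free the blocks of every finished request and return the freed block ids."""
    freed: list[int] = []
    for req_id in sorted(ids_to_free(finished_recving, finished_sending)):
        # pop() with a default turns a repeated free into a no-op, not a KeyError
        freed.extend(blocks.pop(req_id, []))
    return freed


blocks = {"a": [0], "b": [1, 2]}
# "b" is reported in both sets but must only be freed once
freed = free_finished(blocks, {"a", "b"}, {"b"})
```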

@orozery orozery force-pushed the sched-delay-free-async-loads branch from 502aec5 to 3fe1143 January 13, 2026 11:25
Collaborator Author

orozery commented Jan 13, 2026

> Code Review
>
> a request might be freed twice if its ID appears in both finished_recving and finished_sending lists, potentially causing a crash. I've provided a suggestion to refactor this part to make it more robust.

I don't think there's a use case in which the same request finishes sending and receiving concurrently.
Supporting this case would add too much complexity IMO.

@mergify mergify bot added the bug label Jan 13, 2026
@orozery orozery force-pushed the sched-delay-free-async-loads branch from 3fe1143 to 078eaf7 January 21, 2026 11:24

mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Member

@markmc markmc left a comment

Looks reasonable to me, but I think all of this logic is becoming extremely brittle, and it is now very difficult to reassure yourself that you're not missing some corner case.

The interplay of async vs sync scheduling, async vs sync loads, delayed KV block frees, recompute vs abort on KV load failure, streaming input requests, etc. is getting a bit much.

```python
scheduler.update_from_output(scheduler_output, model_runner_output)

# assert request is deleted
assert request.request_id not in scheduler.requests
```

Member

I wondered whether it would be useful to check the blocks were freed, as the NIXL tests do:

```python
req_to_blocks = scheduler.kv_cache_manager.coordinator.single_type_managers[0].req_to_blocks
assert req0_id not in req_to_blocks
```

but in either case we're verifying that `_free_blocks()` was called, so I guess not

(just noting for reference)

Member

markmc commented Feb 3, 2026

> This PR fixes the scheduler to delay the freeing of KV cache blocks of aborted requests that are waiting for remote KVs to be loaded.

It would be nice to hear a bit more about how exactly this bug manifested ... some sort of crash or memory corruption when the offloading connector attempted to write to freed blocks? Adding logs showing this to the PR description would be helpful ...

@markmc markmc added the ready label Feb 3, 2026
Collaborator Author

orozery commented Feb 3, 2026

> It would be nice to hear a bit more about how exactly this bug manifested ... some sort of crash or memory corruption when the offloading connector attempted to write to freed blocks? Adding logs showing this to the PR description would be helpful ...

There was another bug (or more accurately bugs) I was trying to catch here: #29781
Since I was not able to reproduce it, I just statically went over the code and searched for bugs.
This is how I discovered this bug.
So AFAIK this bug was not actually witnessed by anyone.

…ests

This commit fixes the scheduler to delay the freeing of KV cache blocks
of requests that are waiting for remote KVs to be loaded.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@markmc markmc force-pushed the sched-delay-free-async-loads branch from cfd80b1 to 66df8d1 February 3, 2026 15:42
Member

markmc commented Feb 3, 2026

> There was another bug (or more accurately bugs) I was trying to catch here: #29781
> Since I was not able to reproduce it, I just statically went over the code and searched for bugs.
> This is how I discovered this bug.
> So AFAIK this bug was not actually witnessed by anyone.

Ok, I guess even when it happens, you need the blocks to be reallocated to another request, and even then the only symptom might be garbage output?

(Thanks for the explanation)

Collaborator Author

orozery commented Feb 3, 2026

> Ok, I guess even when it happens, you need the blocks to be reallocated to another request, and even then the only symptom might be garbage output?

Right, you will get corrupted KV data.
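
The failure mode described here can be reproduced in a toy model: a block is freed while a load into it is still pending, gets handed to another request, and the late write then lands on top of the new owner's data. `ToyBlockPool` and its methods are hypothetical stand-ins, not vLLM classes.

```python
class ToyBlockPool:
    """Hypothetical block pool; each block stores the id of its last writer."""

    def __init__(self, num_blocks: int) -> None:
        self.data = [None] * num_blocks
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop(0)

    def release(self, block_id: int) -> None:
        # Most recently freed block is reused first in this toy model.
        self.free.insert(0, block_id)


pool = ToyBlockPool(num_blocks=1)

blk_a = pool.allocate()   # request A owns the block; its async load is in flight
pool.release(blk_a)       # A is aborted and the block is freed immediately (the bug)

blk_b = pool.allocate()   # request B is given the same block
pool.data[blk_b] = "B"    # B writes its own KV data

pool.data[blk_a] = "A"    # A's late load completes and overwrites B's data

corrupted = pool.data[blk_b] != "B"
```

Nothing crashes in this sketch; request B simply reads data it did not write, which matches the "garbage output" symptom discussed above.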

@markmc markmc merged commit 8e32690 into vllm-project:main Feb 4, 2026
41 checks passed
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request Feb 5, 2026
…nc loads (vllm-project#32255)

Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Labels

bug, kv-connector, ready, v1