OffloadingConnector: Fix GPU block tracking bug #25856
Conversation
Code Review
This pull request addresses a critical bug in the OffloadingConnector that could crash the engine. When preparing K/V cache blocks for offloading failed for one request, the connector stopped processing the remaining requests in the same batch, leaving GPU block tracking in an inconsistent state and eventually raising an IndexError. The fix replaces a break with a continue statement, so that even if offloading cannot be prepared for one request, the other requests in the batch are still processed correctly. The change is small, targeted, and effectively resolves the described crash. The logic is sound and I see no further issues.
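A minimal, self-contained sketch of the behavior the fix restores; all names below (TrackingConnector, prepare_store, process_batch, request_block_ids) are illustrative stand-ins, not the actual identifiers in offloading_connector.py:

```python
# Hypothetical sketch of the break -> continue fix described above.
# Names are illustrative stand-ins, not the actual vLLM identifiers.
class TrackingConnector:
    def __init__(self):
        self.request_block_ids: dict[str, list[int]] = {}

    def prepare_store(self, req_id: str, block_ids: list[int]) -> bool:
        # Pretend the offloading medium has no room for one request.
        return req_id != "req-1"

    def process_batch(self, scheduler_output: dict[str, list[int]]) -> None:
        for req_id, new_block_ids in scheduler_output.items():
            # Always record the new GPU block IDs, even when offloading
            # cannot be prepared for this request.
            self.request_block_ids.setdefault(req_id, []).extend(new_block_ids)
            if not self.prepare_store(req_id, new_block_ids):
                # Before the fix this was a `break`, which skipped the
                # remaining requests and left their GPU block tracking
                # stale; `continue` only skips offloading for this request.
                continue
            # ... enqueue the actual offload/store work here ...


connector = TrackingConnector()
connector.process_batch({"req-0": [0, 1], "req-1": [2], "req-2": [3, 4]})
# With `continue`, req-2's blocks are still tracked even though req-1 failed.
assert connector.request_block_ids["req-2"] == [3, 4]
```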
vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py
This commit fixes a bug in the offloading connector that could result in incorrect per-request GPU block tracking. It occurs when blocks cannot be allocated on the offloaded medium (prepare_store fails) and the scheduler output contains multiple requests, some of them with new GPU block IDs. Before this commit, the connector simply returned without processing the remaining requests and their GPU block IDs. Signed-off-by: Or Ozeri <[email protected]>
This PR fixes a bug in the offloading connector that may result in incorrect per-request GPU block tracking.
It occurs when blocks cannot be allocated on the offloaded medium (prepare_store fails) and the scheduler output contains multiple requests, some of them with new GPU block IDs. Before this PR, the connector simply returned without processing the remaining requests and their GPU block IDs.
This resulted in a crash of the engine core:
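The original traceback is not reproduced here; the snippet below is only a hypothetical illustration, with made-up values, of how stale per-request block tracking can surface as an IndexError:

```python
# Hypothetical illustration (made-up values): if a request's new GPU block
# IDs were never recorded, a later lookup that indexes into the tracked
# list runs past its end and raises IndexError, crashing the engine core.
tracked_block_ids = [0, 1]      # only the blocks recorded before the early return
num_scheduled_blocks = 4        # blocks the scheduler actually allocated

try:
    last_block = tracked_block_ids[num_scheduled_blocks - 1]
except IndexError as exc:
    print(f"stale GPU block tracking: {exc}")  # list index out of range
```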