[Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837)#37859
[Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837)#37859rohithj7 wants to merge 1 commit intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request addresses a race condition where a request can be aborted and cleaned up while a KV cache transfer callback is still in flight. The fix replaces an assertion that the request exists with a check, allowing stale callbacks to be safely ignored for both receiving and sending transfers. This prevents a crash that would otherwise occur. New tests have been added to cover these specific scenarios. The changes appear to correctly resolve the issue.
Purpose
#37837: in PD setups you can abort a request on the decode side while a KV transfer is still finishing. Cleanup removes the request from
self.requests, then_update_from_kv_xfer_finished()gets a latefinished_recving/finished_sendingfor that id. The code asserted the id was still inself.requests, which isn't true in that case, so the decode node dies.Fix is to treat that as a stale callback and skip it instead of asserting. Added a couple tests for the late-notification path.
Test Plan
Linux + Python 3.12 +
requirements/test.txtis what matches CI (the lockfile is Linux/CUDA oriented).Lint:
Tests (focused):
python -m pytest tests/v1/core/test_scheduler.py -v \ -k "abort_request_waiting_for_remote_kvs or abort_request_finished_recving or ignore_late_finished"Whole file if you feel like it:
Commit with
-sfor DCO (Signed-off-by).Test Result
Will paste pytest output here after I run the above on Linux.
Before: n/a (or describe how you repro’d it)
After:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.