[Core] Proactively free KV cache blocks when aborting finished requests #35506
jianzs wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request enhances memory management by proactively freeing KV cache blocks for requests that are aborted after they have already finished. The changes in vllm/v1/core/sched/scheduler.py correctly implement this by adding logic to finish_requests to immediately free blocks for finished requests upon receiving a FINISHED_ABORTED status. The related change to handle a potential race condition in _update_from_kv_xfer_finished is also correct and makes the system more robust. The test case in tests/v1/kv_connector/unit/test_remote_decode_lifecycle.py has been updated appropriately to validate the new behavior. Overall, the changes are well-implemented and achieve the goal of earlier memory reclamation.
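The core idea described above can be illustrated with a minimal, self-contained sketch. These are hypothetical stand-in classes, not vLLM's actual `Scheduler` or `Request` types: when `finish_requests` receives `FINISHED_ABORTED` for a request that has already finished (and is only still tracked because the connector delays block freeing), the blocks are reclaimed immediately rather than waiting for a connector timeout.

```python
# Hypothetical minimal sketch of the PR's idea; class and method names are
# stand-ins, not vLLM's real API.
from enum import Enum, auto


class RequestStatus(Enum):
    RUNNING = auto()
    FINISHED_STOPPED = auto()
    FINISHED_ABORTED = auto()

    @staticmethod
    def is_finished(status: "RequestStatus") -> bool:
        return status in (RequestStatus.FINISHED_STOPPED,
                          RequestStatus.FINISHED_ABORTED)


class ToyScheduler:
    def __init__(self) -> None:
        # req_id -> status; a finished request lingers here only while the
        # connector delays block freeing (P/D disaggregation scenario).
        self.requests: dict[str, RequestStatus] = {}
        self.freed_blocks: set[str] = set()

    def finish_requests(self, req_id: str, status: RequestStatus) -> None:
        current = self.requests.get(req_id)
        if current is None:
            return
        if (status is RequestStatus.FINISHED_ABORTED
                and RequestStatus.is_finished(current)):
            # Already finished; the abort means nobody will consume its KV
            # cache, so reclaim the blocks right away instead of waiting
            # for the connector timeout.
            self._free_blocks(req_id)
            return
        self.requests[req_id] = status

    def _free_blocks(self, req_id: str) -> None:
        self.freed_blocks.add(req_id)
        del self.requests[req_id]
```

In this toy model, an abort arriving for an already-finished request frees its blocks in the same scheduler step, which is the earlier-reclamation behavior the review summary describes.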
This change basically allows the scheduler to unilaterally break the "async-save" contract between the scheduler and the connector. I think a better alternative is to allow the scheduler to report the "abort after request finished" case to the connector, so the connector can clean up and both sides agree on the cleanup.
@orozery Thanks for the feedback—do you have any thoughts on how to send this message to the connector?
This will require some changes in the connector API, which we are trying to minimize for now while we iterate on its design toward a new connector v2 API. For a short-term solution, I think we can handle it this way:
This is an enhancement to PR vllm-project#25067, which ignored aborts on finished requests and relied on timeout-based cleanup. Instead of waiting for the connector timeout to free blocks, immediately free them when receiving FINISHED_ABORTED for an already-finished request. This enables earlier KV cache memory reclamation, which is especially important under heavy load in multi-node scenarios where memory pressure is high.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
This addresses review feedback for PR vllm-project#35506. The original implementation broke the contract between scheduler and connector by directly freeing blocks without notifying the connector.

Changes:
- Scheduler: Set request.status to FINISHED_ABORTED and call the connector for cleanup instead of immediately freeing blocks
- NixlConnectorScheduler: Detect FINISHED_ABORTED status and mark for cleanup via the finished_sending mechanism
- NixlConnectorMetadata: Add a reqs_abort_done field for worker communication
- NixlConnectorWorker: Track aborted requests and report them via finished_sending

This ensures:
- The scheduler-connector contract is maintained
- The connector participates in the cleanup process
- Blocks are freed through the established finished_sending mechanism

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
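The scheduler-to-worker handoff described in this commit message can be sketched with toy stand-ins. Only the names `reqs_abort_done` and `finished_sending` come from the commit message; the class shapes and method bodies below are assumptions, not vLLM's real NixlConnector implementation:

```python
# Illustrative sketch only: toy stand-ins for the NixlConnector pieces named
# in the commit message. Real vLLM types and signatures differ.
from dataclasses import dataclass, field


@dataclass
class ToyConnectorMetadata:
    # reqs_abort_done: request IDs whose abort-after-finish cleanup the
    # worker should acknowledge through the finished_sending path.
    reqs_abort_done: set[str] = field(default_factory=set)


class ToyConnectorWorker:
    def __init__(self) -> None:
        self._done_sending: set[str] = set()

    def bind_connector_metadata(self, meta: ToyConnectorMetadata) -> None:
        # Treat aborted-after-finish requests as "done sending" so the
        # scheduler frees their blocks via the existing mechanism.
        self._done_sending |= meta.reqs_abort_done

    def get_finished_sending(self) -> set[str]:
        # Drain and return the set; each request is reported exactly once.
        done, self._done_sending = self._done_sending, set()
        return done
```

The design point this models: rather than adding a new cleanup RPC, the aborted requests ride the already-established finished_sending reporting channel, so the scheduler's block-freeing path stays unchanged.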
Force-pushed from 0ea19d8 to 8564936
@orozery Thanks for the detailed feedback! I've implemented your suggested approach:

Implementation Summary:
Key Design Decision:
Testing:
Please let me know if this aligns with your expectations or if any adjustments are needed.
Hi @jianzs, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from 8564936 to fa732c3
Force-pushed from fa732c3 to ece5205
Force-pushed from ece5205 to 753add1
# Notify connector to participate in cleanup. Blocks will be
# freed when connector reports finished_sending.
# A finished request can only exist in self.requests when
# connector delays block freeing (P/D scenario).
assert self.connector is not None
self._connector_finished(request)
I think we need to remove this.
Connectors may assume that request_finished is called only once.
See #33377 for example.
Instead, NixlConnector will have to poll each of its "being sent" requests to see if their status was changed.
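The polling alternative suggested here can be sketched as a small helper. The shapes below are assumptions for illustration, not the real NixlConnector API: each scheduler step, the connector scans its in-flight "being sent" requests and treats any whose status has flipped to FINISHED_ABORTED as done, instead of relying on a second request_finished call.

```python
# Sketch of the suggested polling approach; function and parameter names
# are hypothetical, not part of vLLM's connector API.
def poll_aborted_sends(in_flight: set[str],
                       request_status: dict[str, str]) -> set[str]:
    """Return in-flight sends whose request was aborted.

    The caller would report these via finished_sending so the scheduler
    frees their blocks exactly once; request_finished is never re-invoked.
    """
    aborted = {rid for rid in in_flight
               if request_status.get(rid) == "FINISHED_ABORTED"}
    # Remove aborted sends from the in-flight set in place so they are
    # not reported again on the next poll.
    in_flight -= aborted
    return aborted
```

This keeps the "request_finished is called only once" assumption intact: the status change is observed by the connector rather than pushed to it through a repeated callback.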