[Core] Proactively free KV cache blocks when aborting finished requests by jianzs · Pull Request #35506 · vllm-project/vllm

jianzs · 2026-02-27T11:45:17Z

This is an enhancement to PR #25067 which ignored aborts on finished requests and relied on timeout-based cleanup. Instead of waiting for the connector timeout to free blocks, immediately free them when receiving FINISHED_ABORTED for an already-finished request.

This enables earlier KV cache memory reclamation, which is especially important under heavy load in multi-node scenarios where memory pressure is high.

gemini-code-assist

Code Review

This pull request enhances memory management by proactively freeing KV cache blocks for requests that are aborted after they have already finished. The changes in vllm/v1/core/sched/scheduler.py correctly implement this by adding logic to finish_requests to immediately free blocks for finished requests upon receiving a FINISHED_ABORTED status. The related change to handle a potential race condition in _update_from_kv_xfer_finished is also correct and makes the system more robust. The test case in tests/v1/kv_connector/unit/test_remote_decode_lifecycle.py has been updated appropriately to validate the new behavior. Overall, the changes are well-implemented and achieve the goal of earlier memory reclamation.

orozery · 2026-02-27T12:06:18Z

This change basically allows the scheduler to unilaterally break the "async-save" contract between the scheduler and the connector.
The contract is:
If the connector returns delay_free_blocks=True when the request finishes, the request's blocks will not be freed until the connector returns the request ID in finished_sending.
With this unilateral break of the contract, you replace an assert with a warning.
And now it is up to the connector to figure things out when the contract is broken.
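The contract can be illustrated with a toy model. This is a minimal sketch, not vLLM's scheduler or connector code: `ToyScheduler`, `ToyConnector`, and all of their methods are hypothetical names chosen for illustration; only `delay_free_blocks` and `finished_sending` come from the discussion.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConnector:
    """Hypothetical connector that saves KV blocks asynchronously."""

    in_flight: set[str] = field(default_factory=set)

    def request_finished(self, req_id: str) -> bool:
        # Returning True plays the role of delay_free_blocks=True:
        # the connector still needs the blocks for an async save.
        self.in_flight.add(req_id)
        return True

    def poll_finished_sending(self) -> set[str]:
        # Pretend every in-flight transfer has completed by now.
        done, self.in_flight = self.in_flight, set()
        return done


@dataclass
class ToyScheduler:
    connector: ToyConnector
    blocks: dict[str, list[int]] = field(default_factory=dict)
    delayed: set[str] = field(default_factory=set)

    def finish(self, req_id: str) -> None:
        if self.connector.request_finished(req_id):
            self.delayed.add(req_id)  # keep blocks until finished_sending
        else:
            self.blocks.pop(req_id, None)

    def update_from_connector(self) -> None:
        for req_id in self.connector.poll_finished_sending():
            # The contract: only delayed requests may be reported here.
            assert req_id in self.delayed
            self.delayed.discard(req_id)
            self.blocks.pop(req_id, None)


scheduler = ToyScheduler(ToyConnector(), blocks={"req-0": [1, 2, 3]})
scheduler.finish("req-0")
held_after_finish = "req-0" in scheduler.blocks
scheduler.update_from_connector()
freed_after_report = "req-0" not in scheduler.blocks
```

In this toy model, if the scheduler removed `req-0` from `delayed` on its own, the assert in `update_from_connector` would fire once the connector reported the request, which is presumably why freeing early required downgrading an assert to a warning.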

I think a better alternative is to allow the scheduler to report the "abort after request finished" case to the connector, to allow the connector to clean up and make sure both sides agree on cleanup.

jianzs · 2026-03-03T09:30:52Z

@orozery Thanks for the feedback—do you have any thoughts on how to send this message to the connector?

orozery · 2026-03-03T10:09:16Z

This will require some changes to the connector API, which we are trying to minimize for now while we iterate on its design towards a new connector v2 API.
This is one point we should consider for the v2 API.

For a short-term solution, I think we can handle it this way:

1. In Scheduler.finish_requests, if request.is_finished(), set request.status = finished_status so that the Request object reflects the fact that the request was aborted.
2. In NixlConnectorScheduler.request_finished, save the request in a new self._reqs_being_sent: dict[req_id, Request].
3. In NixlConnectorScheduler.build_connector_meta, check whether any of the requests in self._reqs_being_sent was aborted (by checking their status). For any such aborted request, pass this information to the workers via a new NixlConnectorMetadata.reqs_to_abort: set[ReqId].
4. In NixlConnectorWorker.get_finished, add the request IDs of all aborted requests to done_sending, while also performing any necessary cleanup.
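The four steps of this short-term proposal, taken in order, could look roughly like the sketch below. It is a simplified stand-in under stated assumptions, not the NixlConnector code: the classes, the `Status` enum, and the `bind_metadata` hand-off are hypothetical, and only `_reqs_being_sent`, `reqs_to_abort`, `build_connector_meta`, `request_finished`, `get_finished`, and `done_sending` are names taken from the proposal.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    FINISHED_STOPPED = auto()
    FINISHED_ABORTED = auto()


@dataclass
class Request:
    request_id: str
    status: Status


@dataclass
class ConnectorMetadata:
    # Mirrors the proposed NixlConnectorMetadata.reqs_to_abort field.
    reqs_to_abort: set[str] = field(default_factory=set)


class SchedulerSideConnector:
    def __init__(self) -> None:
        self._reqs_being_sent: dict[str, Request] = {}

    def request_finished(self, request: Request) -> bool:
        # Second step: remember the request so its status can be polled.
        self._reqs_being_sent[request.request_id] = request
        return True  # delay_free_blocks

    def build_connector_meta(self) -> ConnectorMetadata:
        # Third step: find requests whose status was changed to aborted.
        aborted = {
            req_id
            for req_id, req in self._reqs_being_sent.items()
            if req.status is Status.FINISHED_ABORTED
        }
        for req_id in aborted:
            del self._reqs_being_sent[req_id]
        return ConnectorMetadata(reqs_to_abort=aborted)


class WorkerSideConnector:
    def __init__(self) -> None:
        self._pending_aborts: set[str] = set()

    def bind_metadata(self, meta: ConnectorMetadata) -> None:
        # Hypothetical hand-off of scheduler metadata to the worker.
        self._pending_aborts |= meta.reqs_to_abort

    def get_finished(self) -> set[str]:
        # Fourth step: report aborted requests as done sending so the
        # scheduler frees their blocks through the usual path.
        done_sending, self._pending_aborts = self._pending_aborts, set()
        return done_sending


sched_conn, worker_conn = SchedulerSideConnector(), WorkerSideConnector()
req = Request("req-0", Status.FINISHED_STOPPED)
sched_conn.request_finished(req)
req.status = Status.FINISHED_ABORTED  # first step, done by the scheduler
worker_conn.bind_metadata(sched_conn.build_connector_meta())
done = worker_conn.get_finished()
```

The point of this design is that the scheduler never frees blocks itself: it only mutates the request's status, and the connector notices the change on its next metadata build. A real worker would also need to cancel or clean up any in-flight transfer before reporting the request, which this sketch omits.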

cc @NickLucche @njhill

This is an enhancement to PR vllm-project#25067 which ignored aborts on finished requests and relied on timeout-based cleanup. Instead of waiting for the connector timeout to free blocks, immediately free them when receiving FINISHED_ABORTED for an already-finished request. This enables earlier KV cache memory reclamation, which is especially important under heavy load in multi-node scenarios where memory pressure is high. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>

This addresses review feedback for PR vllm-project#35506. The original implementation broke the contract between scheduler and connector by directly freeing blocks without notifying the connector.

Changes:
- Scheduler: Set request.status to FINISHED_ABORTED and call connector for cleanup instead of immediately freeing blocks
- NixlConnectorScheduler: Detect FINISHED_ABORTED status and mark for cleanup via finished_sending mechanism
- NixlConnectorMetadata: Add reqs_abort_done field for worker communication
- NixlConnectorWorker: Track aborted requests and report via finished_sending

This ensures:
- The scheduler-connector contract is maintained
- Connector participates in cleanup process
- Blocks are freed through the established finished_sending mechanism

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>

jianzs · 2026-03-05T02:57:16Z

@orozery Thanks for the detailed feedback! I've implemented your suggested approach:

Implementation Summary:

- Scheduler.finish_requests: When request.is_finished(), set request.status = FINISHED_ABORTED so the connector can detect this case.
- NixlConnectorScheduler.request_finished: Check if request.status == FINISHED_ABORTED, then:
  - Add to self._reqs_abort_done: set[ReqId]
  - Clean up internal state (_reqs_not_processed, _reqs_need_send)
  - Return delay_free_blocks=False to not delay further
- NixlConnectorMetadata: Added reqs_abort_done: set[ReqId] field to pass abort info to workers.
- NixlConnectorWorker:
  - start_load_kv: Extract reqs_abort_done from metadata and store in self._aborted_reqs
  - get_finished: Add aborted requests to done_sending so the scheduler can free blocks via the normal contract

Key Design Decision:

- A finished request can only exist in self.requests when the connector delays block freeing (P/D scenario).
- Therefore, we use assert self.connector is not None to express this invariant.
- Blocks are freed through the normal finished_sending mechanism, maintaining the scheduler-connector contract.

Testing:

Unit tests pass (test_abort_during_kv_transfer validates the flow)
E2E basic inference tests pass

Please let me know if this aligns with your expectations or if any adjustments are needed.

cc @NickLucche @njhill

mergify · 2026-03-05T03:16:13Z

Hi @jianzs, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

orozery · 2026-03-05T08:13:07Z

vllm/v1/core/sched/scheduler.py

+                # Notify connector to participate in cleanup. Blocks will be
+                # freed when connector reports finished_sending.
+                # A finished request can only exist in self.requests when
+                # connector delays block freeing (P/D scenario).
+                assert self.connector is not None
+                self._connector_finished(request)


I think we need to remove this.
Connectors may assume that request_finished is called only once.
See #33377 for example.

Instead, NixlConnector will have to poll each of its "being sent" requests to see if their status was changed.
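To see why a second request_finished call can be unsafe, consider a connector that keeps per-request bookkeeping and assumes exactly one call per request. The class below is a hypothetical illustration and is not taken from any vLLM connector.

```python
class CallOnceConnector:
    """Hypothetical connector that assumes one request_finished per request."""

    def __init__(self) -> None:
        self._finished: set[str] = set()

    def request_finished(self, req_id: str) -> bool:
        if req_id in self._finished:
            # A second call violates the connector's assumption.
            raise RuntimeError(f"request_finished called twice for {req_id}")
        self._finished.add(req_id)
        return True  # delay_free_blocks


connector = CallOnceConnector()
connector.request_finished("req-0")  # normal finish
try:
    connector.request_finished("req-0")  # abort arriving after the finish
    second_call_ok = True
except RuntimeError:
    second_call_ok = False
```

Polling the request status from inside the connector avoids this problem because the scheduler only changes a field on the Request object and never re-enters request_finished.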

jianzs requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners February 27, 2026 11:45

mergify bot added v1 kv-connector labels Feb 27, 2026

gemini-code-assist bot reviewed Feb 27, 2026

orozery requested a review from NickLucche March 3, 2026 10:09

jianzs force-pushed the feat/free-block-for-abort-early branch from 0ea19d8 to 8564936 Compare March 5, 2026 02:56

jianzs force-pushed the feat/free-block-for-abort-early branch from 8564936 to fa732c3 Compare March 5, 2026 03:19

jianzs force-pushed the feat/free-block-for-abort-early branch from fa732c3 to ece5205 Compare March 5, 2026 03:21

jianzs force-pushed the feat/free-block-for-abort-early branch from ece5205 to 753add1 Compare March 5, 2026 03:24

orozery reviewed Mar 5, 2026

jianzs wants to merge 2 commits into vllm-project:main from jianzs:feat/free-block-for-abort-early
