
[Core] Proactively free KV cache blocks when aborting finished requests #35506

Open
jianzs wants to merge 2 commits into vllm-project:main from jianzs:feat/free-block-for-abort-early

@jianzs
Contributor

@jianzs jianzs commented Feb 27, 2026

This is an enhancement to PR #25067, which ignored aborts on finished requests and relied on timeout-based cleanup. Instead of waiting for the connector timeout to free blocks, this PR frees them immediately on receiving FINISHED_ABORTED for an already-finished request.

This enables earlier KV cache memory reclamation, which is especially important under heavy load in multi-node scenarios where memory pressure is high.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances memory management by proactively freeing KV cache blocks for requests that are aborted after they have already finished. The changes in vllm/v1/core/sched/scheduler.py correctly implement this by adding logic to finish_requests to immediately free blocks for finished requests upon receiving a FINISHED_ABORTED status. The related change to handle a potential race condition in _update_from_kv_xfer_finished is also correct and makes the system more robust. The test case in tests/v1/kv_connector/unit/test_remote_decode_lifecycle.py has been updated appropriately to validate the new behavior. Overall, the changes are well-implemented and achieve the goal of earlier memory reclamation.

@orozery
Collaborator

orozery commented Feb 27, 2026

This change basically allows the scheduler to unilaterally break the "async-save" contract between the scheduler and the connector.
The contract is:
If the connector returns delay_free_blocks=True when the request finishes, the request's blocks will not be freed until the connector returns the request ID in finished_sending.
With this unilateral break of the contract, you replace an assert with a warning.
And now it is up to the connector to figure things out when the contract is broken.
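
For illustration, the contract can be sketched with toy classes. Everything below (ToyConnector, ToyScheduler, poll_finished_sending, the bool return value) is a hypothetical simplification and not vLLM's actual API; it only shows that blocks held under delay_free_blocks=True are released when the connector reports the request as finished sending, and not before.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConnector:
    """Hypothetical stand-in for a KV connector; not the vLLM API."""

    in_flight: set[str] = field(default_factory=set)

    def request_finished(self, req_id: str) -> bool:
        # Returning True plays the role of delay_free_blocks=True:
        # the connector still needs the blocks for an async save.
        self.in_flight.add(req_id)
        return True

    def poll_finished_sending(self) -> set[str]:
        # Assumption: every in-flight transfer has completed by now.
        done, self.in_flight = self.in_flight, set()
        return done


@dataclass
class ToyScheduler:
    connector: ToyConnector
    blocks: dict[str, list[int]] = field(default_factory=dict)
    freed: list[str] = field(default_factory=list)

    def finish(self, req_id: str) -> None:
        delay_free_blocks = self.connector.request_finished(req_id)
        if not delay_free_blocks:
            self._free(req_id)

    def update_from_connector(self) -> None:
        # Blocks are released only for IDs the connector reports.
        for req_id in self.connector.poll_finished_sending():
            self._free(req_id)

    def _free(self, req_id: str) -> None:
        # A scheduler that also freed blocks on its own would trip this.
        assert req_id in self.blocks, f"{req_id} was already freed"
        del self.blocks[req_id]
        self.freed.append(req_id)


sched = ToyScheduler(ToyConnector(), {"r1": [0, 1]})
sched.finish("r1")             # blocks stay allocated
sched.update_from_connector()  # now they are freed
```

In this toy model, freeing "r1" from the scheduler side before update_from_connector runs would make the later, connector-driven free fail the assert, which is the situation the comment above describes.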

I think a better alternative is to allow the scheduler to report the "abort after request finished" case to the connector, to allow the connector to clean up and make sure both sides agree on cleanup.

@jianzs
Contributor Author

jianzs commented Mar 3, 2026


@orozery Thanks for the feedback—do you have any thoughts on how to send this message to the connector?

@orozery
Collaborator

orozery commented Mar 3, 2026


This will require some changes to the connector API, which we are trying to minimize for now while we iterate on its design toward a new connector v2 API.
This is one point we should consider for the v2 API.

For a short-term solution, I think we can handle it this way:

  1. In Scheduler.finish_requests, if request.is_finished(), set request.status = finished_status so that the Request object reflects the fact that the request was aborted.
  2. In NixlConnectorScheduler.request_finished, save the request in a new self._reqs_being_sent: dict[req_id, Request].
  3. In NixlConnectorScheduler.build_connector_meta, check whether any of the requests in self._reqs_being_sent was aborted (by checking their status). For any such aborted request, pass this information to the workers via a new NixlConnectorMetadata.reqs_to_abort: set[ReqId].
  4. In NixlConnectorWorker.get_finished, add the request IDs of all aborted requests to done_sending, while also performing any necessary cleanup.
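
The four steps above could look roughly like the following. This is a minimal sketch with stand-in classes: Status, Request, Metadata, ConnectorSchedulerSide and ConnectorWorkerSide are hypothetical simplifications, and the method signatures are assumptions that do not match the real NixlConnector code.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    FINISHED_STOPPED = auto()
    FINISHED_ABORTED = auto()


@dataclass
class Request:
    request_id: str
    status: Status = Status.RUNNING

    def is_finished(self) -> bool:
        return self.status is not Status.RUNNING


@dataclass
class Metadata:
    # Hypothetical counterpart of NixlConnectorMetadata.reqs_to_abort.
    reqs_to_abort: set[str] = field(default_factory=set)


def finish_requests(request: Request, finished_status: Status) -> None:
    # Step 1: record the abort on an already-finished request.
    if request.is_finished():
        request.status = finished_status


class ConnectorSchedulerSide:
    def __init__(self) -> None:
        self._reqs_being_sent: dict[str, Request] = {}

    def request_finished(self, request: Request) -> bool:
        # Step 2: remember the request while its blocks are being sent.
        self._reqs_being_sent[request.request_id] = request
        return True  # stands in for delay_free_blocks=True

    def build_connector_meta(self) -> Metadata:
        # Step 3: detect aborts by inspecting the saved requests' status.
        aborted = {
            rid
            for rid, req in self._reqs_being_sent.items()
            if req.status is Status.FINISHED_ABORTED
        }
        for rid in aborted:
            del self._reqs_being_sent[rid]
        return Metadata(reqs_to_abort=aborted)


class ConnectorWorkerSide:
    def get_finished(self, meta: Metadata, done_sending: set[str]) -> set[str]:
        # Step 4: report aborted requests as done sending.
        return done_sending | meta.reqs_to_abort


req = Request("r1", Status.FINISHED_STOPPED)
sched_side = ConnectorSchedulerSide()
sched_side.request_finished(req)
finish_requests(req, Status.FINISHED_ABORTED)
meta = sched_side.build_connector_meta()
done = ConnectorWorkerSide().get_finished(meta, set())
```

Here request_finished is called exactly once per request; the abort reaches the connector only through the shared Request object's status.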

cc @NickLucche @njhill

@orozery orozery requested a review from NickLucche March 3, 2026 10:09
This is an enhancement to PR vllm-project#25067 which ignored aborts on finished
requests and relied on timeout-based cleanup. Instead of waiting for
the connector timeout to free blocks, immediately free them when
receiving FINISHED_ABORTED for an already-finished request.

This enables earlier KV cache memory reclamation, which is especially
important under heavy load in multi-node scenarios where memory
pressure is high.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
jianzs pushed a commit to jianzs/vllm that referenced this pull request Mar 5, 2026
This addresses review feedback for PR vllm-project#35506. The original implementation
broke the contract between scheduler and connector by directly freeing
blocks without notifying the connector.

Changes:
- Scheduler: Set request.status to FINISHED_ABORTED and call connector
  for cleanup instead of immediately freeing blocks
- NixlConnectorScheduler: Detect FINISHED_ABORTED status and mark for
  cleanup via finished_sending mechanism
- NixlConnectorMetadata: Add reqs_abort_done field for worker communication
- NixlConnectorWorker: Track aborted requests and report via finished_sending

This ensures:
- The scheduler-connector contract is maintained
- Connector participates in cleanup process
- Blocks are freed through the established finished_sending mechanism

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
@jianzs jianzs force-pushed the feat/free-block-for-abort-early branch from 0ea19d8 to 8564936 Compare March 5, 2026 02:56
@jianzs
Contributor Author

jianzs commented Mar 5, 2026

@orozery Thanks for the detailed feedback! I've implemented your suggested approach:

Implementation Summary:

  1. Scheduler.finish_requests: When request.is_finished(), set request.status = FINISHED_ABORTED so the connector can detect this case.

  2. NixlConnectorScheduler.request_finished: Check if request.status == FINISHED_ABORTED, then:

    • Add to self._reqs_abort_done: set[ReqId]
    • Clean up internal state (_reqs_not_processed, _reqs_need_send)
    • Return delay_free_blocks=False to not delay further
  3. NixlConnectorMetadata: Added reqs_abort_done: set[ReqId] field to pass abort info to workers.

  4. NixlConnectorWorker:

    • start_load_kv: Extract reqs_abort_done from metadata and store in self._aborted_reqs
    • get_finished: Add aborted requests to done_sending so the scheduler can free blocks via the normal contract
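
As a rough illustration of the flow summarized above, the sketch below uses toy classes. The names ToyMetadata, ToyConnectorScheduler and ToyConnectorWorker, the string status, and the simplified signatures are assumptions made for this example; only the field names _reqs_abort_done, _reqs_not_processed, _reqs_need_send and _aborted_reqs come from the summary.

```python
from dataclasses import dataclass, field

FINISHED_ABORTED = "FINISHED_ABORTED"  # placeholder for the real status enum


@dataclass
class ToyMetadata:
    reqs_abort_done: set[str] = field(default_factory=set)


class ToyConnectorScheduler:
    def __init__(self) -> None:
        self._reqs_need_send: dict[str, float] = {}
        self._reqs_not_processed: set[str] = set()
        self._reqs_abort_done: set[str] = set()

    def request_finished(self, req_id: str, status: str) -> bool:
        """Return the toy equivalent of delay_free_blocks."""
        if status == FINISHED_ABORTED:
            # Second call for an already-finished request: mark it for
            # cleanup and drop internal bookkeeping.
            self._reqs_abort_done.add(req_id)
            self._reqs_not_processed.discard(req_id)
            self._reqs_need_send.pop(req_id, None)
            return False
        self._reqs_need_send[req_id] = 0.0  # hypothetical expiry value
        return True

    def build_connector_meta(self) -> ToyMetadata:
        # Hand the pending aborts to the worker and reset the set.
        meta = ToyMetadata(reqs_abort_done=self._reqs_abort_done)
        self._reqs_abort_done = set()
        return meta


class ToyConnectorWorker:
    def __init__(self) -> None:
        self._aborted_reqs: set[str] = set()

    def start_load_kv(self, meta: ToyMetadata) -> None:
        self._aborted_reqs |= meta.reqs_abort_done

    def get_finished(self) -> set[str]:
        # Aborted requests are reported as done sending exactly once.
        done_sending, self._aborted_reqs = self._aborted_reqs, set()
        return done_sending
```

Note that in this sketch request_finished runs twice for the same request, once at normal completion and once on abort.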

Key Design Decision:

  • A finished request can only exist in self.requests when the connector delays block freeing (P/D scenario)
  • Therefore, we use assert self.connector is not None to express this invariant
  • Blocks are freed through the normal finished_sending mechanism, maintaining the scheduler-connector contract

Testing:

  • Unit tests pass (test_abort_during_kv_transfer validates the flow)
  • E2E basic inference tests pass

Please let me know if this aligns with your expectations or if any adjustments are needed.

cc @NickLucche @njhill

@mergify

mergify bot commented Mar 5, 2026

Hi @jianzs, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

jianzs added a commit to jianzs/vllm that referenced this pull request Mar 5, 2026
This addresses review feedback for PR vllm-project#35506. The original implementation
broke the contract between scheduler and connector by directly freeing
blocks without notifying the connector.

Changes:
- Scheduler: Set request.status to FINISHED_ABORTED and call connector
  for cleanup instead of immediately freeing blocks
- NixlConnectorScheduler: Detect FINISHED_ABORTED status and mark for
  cleanup via finished_sending mechanism
- NixlConnectorMetadata: Add reqs_abort_done field for worker communication
- NixlConnectorWorker: Track aborted requests and report via finished_sending

This ensures:
- The scheduler-connector contract is maintained
- Connector participates in cleanup process
- Blocks are freed through the established finished_sending mechanism

Signed-off-by: Shoujian Zheng <zheng.shoujian@outlook.com>
@jianzs jianzs force-pushed the feat/free-block-for-abort-early branch from 8564936 to fa732c3 Compare March 5, 2026 03:19
@jianzs jianzs force-pushed the feat/free-block-for-abort-early branch from fa732c3 to ece5205 Compare March 5, 2026 03:21
@jianzs jianzs force-pushed the feat/free-block-for-abort-early branch from ece5205 to 753add1 Compare March 5, 2026 03:24
Comment on lines +1727 to +1732
# Notify connector to participate in cleanup. Blocks will be
# freed when connector reports finished_sending.
# A finished request can only exist in self.requests when
# connector delays block freeing (P/D scenario).
assert self.connector is not None
self._connector_finished(request)
Collaborator


I think we need to remove this.
Connectors may assume that request_finished is called only once.
See #33377 for example.

Instead, NixlConnector will have to poll each of its "being sent" requests to see if their status was changed.
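
A minimal sketch of that polling idea, assuming a hypothetical PollingConnector that raises if request_finished is ever called twice for the same request. The class, its poll_aborted method, and the string status are illustrative only and are not part of NixlConnector.

```python
from types import SimpleNamespace


class PollingConnector:
    """Hypothetical connector that expects one request_finished call per request."""

    def __init__(self) -> None:
        self._being_sent: dict[str, SimpleNamespace] = {}
        self._seen: set[str] = set()

    def request_finished(self, request: SimpleNamespace) -> bool:
        if request.request_id in self._seen:
            raise RuntimeError("request_finished called twice")
        self._seen.add(request.request_id)
        self._being_sent[request.request_id] = request
        return True  # stands in for delay_free_blocks=True

    def poll_aborted(self) -> set[str]:
        # Look at each "being sent" request and pick up status changes
        # made by the scheduler after the request finished.
        aborted = {
            rid
            for rid, req in self._being_sent.items()
            if req.status == "FINISHED_ABORTED"
        }
        for rid in aborted:
            del self._being_sent[rid]
        return aborted


conn = PollingConnector()
req = SimpleNamespace(request_id="r1", status="FINISHED_STOPPED")
conn.request_finished(req)
req.status = "FINISHED_ABORTED"  # the scheduler only flips the status
aborted_ids = conn.poll_aborted()
```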
