[NIXL] Ignore abort on already-finished request#25067
njhill merged 3 commits into vllm-project:main from
Conversation
Related to llm-d/llm-d#187
Code Review
This pull request effectively addresses a critical race condition that could lead to a KeyError when a request is aborted during a KV cache transfer. The core change in vllm/v1/core/sched/scheduler.py to skip aborting already-finished requests is a clean and correct solution. The addition of the test_abort_during_kv_transfer unit test is excellent, as it specifically validates the fix for the identified scenario. The other changes, including test assertions and logging improvements, further enhance the robustness and debuggability of the codebase. Overall, this is a high-quality contribution that improves the stability of the system under heavy load.
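A minimal sketch of that skip-on-abort guard (hypothetical names mirroring the scheduler API; this is an illustration, not the actual vllm change):

```python
import types

# Hypothetical minimal Scheduler with the guard described above. Names
# like `requests` and `finish_requests` mirror vllm's Scheduler, but
# this is illustrative only.
class Scheduler:
    def __init__(self):
        self.requests = {}  # req_id -> request; entries removed on finish

    def finish_requests(self, req_ids, finished_status):
        for req_id in req_ids:
            request = self.requests.get(req_id)
            if request is None:
                # Request already finished (e.g. prefill completed and its
                # blocks were handed to the connector): ignore the late abort.
                continue
            request.status = finished_status
            del self.requests[req_id]

scheduler = Scheduler()
scheduler.requests["live"] = types.SimpleNamespace(status=None)
# Aborting a mix of live and already-finished requests no longer raises.
scheduler.finish_requests(["live", "already-gone"], "FINISHED_ABORTED")
```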
NickLucche
left a comment
Not sure how this wouldn't affect regular abort workflow..
Let's quickly discuss offline with @njhill
Based on chat with @NickLucche there's a very obvious question I need to get to the bottom of:
Closing this for now - we really don't know what's going on here, my theory above doesn't make much sense, and we haven't been able to reproduce it lately.
A reminder, we're talking about this scenario: After the scheduler (engine core) finishes a request after prefill (due to max_tokens=1), the blocks are kept around for the decode worker, and when (much later) the timeout expires, the request has already been freed in the scheduler (or, at least, removed from

I've managed to reproduce this, albeit quite artificially - but I think it's clear there is a race condition here, and it's the only reasonable possible way I see of hitting this situation. The race condition is something like this (GPT-5's effort to illustrate it)

Why do I think this is the likely scenario? The engine core

My conclusion - we need to handle the case where an exception or client disconnect causes the engine core to receive an abort after a request has finished
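The outcome of that race can be reduced to a few lines (illustrative stand-ins for the real data structures, not vllm code):

```python
# Minimal reproduction of the race outcome: the scheduler's entry is
# gone by the time the late abort / expiry path looks the request up.
requests = {"req-0": object()}   # stand-in for scheduler.requests

# 1. Prefill finishes (max_tokens=1): the scheduler drops its entry,
#    while the KV blocks stay alive for the remote decode worker.
del requests["req-0"]

# 2. Much later, the abort (or the NIXL expiry timer) fires and tries
#    to look the request up again via the scheduler's map.
caught = None
try:
    requests["req-0"]
except KeyError as exc:
    caught = exc

print(type(caught).__name__)  # KeyError, matching the reported traceback
```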
See #26012 (comment) - I think #26012 has introduced another
be65ca4 to 887f18d
vllm/v1/core/sched/scheduler.py
```python
logger.debug("Aborting a previously finished request %s", req_id)
request.status = finished_status
self._connector_finished(request)
self._free_blocks(request)
```
I actually think it may be better here to just do nothing (i.e. just continue / skip this req). The expiration will handle block freeing once the timeout triggers. We could free them immediately, but that would require keeping track of another possible state (or else we would need to ignore double-frees). And this is only an edge case anyhow.

I think for sure we shouldn't call `_connector_finished` here since that would have already been called (the reason for the race condition in the first place).
I actually think it may be better here to just do nothing (i.e. just continue / skip this req).
Yeah, that's what I had this PR do in its first iteration, but that was before I understood that the decode worker could not continue in this situation, so we know we can free
The expiration will handle blocks freeing once the timeout triggers.
I really don't love the prospect of a rare scenario whereby potentially many client-aborted requests stranded blocks for up to 8 minutes. You might kill a benchmark run, restart it, and get unpredictable results?
We could free them immediately but that would require keeping track of another possible state (or else we would need to ignore double-frees).
We already guard against the (impossible) possibility of another abort by checking whether the request is in scheduler.requests
The "but the request is already freed" check in _update_from_kv_xfer_finished() is an example of us ignoring a double free already - I actually think we could make that unnecessary, but it's fine as defensive programming
Not sure I can see another issue?
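That defensive double-free guard might look like this sketch (hypothetical shape; the real `_update_from_kv_xfer_finished()` lives in vllm's scheduler):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative Scheduler that tolerates a second "free" notification
# instead of raising KeyError (assumed names, not vllm's exact code).
class Scheduler:
    def __init__(self):
        self.requests = {}   # req_id -> request
        self.freed = []      # record of freed requests, for illustration

    def _free_blocks(self, request):
        self.freed.append(request)

    def _update_from_kv_xfer_finished(self, req_id):
        request = self.requests.pop(req_id, None)
        if request is None:
            # Already freed elsewhere: ignore the double notification.
            logger.warning(
                "KV xfer finished for already-freed request %s", req_id)
            return
        self._free_blocks(request)

scheduler = Scheduler()
scheduler.requests["a"] = "request-a"
scheduler._update_from_kv_xfer_finished("a")   # frees normally
scheduler._update_from_kv_xfer_finished("a")   # ignored, no KeyError
```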
And this is only an edge case anyhow.
We do need to do something to handle this edge case though - personally I think handling the abort by freeing is likely to be less brittle than just ignoring the abort and leaving it for the expiry timer to handle
I think for sure we shouldn't call `_connector_finished` here since that would have already been called (the reason for the race condition in the first place).
If the request is aborted, and you free the request, then the connector should also delete the timer - calling request_finished() is what tells the connector to do that, e.g. in NIXL:
```python
if request.status != RequestStatus.FINISHED_LENGTH_CAPPED:
    # Also include the case of a P/D Prefill request with immediate
    # block free (eg abort). Stop tracking this request.
    self._reqs_not_processed.add(request.request_id)
    return False, None
```
the decode worker could not continue in this situation, so we know we can free
This is true in the OpenAI API server + NixlConnector P/D case but not in the general case. If folks are using the AsyncLLM interface directly they can call abort "out of band", but then would still get the output for the request with any kv transfer params that the connector had returned.
For connectors in general, the contract is that if request_finished() returned async_save=True then the connector may be using/saving the blocks asynchronously until it notifies the framework that it's finished with them. So I'm not sure it would be "safe" to free them here since in a sense the connector owns the blocks at this point.
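The ownership handshake described here might be sketched like this (hypothetical names; the real contract is defined by vllm's connector base class):

```python
# Toy connector that "saves" blocks asynchronously: async_save=True
# means it takes ownership of the request's KV blocks.
class StubConnector:
    def request_finished(self, req_id):
        return True, None  # (async_save, kv_xfer_params)

# Framework side: blocks owned by the connector are not freed until
# the connector notifies us it is done with them.
class Framework:
    def __init__(self, connector):
        self.connector = connector
        self.owned_by_connector = set()
        self.freed = []

    def on_request_finished(self, req_id):
        async_save, _params = self.connector.request_finished(req_id)
        if async_save:
            self.owned_by_connector.add(req_id)  # connector owns blocks
        else:
            self.freed.append(req_id)

    def on_connector_done(self, req_id):
        self.owned_by_connector.discard(req_id)
        self.freed.append(req_id)

fw = Framework(StubConnector())
fw.on_request_finished("req-0")
assert fw.freed == []            # blocks still owned by the connector
fw.on_connector_done("req-0")
assert fw.freed == ["req-0"]     # freed only after the notification
```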
I really don't love the prospect of a rare scenario whereby potentially many client-aborted requests stranded blocks for up to 8 minutes. You might kill a benchmark run, restart it, and get unpredictable results?
This is fair! But in theory this should be a very narrow window. Given this, and looking closer at the flow, I think the reason we've encountered it is likely that in the core model loop, abort (and other new) requests aren't processed between the model execution / forward pass and the call to scheduler.update_from_outputs() which handles the request completion including notifying the scheduler.
We should probably look at changing it to process pending requests in-between so that in this case, the request status will have already been updated to ABORTED when it's passed to `request_finished()` (and then the nixl connector will return `async_save=False` so the blocks will be freed immediately).
If the request is aborted, and you free the request, then the connector should also delete the timer - calling `request_finished()` is what tells the connector to do that
My assumption of the contract of this method is that it should be called exactly once for each request.
the decode worker could not continue in this situation, so we know we can free
This is true in the OpenAI API server + NixlConnector P/D case but not in the general case. If folks are using the `AsyncLLM` interface directly they can call abort "out of band", but then would still get the output for the request with any kv transfer params that the connector had returned.
No, because any finished request gets removed from the output processor before it's returned, and then it's filtered out of any call to abort()
For connectors in general, the contract is that if `request_finished()` returned `async_save=True` then the connector may be using/saving the blocks asynchronously until it notifies the framework that it's finished with them. So I'm not sure it would be "safe" to free them here since in a sense the connector owns the blocks at this point.
I appreciate you're focused on a clear API contract with connectors, especially since we allow for out-of-repo connectors
I really don't love the prospect of a rare scenario whereby potentially many client-aborted requests stranded blocks for up to 8 minutes. You might kill a benchmark run, restart it, and get unpredictable results?
This is fair! But in theory this should be a very narrow window. Given this, and looking closer at the flow, I think the reason we've encountered it is likely that in the core model loop, abort (and other new) requests aren't processed between the model execution / forward pass and the call to `scheduler.update_from_outputs()` which handles the request completion including notifying the scheduler.

We should probably look at changing it to process pending requests in-between so that in this case, the request status will have already been updated to ABORTED when it's passed to `request_finished()` (and then the nixl connector will return `async_save=False` so the blocks will be freed immediately).

If the request is aborted, and you free the request, then the connector should also delete the timer - calling `request_finished()` is what tells the connector to do that

My assumption of the contract of this method is that it should be called exactly once for each request.
Ok, I'll change back to ignoring the abort and update the connector API contract documentation to say this outright
We should probably look at changing it to process pending requests in-between so that in this case
Filed as #26400
We have observed a rare scenario with AsyncLLM where a client disconnect triggers an abort request after the request has finished, but before AsyncLLM has processed the request output. See vllm-project#26012, vllm-project#25067, vllm-project#25844, and llm-d/llm-d#187.

Without the fix, the unit test fails with:

```
logger.warning(
    "Releasing expired KV blocks for request %s which were "
    "retrieved by %d decode worker(s) within %d seconds.",
    req_id, count, envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT,
)
>           self._reqs_to_process.remove(req_id)
E           KeyError: '0'

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1238: KeyError
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
This situation can occur when the API server receives a client
disconnect (and thus sends an abort) around the same time a prefill
completes and we keep the blocks (delay_free_blocks) around for a
remote decode. We should assume the blocks may be used, and so
we ignore the abort. If they are not used, they should be freed
by the connector after a timeout.
The original error was:
```
[scheduler.py:1183] Finished sending KV transfer for request cmpl-37c560d3-5680-4bd1-97f9-7ed31a56de60-0
  File "/opt/vllm-source/vllm/v1/engine/core.py", line 292, in step
    engine_core_outputs = self.scheduler.update_from_output(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm-source/vllm/v1/core/sched/scheduler.py", line 893, in update_from_output
    self._update_from_kv_xfer_finished(
  File "/opt/vllm-source/vllm/v1/core/sched/scheduler.py", line 1184, in _update_from_kv_xfer_finish>
    self._free_blocks(self.requests[req_id])
                      ~~~~~~~~~~~~~^^^^^^^^
KeyError: 'cmpl-37c560d3-5680-4bd1-97f9-7ed31a56de60-0'
```
But since vllm-project#25844 we would log a warning. This fix makes it so
that situation in `_update_from_kv_xfer_finish()` should never
occur.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
1. request_finished() should be called exactly once
2. Returning True from request_finished() means the connector assumes responsibility for when the request should be freed

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
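Those two points could be captured in a docstring along these lines (assumed wording; vllm's actual connector base class may phrase its contract differently):

```python
def request_finished(request, block_ids):
    """Notify the connector that a request has finished.

    Contract (as documented in this PR):
      1. This is called exactly once per request.
      2. If it returns True (async_save), the connector assumes
         responsibility for deciding when the request's blocks are
         freed; the scheduler must not free them in the meantime.
    """
    # Placeholder body for illustration only.
    return False, None
```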
Now that we protect against abort-after-finished, the current assumptions in the NIXL connector are correct, but an assertion helps document the assumption more clearly.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
NickLucche
left a comment
Clean, thanks for the fine work @markmc
```python
NUM_EXTERNAL_FULL_BLOCKS = 2
NUM_TOKENS = int(BLOCK_SIZE * (NUM_EXTERNAL_FULL_BLOCKS + 0.5))
```
nit: not really important for this test, we could simplify
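For concreteness, with an assumed BLOCK_SIZE of 16 (the test defines its own value), the constants work out to a prompt spanning two full blocks plus a half-filled third block:

```python
BLOCK_SIZE = 16  # assumed value for illustration only
NUM_EXTERNAL_FULL_BLOCKS = 2
NUM_TOKENS = int(BLOCK_SIZE * (NUM_EXTERNAL_FULL_BLOCKS + 0.5))
print(NUM_TOKENS)  # 40
```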
This is an enhancement to PR vllm-project#25067 which ignored aborts on finished requests and relied on timeout-based cleanup. Instead of waiting for the connector timeout to free blocks, immediately free them when receiving FINISHED_ABORTED for an already-finished request. This enables earlier KV cache memory reclamation, which is especially important under heavy load in multi-node scenarios where memory pressure is high. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
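A sketch of that immediate-free path (hypothetical names, illustrative of the described enhancement rather than the actual patch):

```python
# Toy scheduler distinguishing live requests from finished requests
# whose blocks are still held for a remote decode worker.
class Scheduler:
    def __init__(self):
        self.requests = {}            # live requests
        self.pending_transfer = set() # finished, blocks held for decode
        self.freed = []

    def _free_blocks(self, req_id):
        self.freed.append(req_id)

    def abort(self, req_id):
        if req_id in self.requests:
            del self.requests[req_id]
            self._free_blocks(req_id)
        elif req_id in self.pending_transfer:
            # Already finished, but blocks still held for the decode
            # worker: reclaim them now instead of waiting for the
            # connector timeout.
            self.pending_transfer.discard(req_id)
            self._free_blocks(req_id)

scheduler = Scheduler()
scheduler.pending_transfer.add("req-0")
scheduler.abort("req-0")
print(scheduler.freed)  # ['req-0']
```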
This situation can occur when the API server receives a client disconnect (and thus sends an abort) around the same time a prefill completes and we keep the blocks (delay_free_blocks) around for a remote decode. We should assume the blocks may be used, and so we ignore the abort. If they are not used, they should be freed by the connector after a timeout.
The original error was:
But since #25844 we would log a warning. This fix makes it so that situation in `_update_from_kv_xfer_finish()` should never occur.

Observed under heavy load in a multi-node, llm-d 4P1D test environment. See llm-d/llm-d#187
More recently, #26012 introduced another case where this situation would cause a crash: