[NIXL] Ignore abort on already-finished request#25067
njhill merged 3 commits into vllm-project:main from
Conversation
Related to llm-d/llm-d#187
Code Review
This pull request effectively addresses a critical race condition that could lead to a KeyError when a request is aborted during a KV cache transfer. The core change in vllm/v1/core/sched/scheduler.py to skip aborting already-finished requests is a clean and correct solution. The addition of the test_abort_during_kv_transfer unit test is excellent, as it specifically validates the fix for the identified scenario. The other changes, including test assertions and logging improvements, further enhance the robustness and debuggability of the codebase. Overall, this is a high-quality contribution that improves the stability of the system under heavy load.
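A minimal sketch of that skip-on-abort guard (hypothetical names mirroring the scheduler API; this is an illustration, not the actual vllm change):

```python
import types

# Hypothetical minimal Scheduler with the guard described above. Names
# like `requests` and `finish_requests` mirror vllm's Scheduler, but
# this is illustrative only.
class Scheduler:
    def __init__(self):
        self.requests = {}  # req_id -> request; entries removed on finish

    def finish_requests(self, req_ids, finished_status):
        for req_id in req_ids:
            request = self.requests.get(req_id)
            if request is None:
                # Request already finished (e.g. prefill completed and its
                # blocks were handed to the connector): ignore the late abort.
                continue
            request.status = finished_status
            del self.requests[req_id]

scheduler = Scheduler()
scheduler.requests["live"] = types.SimpleNamespace(status=None)
# Aborting a mix of live and already-finished requests no longer raises.
scheduler.finish_requests(["live", "already-gone"], "FINISHED_ABORTED")
```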
NickLucche
left a comment
Not sure how this wouldn't affect regular abort workflow..
Let's quickly discuss offline with @njhill
Based on chat with @NickLucche there's a very obvious question I need to get to the bottom of:
Closing this for now - we really don't know what's going on here, my theory above doesn't make much sense, and we haven't been able to reproduce it lately.
A reminder, we're talking about this scenario: After the scheduler (engine core) finishes a request after prefill (due to max_tokens=1), the blocks are kept around for the decode worker, and when (much later) the timeout expires, the request has already been freed in the scheduler (or, at least, removed from

I've managed to reproduce this, albeit quite artificially - but I think it's clear there is a race condition here, and it's the only reasonable possible way I see of hitting this situation. The race condition is something like this (GPT-5's effort to illustrate it)

Why do I think this is the likely scenario? The engine core

My conclusion - we need to handle the case where an exception or client disconnect causes the engine core to receive an abort after a request has finished
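The outcome of that race can be reduced to a few lines (illustrative stand-ins for the real data structures, not vllm code):

```python
# Minimal reproduction of the race outcome: the scheduler's entry is
# gone by the time the late abort / expiry path looks the request up.
requests = {"req-0": object()}   # stand-in for scheduler.requests

# 1. Prefill finishes (max_tokens=1): the scheduler drops its entry,
#    while the KV blocks stay alive for the remote decode worker.
del requests["req-0"]

# 2. Much later, the abort (or the NIXL expiry timer) fires and tries
#    to look the request up again via the scheduler's map.
caught = None
try:
    requests["req-0"]
except KeyError as exc:
    caught = exc

print(type(caught).__name__)  # KeyError, matching the reported traceback
```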
See #26012 (comment) - I think #26012 has introduced another
be65ca4 to 887f18d
vllm/v1/core/sched/scheduler.py
```python
logger.debug("Aborting a previously finished request %s", req_id)
request.status = finished_status
self._connector_finished(request)
self._free_blocks(request)
```
I actually think it may be better here to just do nothing (i.e. just continue / skip this req). The expiration will handle block freeing once the timeout triggers. We could free them immediately, but that would require keeping track of another possible state (or else we would need to ignore double-frees). And this is only an edge case anyhow.

I think for sure we shouldn't call `_connector_finished` here since that would have already been called (the reason for the race condition in the first place).
I actually think it may be better here to just do nothing (i.e. just continue / skip this req).
Yeah, that's what I had this PR do in its first iteration, but that was before I understood that the decode worker could not continue in this situation, so we know we can free
The expiration will handle blocks freeing once the timeout triggers.
I really don't love the prospect of a rare scenario whereby potentially many client-aborted requests stranded blocks for up to 8 minutes. You might kill a benchmark run, restart it, and get unpredictable results?
We could free them immediately but that would require keeping track of another possible state (or else we would need to ignore double-frees).
We already guard against the (impossible) possibility of another abort by checking whether the request is in scheduler.requests
The "but the request is already freed" check in _update_from_kv_xfer_finished() is an example of us ignoring a double free already - I actually think we could make that unnecessary, but it's fine as defensive programming
Not sure I can see another issue?
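That defensive double-free guard might look like this sketch (hypothetical shape; the real `_update_from_kv_xfer_finished()` lives in vllm's scheduler):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative Scheduler that tolerates a second "free" notification
# instead of raising KeyError (assumed names, not vllm's exact code).
class Scheduler:
    def __init__(self):
        self.requests = {}   # req_id -> request
        self.freed = []      # record of freed requests, for illustration

    def _free_blocks(self, request):
        self.freed.append(request)

    def _update_from_kv_xfer_finished(self, req_id):
        request = self.requests.pop(req_id, None)
        if request is None:
            # Already freed elsewhere: ignore the double notification.
            logger.warning(
                "KV xfer finished for already-freed request %s", req_id)
            return
        self._free_blocks(request)

scheduler = Scheduler()
scheduler.requests["a"] = "request-a"
scheduler._update_from_kv_xfer_finished("a")   # frees normally
scheduler._update_from_kv_xfer_finished("a")   # ignored, no KeyError
```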
And this is only an edge case anyhow.
We do need to do something to handle this edge case though - personally I think handling the abort by freeing is likely to be less brittle than just ignoring the abort and leaving it for the expiry timer to handle
I think for sure we shouldn't call `_connector_finished` here since that would have already been called (the reason for the race condition in the first place).
If the request is aborted, and you free the request, then the connector should also delete the timer - calling request_finished() is what tells the connector to do that, e.g. in NIXL:
```python
if request.status != RequestStatus.FINISHED_LENGTH_CAPPED:
    # Also include the case of a P/D Prefill request with immediate
    # block free (eg abort). Stop tracking this request.
    self._reqs_not_processed.add(request.request_id)
    return False, None
```
the decode worker could not continue in this situation, so we know we can free
This is true in the OpenAI API server + NixlConnector P/D case but not in the general case. If folks are using the AsyncLLM interface directly they can call abort "out of band", but then would still get the output for the request with any kv transfer params that the connector had returned.
For connectors in general, the contract is that if request_finished() returned async_save=True then the connector may be using/saving the blocks asynchronously until it notifies the framework that it's finished with them. So I'm not sure it would be "safe" to free them here since in a sense the connector owns the blocks at this point.
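The ownership handshake described here might be sketched like this (hypothetical names; the real contract is defined by vllm's connector base class):

```python
# Toy connector that "saves" blocks asynchronously: async_save=True
# means it takes ownership of the request's KV blocks.
class StubConnector:
    def request_finished(self, req_id):
        return True, None  # (async_save, kv_xfer_params)

# Framework side: blocks owned by the connector are not freed until
# the connector notifies us it is done with them.
class Framework:
    def __init__(self, connector):
        self.connector = connector
        self.owned_by_connector = set()
        self.freed = []

    def on_request_finished(self, req_id):
        async_save, _params = self.connector.request_finished(req_id)
        if async_save:
            self.owned_by_connector.add(req_id)  # connector owns blocks
        else:
            self.freed.append(req_id)

    def on_connector_done(self, req_id):
        self.owned_by_connector.discard(req_id)
        self.freed.append(req_id)

fw = Framework(StubConnector())
fw.on_request_finished("req-0")
assert fw.freed == []            # blocks still owned by the connector
fw.on_connector_done("req-0")
assert fw.freed == ["req-0"]     # freed only after the notification
```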
I really don't love the prospect of a rare scenario whereby potentially many client-aborted requests stranded blocks for up to 8 minutes. You might kill a benchmark run, restart it, and get unpredictable results?
This is fair! But in theory this should be a very narrow window. Given this, and looking closer at the flow, I think the reason we've encountered it is likely that in the core model loop, abort (and other new) requests aren't processed between the model execution / forward pass and the call to scheduler.update_from_outputs() which handles the request completion including notifying the scheduler.
We should probably look at changing it to process pending requests in-between so that in this case, the request status will have already been updated to ABORTED when it's passed to `request_finished()` (and then the nixl connector will return `async_save=False` so the blocks will be freed immediately).
If the request is aborted, and you free the request, then the connector should also delete the timer - calling `request_finished()` is what tells the connector to do that
My assumption of the contract of this method is that it should be called exactly once for each request.
the decode worker could not continue in this situation, so we know we can free
This is true in the OpenAI API server + NixlConnector P/D case but not in the general case. If folks are using the `AsyncLLM` interface directly they can call abort "out of band", but then would still get the output for the request with any kv transfer params that the connector had returned.
No, because any finished request gets removed from the output processor before it's returned, and then it's filtered out of any call to abort()
For connectors in general, the contract is that if `request_finished()` returned `async_save=True` then the connector may be using/saving the blocks asynchronously until it notifies the framework that it's finished with them. So I'm not sure it would be "safe" to free them here since in a sense the connector owns the blocks at this point.
I appreciate you're focused on a clear API contract with connectors, especially since we allow for out-of-repo connectors
I really don't love the prospect of a rare scenario whereby potentially many client-aborted requests stranded blocks for up to 8 minutes. You might kill a benchmark run, restart it, and get unpredictable results?
This is fair! But in theory this should be a very narrow window. Given this, and looking closer at the flow, I think the reason we've encountered it is likely that in the core model loop, abort (and other new) requests aren't processed between the model execution / forward pass and the call to `scheduler.update_from_outputs()` which handles the request completion including notifying the scheduler.

We should probably look at changing it to process pending requests in-between so that in this case, the request status will have already been updated to ABORTED when it's passed to `request_finished()` (and then the nixl connector will return `async_save=False` so the blocks will be freed immediately).

If the request is aborted, and you free the request, then the connector should also delete the timer - calling `request_finished()` is what tells the connector to do that

My assumption of the contract of this method is that it should be called exactly once for each request.
Ok, I'll change back to ignoring the abort and update the connector API contract documentation to say this outright
We should probably look at changing it to process pending requests in-between so that in this case
Filed as #26400
We have observed a rare scenario with AsyncLLM where a client disconnect triggers an abort request after the request has finished, but before AsyncLLM has processed the request output. See vllm-project#26012, vllm-project#25067, vllm-project#25844, and llm-d/llm-d#187.

Without the fix, the unit test fails with:

```
logger.warning(
    "Releasing expired KV blocks for request %s which were "
    "retrieved by %d decode worker(s) within %d seconds.",
    req_id, count, envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT,
)
>           self._reqs_to_process.remove(req_id)
E           KeyError: '0'

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1238: KeyError
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
This situation can occur when the API server receives a client
disconnect (and thus sends an abort) around the same time a prefill
completes and we keep the blocks (delay_free_blocks) around for a
remote decode. We should assume the blocks may be used, and so
we ignore the abort. If they are not used, they should be freed
by the connector after a timeout.
The original error was:
```
[scheduler.py:1183] Finished sending KV transfer for request cmpl-37c560d3-5680-4bd1-97f9-7ed31a56de60-0
  File "/opt/vllm-source/vllm/v1/engine/core.py", line 292, in step
    engine_core_outputs = self.scheduler.update_from_output(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm-source/vllm/v1/core/sched/scheduler.py", line 893, in update_from_output
    self._update_from_kv_xfer_finished(
  File "/opt/vllm-source/vllm/v1/core/sched/scheduler.py", line 1184, in _update_from_kv_xfer_finish>
    self._free_blocks(self.requests[req_id])
                      ~~~~~~~~~~~~~^^^^^^^^
KeyError: 'cmpl-37c560d3-5680-4bd1-97f9-7ed31a56de60-0'
```
But since vllm-project#25844 we would log a warning. This fix makes it so
that situation in `_update_from_kv_xfer_finish()` should never
occur.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
1. request_finished() should be called exactly once
2. Returning True from request_finished() means the connector assumes responsibility for when the request should be freed

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
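Those two points could be captured in a docstring along these lines (assumed wording; vllm's actual connector base class may phrase its contract differently):

```python
def request_finished(request, block_ids):
    """Notify the connector that a request has finished.

    Contract (as documented in this PR):
      1. This is called exactly once per request.
      2. If it returns True (async_save), the connector assumes
         responsibility for deciding when the request's blocks are
         freed; the scheduler must not free them in the meantime.
    """
    # Placeholder body for illustration only.
    return False, None
```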
Now that we protect against abort-after-finished, the current assumptions in the NIXL connector are correct, but an assertion helps document the assumption more clearly.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
NickLucche
left a comment
Clean, thanks for the fine work @markmc
```python
NUM_EXTERNAL_FULL_BLOCKS = 2
NUM_TOKENS = int(BLOCK_SIZE * (NUM_EXTERNAL_FULL_BLOCKS + 0.5))
```
nit: not really important for this test, we could simplify
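For concreteness, with an assumed BLOCK_SIZE of 16 (the test defines its own value), the constants work out to a prompt spanning two full blocks plus a half-filled third block:

```python
BLOCK_SIZE = 16  # assumed value for illustration only
NUM_EXTERNAL_FULL_BLOCKS = 2
NUM_TOKENS = int(BLOCK_SIZE * (NUM_EXTERNAL_FULL_BLOCKS + 0.5))
print(NUM_TOKENS)  # 40
```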
This is an enhancement to PR vllm-project#25067 which ignored aborts on finished requests and relied on timeout-based cleanup. Instead of waiting for the connector timeout to free blocks, immediately free them when receiving FINISHED_ABORTED for an already-finished request. This enables earlier KV cache memory reclamation, which is especially important under heavy load in multi-node scenarios where memory pressure is high. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
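A sketch of that immediate-free path (hypothetical names, illustrative of the described enhancement rather than the actual patch):

```python
# Toy scheduler distinguishing live requests from finished requests
# whose blocks are still held for a remote decode worker.
class Scheduler:
    def __init__(self):
        self.requests = {}            # live requests
        self.pending_transfer = set() # finished, blocks held for decode
        self.freed = []

    def _free_blocks(self, req_id):
        self.freed.append(req_id)

    def abort(self, req_id):
        if req_id in self.requests:
            del self.requests[req_id]
            self._free_blocks(req_id)
        elif req_id in self.pending_transfer:
            # Already finished, but blocks still held for the decode
            # worker: reclaim them now instead of waiting for the
            # connector timeout.
            self.pending_transfer.discard(req_id)
            self._free_blocks(req_id)

scheduler = Scheduler()
scheduler.pending_transfer.add("req-0")
scheduler.abort("req-0")
print(scheduler.freed)  # ['req-0']
```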
This situation can occur when the API server receives a client disconnect (and thus sends an abort) around the same time a prefill completes and we keep the blocks (delay_free_blocks) around for a remote decode. We should assume the blocks may be used, and so we ignore the abort. If they are not used, they should be freed by the connector after a timeout.
The original error was:
But since #25844 we would log a warning. This fix makes it so that situation in `_update_from_kv_xfer_finish()` should never occur.

Observed under heavy load in a multi-node, llm-d 4P1D test environment. See llm-d/llm-d#187
More recently, #26012 introduced another case where this situation would cause a crash: