[NIXL] Fix KeyError on abort-after-finished by markmc · Pull Request #26351 · vllm-project/vllm

markmc · 2025-10-07T11:26:15Z

We have observed a rare scenario with AsyncLLM where a client disconnect triggers an abort after the request has finished, but before AsyncLLM has processed the request output.

See #26012, #25067, #25844, and llm-d/llm-d#187.

Without the fix, the unit test fails with:

            logger.warning(
                "Releasing expired KV blocks for request %s which were "
                "retrieved by %d decode worker(s) within %d seconds.",
                req_id,
                count,
                envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT,
            )
>           self._reqs_to_process.remove(req_id)
E           KeyError: '0'

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1238: KeyError
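
The sequence that leads to this trace can be modelled in a few lines. This is a minimal sketch, not the connector's code: the `ToyTracker` class and its method names are hypothetical, and it assumes (as the trace and the review comment suggest) that `_reqs_to_process` is a set of request IDs and `_reqs_to_send` is a dictionary mapping request IDs to expiry deadlines.

```python
class ToyTracker:
    """Hypothetical stand-in for the connector's request bookkeeping."""

    def __init__(self) -> None:
        self.reqs_to_process: set[str] = set()
        self.reqs_to_send: dict[str, float] = {}  # req_id -> expiry deadline

    def on_finished(self, req_id: str, deadline: float) -> None:
        # The request finished: track it and arm its expiry timer.
        self.reqs_to_process.add(req_id)
        self.reqs_to_send[req_id] = deadline

    def on_late_abort(self, req_id: str) -> None:
        # Abort arrives after finish: the request is forgotten,
        # but its timer entry is left behind (the suspected bug).
        self.reqs_to_process.discard(req_id)

    def expire(self, now: float) -> list[str]:
        expired = [r for r, d in self.reqs_to_send.items() if d <= now]
        for r in expired:
            del self.reqs_to_send[r]
            # set.remove raises KeyError when the element is absent.
            self.reqs_to_process.remove(r)
        return expired
```

Calling `on_finished("0", 1.0)`, then `on_late_abort("0")`, then `expire(2.0)` raises `KeyError: '0'`, matching the shape of the failure above.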

gemini-code-assist

Code Review

I've reviewed the changes and the fix for the KeyError in nixl_connector.py is correct and well-targeted. The race condition is properly handled by ensuring that when a request is marked as not to be processed, it's also removed from the _reqs_to_send dictionary, which prevents the timeout logic from attempting to access a stale entry.

The new unit test, test_abort_after_finish_on_prefiller, effectively reproduces the scenario described and serves as a solid regression test. It's great to see it's parameterized to run with and without Ray.

Overall, this is a high-quality contribution that addresses a tricky race condition. Well done!
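
The fix as the review describes it can be sketched as follows. The function names `mark_not_processed` and `release_expired` are hypothetical, and the two containers are passed in explicitly for illustration; the actual change in `nixl_connector.py` is not reproduced here.

```python
def mark_not_processed(
    req_id: str,
    reqs_to_process: set[str],
    reqs_to_send: dict[str, float],
) -> None:
    """Forget a request and drop any expiry timer already armed for it."""
    reqs_to_process.discard(req_id)
    # Removing the timer entry as well means the timeout logic
    # never sees a stale request ID.
    reqs_to_send.pop(req_id, None)


def release_expired(
    now: float,
    reqs_to_process: set[str],
    reqs_to_send: dict[str, float],
) -> list[str]:
    """Release every request whose deadline has passed."""
    expired = [r for r, deadline in reqs_to_send.items() if deadline <= now]
    for r in expired:
        del reqs_to_send[r]
        reqs_to_process.remove(r)  # safe: both containers stay in sync
    return expired
```

With both containers updated together, a late abort followed by a timeout sweep finds nothing to release for that request and raises nothing.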

NickLucche

Thank you @markmc !

It turns out that we're not shutting down the handshake listener thread during tests, so adding a second test was causing a hang because the port is in use. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
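
The hang described here is the usual consequence of a listener thread that outlives its test: the next test cannot bind the same port. Below is a minimal sketch of an explicit shutdown path. It uses a plain TCP socket from the standard library purely for illustration; the `Listener` class is hypothetical and the connector's real handshake transport is not shown in this PR.

```python
import socket
import threading


class Listener:
    """Hypothetical listener thread with an explicit shutdown method."""

    def __init__(self, host: str = "127.0.0.1", port: int = 0) -> None:
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self._sock.bind((host, port))
        self._sock.listen()
        # Wake up periodically so the stop flag is noticed.
        self._sock.settimeout(0.1)
        self.port = self._sock.getsockname()[1]
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self) -> None:
        while not self._stop.is_set():
            try:
                conn, _ = self._sock.accept()
            except socket.timeout:
                continue
            conn.close()

    def shutdown(self) -> None:
        # Stop the loop, wait for the thread, then free the port.
        self._stop.set()
        self._thread.join()
        self._sock.close()
```

A test fixture would call `shutdown()` in its teardown so that a second test can bind the same port.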

markmc · 2025-10-07T14:23:19Z

@NickLucche thanks. Note the follow-up commit, where I had to add handshake thread shutdown logic.

njhill · 2025-10-08T04:45:44Z

@markmc I was wondering whether this error still exists if we no-op when aborting when the request is already in a finished state, as we are discussing in #25067?

markmc · 2025-10-08T09:11:46Z

> @markmc I was wondering whether this error still exists if we no-op when aborting when the request is already in a finished state, as we are discussing in #25067?

Yep, that's correct. I've added an assertion in #25067 to document the assumption that a timer will never be set before an abort is received. Closing this.
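
The alternative discussed here can be sketched as an abort handler that ignores requests already in a finished state and asserts that no timer exists when an abort does take effect. Everything below is an assumption made for illustration: the `Status` enum, the `abort` function, and its arguments are hypothetical and do not reflect the code in #25067.

```python
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    FINISHED = auto()


def abort(
    req_id: str,
    status: dict[str, Status],
    reqs_to_process: set[str],
    reqs_to_send: dict[str, float],
) -> bool:
    """Return True if the abort took effect, False if it was ignored."""
    if status.get(req_id) is Status.FINISHED:
        # No-op: the request already finished, so its timer stays valid.
        return False
    # Documents the assumption that a timer is never armed
    # before the abort for that request is received.
    assert req_id not in reqs_to_send, "timer set before abort"
    reqs_to_process.discard(req_id)
    status.pop(req_id, None)
    return True
```

Under this approach the timeout path never encounters a request that an abort has already removed, so the `KeyError` cannot occur.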

markmc requested review from ApostaC and NickLucche as code owners October 7, 2025 11:26

mergify bot added v1 kv-connector labels Oct 7, 2025

markmc mentioned this pull request Oct 7, 2025

[Bugfix] Fix _reqs_to_process leak on abort #26012

Merged

markmc requested review from NickLucche and njhill and removed request for ApostaC and NickLucche October 7, 2025 11:27

gemini-code-assist bot reviewed Oct 7, 2025

NickLucche approved these changes Oct 7, 2025

NickLucche enabled auto-merge (squash) October 7, 2025 13:01

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 7, 2025

[NIXL] Shutdown handshake listener after tests

c02893f

auto-merge was automatically disabled October 7, 2025 14:22
Head branch was pushed to by a user without write access

markmc closed this Oct 8, 2025

markmc mentioned this pull request Oct 8, 2025

[NIXL] Terminate handshake listener thread in shutdown #26404

Merged

markmc wants to merge 2 commits into vllm-project:main from markmc:nixl-abort-after-finish
