Skip to content

[NIXL] Fix KeyError on abort-after-finished#26351

Closed
markmc wants to merge 2 commits intovllm-project:mainfrom
markmc:nixl-abort-after-finish
Closed

[NIXL] Fix KeyError on abort-after-finished#26351
markmc wants to merge 2 commits intovllm-project:mainfrom
markmc:nixl-abort-after-finish

Conversation

@markmc
Copy link
Member

@markmc markmc commented Oct 7, 2025

We have observed a rare scenario with AsyncLLM where a client disconnect triggers an abort after the request has finished, but before AsyncLLM has processed the request output.

See #26012, #25067, #25844, and llm-d/llm-d#187.

Without the fix, the unit test fails with:

            logger.warning(
                "Releasing expired KV blocks for request %s which were "
                "retrieved by %d decode worker(s) within %d seconds.",
                req_id,
                count,
                envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT,
            )
>           self._reqs_to_process.remove(req_id)
E           KeyError: '0'

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1238: KeyError

We have observed a rare scenario with AsyncLLM where a client disconnect
triggers an abort request after the request has finished, but before
AsyncLLM has processed the request output.

See vllm-project#26012, vllm-project#25067, vllm-project#25844, and llm-d/llm-d#187.

Without the fix, the unit test fails with:

```
            logger.warning(
                "Releasing expired KV blocks for request %s which were "
                "retrieved by %d decode worker(s) within %d seconds.",
                req_id,
                count,
                envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT,
            )
>           self._reqs_to_process.remove(req_id)
E           KeyError: '0'

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1238: KeyError
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

I've reviewed the changes and the fix for the KeyError in nixl_connector.py is correct and well-targeted. The race condition is properly handled by ensuring that when a request is marked as not to be processed, it's also removed from the _reqs_to_send dictionary, which prevents the timeout logic from attempting to access a stale entry.

The new unit test, test_abort_after_finish_on_prefiller, effectively reproduces the scenario described and serves as a solid regression test. It's great to see it's parameterized to run with and without Ray.

Overall, this is a high-quality contribution that addresses a tricky race condition. Well done!

Copy link
Collaborator

@NickLucche NickLucche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @markmc !

@NickLucche NickLucche enabled auto-merge (squash) October 7, 2025 13:01
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 7, 2025
It turns out that we're not shutting down the handshake
listener thread during tests, so adding a second test was
causing a hang because the port is in use.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
auto-merge was automatically disabled October 7, 2025 14:22

Head branch was pushed to by a user without write access

@markmc
Copy link
Member Author

markmc commented Oct 7, 2025

@NickLucche thanks. Note the follow up commit where I had to add handshake thread shutdown logic

@njhill
Copy link
Member

njhill commented Oct 8, 2025

@markmc I was wondering whether this error still exists if we no-op when aborting when the request is already in a finished state, as we are discussing in #25067?

@markmc
Copy link
Member Author

markmc commented Oct 8, 2025

@markmc I was wondering whether this error still exists if we no-op when aborting when the request is already in a finished state, as we are discussing in #25067?

Yep, that's correct. I've added an assertion in #25067 to document the assumption that a timer will never be set before an abort is received. Closing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants