[NIXL] Fix KeyError on abort-after-finished#26351
[NIXL] Fix KeyError on abort-after-finished#26351markmc wants to merge 2 commits intovllm-project:mainfrom
Conversation
We have observed a rare scenario with AsyncLLM where a client disconnect triggers an abort request after the request has finished, but before AsyncLLM has processed the request output. See vllm-project#26012, vllm-project#25067, vllm-project#25844, and llm-d/llm-d#187. Without the fix, the unit test fails with: ``` logger.warning( "Releasing expired KV blocks for request %s which were " "retrieved by %d decode worker(s) within %d seconds.", req_id, count, envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT, ) > self._reqs_to_process.remove(req_id) E KeyError: '0' vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1238: KeyError ``` Signed-off-by: Mark McLoughlin <markmc@redhat.com>
There was a problem hiding this comment.
Code Review
I've reviewed the changes and the fix for the KeyError in nixl_connector.py is correct and well-targeted. The race condition is properly handled by ensuring that when a request is marked as not to be processed, it's also removed from the _reqs_to_send dictionary, which prevents the timeout logic from attempting to access a stale entry.
The new unit test, test_abort_after_finish_on_prefiller, effectively reproduces the scenario described and serves as a solid regression test. It's great to see it's parameterized to run with and without Ray.
Overall, this is a high-quality contribution that addresses a tricky race condition. Well done!
It turns out that we're not shutting down the handshake listener thread during tests, so adding a second test was causing a hang because the port is in use. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Head branch was pushed to by a user without write access
|
@NickLucche thanks. Note the follow up commit where I had to add handshake thread shutdown logic |
Yep, that's correct. I've added an assertion in #25067 to document the assumption that a timer will never be set before an abort is received. Closing this |
We have observed a rare scenario with AsyncLLM where a client disconnect triggers an abort after the request has finished, but before AsyncLLM has processed the request output.
See #26012, #25067, #25844, and llm-d/llm-d#187.
Without the fix, the unit test fails with: