
[NIXL] Terminate handshake listener thread in shutdown#26404

Merged
NickLucche merged 1 commit into vllm-project:main from markmc:nixl-shutdown-handshake-listener-thread
Oct 22, 2025

Conversation

@markmc
Member

@markmc markmc commented Oct 8, 2025

It turns out that we're not terminating the handshake listener thread during shutdown.

I discovered this while adding a (now unneeded) abort-after-finished variant of the abort_timeout_on_prefiller test and saw a port-in-use hang.

xref #26351

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

Thank you for this contribution. The changes correctly address the issue of the handshake listener thread not being terminated during shutdown. I've identified one high-severity issue related to the use of __del__ for resource cleanup, which is an unreliable practice in Python and can introduce thread-safety problems. Please see my detailed comment below.

Comment on lines +1546 to +1547
    def __del__(self):
        self.shutdown()
Contributor

high

Using __del__ for resource cleanup is an anti-pattern in Python because its execution is not guaranteed. It may not be called if the object is part of a reference cycle, or during interpreter shutdown, which can lead to resource leaks that are hard to debug. The explicit shutdown() calls, like the one you've added in the tests, are the correct and reliable way to handle cleanup.

Furthermore, __del__ can be invoked by the garbage collector at any time and from any thread. This makes the thread-safety of shutdown() critical, but the current implementation is not thread-safe. For example:

  • One thread could be iterating over self._recving_transfers while another clears it, causing a RuntimeError.
  • self._nixl_handshake_listener_t could be set to None by one thread after another has checked it for None but before calling .join() on it, resulting in an AttributeError.

I strongly recommend removing the __del__ method and relying solely on explicit shutdown() calls. If shutdown() might be called concurrently from other paths, it should be protected with a threading.Lock.
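The locking suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Worker` class, its attribute names, and the `shutdown_count` counter are hypothetical stand-ins. The idea is to swap the thread reference out under a `threading.Lock` so that only one caller ever joins the thread, which makes `shutdown()` safe to call repeatedly or concurrently.

```python
import threading


class Worker:
    """Hypothetical stand-in for an object that owns a listener thread."""

    def __init__(self):
        self._stop_event = threading.Event()
        self._shutdown_lock = threading.Lock()
        self.shutdown_count = 0  # only here so the example can be checked
        self._listener_t = threading.Thread(target=self._listen, daemon=True)
        self._listener_t.start()

    def _listen(self):
        # Wake up periodically so that the stop event is noticed promptly.
        while not self._stop_event.wait(timeout=0.05):
            pass

    def shutdown(self):
        # Take the thread reference under the lock: exactly one caller
        # gets a non-None thread, every other caller sees None.
        with self._shutdown_lock:
            thread, self._listener_t = self._listener_t, None
        if thread is None:
            return  # already shut down, or another caller is doing it
        self._stop_event.set()
        thread.join(timeout=5)
        self.shutdown_count += 1
```

With this shape, a check-then-join race on the thread attribute cannot occur, because no caller reads the attribute and later dereferences it outside the lock.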

Member Author

Calling shutdown() from here is convenient for tests: it means resources are cleaned up even without explicit shutdown calls. Outside of tests, we should be calling shutdown() explicitly.

Another example of us doing this:

vllm/v1/engine/async_llm.py:235:    def __del__(self):
vllm/v1/engine/async_llm.py-236-        self.shutdown()

@markmc markmc force-pushed the nixl-shutdown-handshake-listener-thread branch from cad5ba5 to a44e75b Compare October 8, 2025 09:58
@mergify

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
@markmc markmc force-pushed the nixl-shutdown-handshake-listener-thread branch from a44e75b to 6336ae2 Compare October 15, 2025 10:29
@mergify mergify bot removed the needs-rebase label Oct 15, 2025
@markmc markmc added the ready (ONLY add when PR is ready to merge/full CI is needed) label Oct 16, 2025
Collaborator

@NickLucche NickLucche left a comment

LGTM, thanks @markmc, only left one comment.
Let's get CI green and merge.

Comment on lines +711 to +713
events = dict(poller.poll(timeout=poll_timeout * 1000))
if sock not in events:
    continue
Collaborator

ok so this replaces the busy waiting mechanism without hanging on recv so we can quit, nice
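For readers unfamiliar with the pattern, here is a rough sketch of a poll-with-timeout loop that can be stopped from another thread. It deliberately swaps in the standard-library `selectors` module and a plain socket in place of the ZMQ poller used in the PR, so it is an analogy rather than the actual listener code. The `listener` function, `stop_event`, and `received` names are hypothetical. Note that `selectors` takes its timeout in seconds, whereas the ZMQ poller in the snippet above takes milliseconds (hence the `* 1000`).

```python
import selectors
import socket
import threading


def listener(sock, stop_event, received, poll_timeout=0.1):
    """Read from sock until stop_event is set (hypothetical helper)."""
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    try:
        while not stop_event.is_set():
            # Block for at most poll_timeout seconds instead of hanging
            # in recv(), so the stop flag is re-checked regularly.
            if not sel.select(timeout=poll_timeout):
                continue
            data = sock.recv(1024)
            if not data:
                break  # peer closed the connection
            received.append(data)
    finally:
        sel.close()
        sock.close()  # release the socket so nothing is left holding it
```

A caller would set `stop_event` and then `join()` the thread running `listener`; the thread exits within roughly one `poll_timeout` interval.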

@markmc
Member Author

markmc commented Oct 21, 2025

FAILED entrypoints/openai/test_vision.py::test_single_chat_session_image_base64encoded_beamsearch[2-microsoft/Phi-3.5-vision-instruct] - AssertionError: assert 'The image sh... diagram with' == 'This image s...th three over'

Fixed by #26978, rebasing

@markmc markmc force-pushed the nixl-shutdown-handshake-listener-thread branch 2 times, most recently from 03de4e4 to d8297b4 Compare October 21, 2025 06:48
@mergify

mergify bot commented Oct 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 21, 2025
@markmc markmc force-pushed the nixl-shutdown-handshake-listener-thread branch from d8297b4 to 3dd1541 Compare October 21, 2025 08:37
@mergify mergify bot removed the needs-rebase label Oct 21, 2025
It turns out that we're not terminating the handshake listener thread during
shutdown.

I discovered this while adding a (now unneeded) abort-after-finished variant
of the abort_timeout_on_prefiller test and saw a port-in-use hang.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the nixl-shutdown-handshake-listener-thread branch from 3dd1541 to cee1246 Compare October 22, 2025 11:16
@markmc
Member Author

markmc commented Oct 22, 2025

Rebased to pick up #27262

@NickLucche NickLucche merged commit 4ca13a8 into vllm-project:main Oct 22, 2025
51 checks passed
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…26404)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…26404)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1
