[NIXL] Terminate handshake listener thread in shutdown#26404
[NIXL] Terminate handshake listener thread in shutdown#26404NickLucche merged 1 commit intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
Thank you for this contribution. The changes correctly address the issue of the handshake listener thread not being terminated during shutdown. I've identified one high-severity issue related to the use of __del__ for resource cleanup, which is an unreliable practice in Python and can introduce thread-safety problems. Please see my detailed comment below.
| def __del__(self): | ||
| self.shutdown() |
There was a problem hiding this comment.
Using __del__ for resource cleanup is an anti-pattern in Python because its execution is not guaranteed. It may not be called if the object is part of a reference cycle, or during interpreter shutdown, which can lead to resource leaks that are hard to debug. The explicit shutdown() calls, like the one you've added in the tests, are the correct and reliable way to handle cleanup.
Furthermore, __del__ can be invoked by the garbage collector at any time and from any thread. This makes the thread-safety of shutdown() critical, but the current implementation is not thread-safe. For example:
- One thread could be iterating over
self._recving_transferswhile another clears it, causing aRuntimeError. self._nixl_handshake_listener_tcould be set toNoneby one thread after another has checked it forNonebut before calling.join()on it, resulting in anAttributeError.
I strongly recommend removing the __del__ method and relying solely on explicit shutdown() calls. If shutdown() might be called concurrently from other paths, it should be protected with a threading.Lock.
There was a problem hiding this comment.
Calling shutdown() from here is convenient for tests - it means resources are cleaned up even without explicit shutdown calls. Outside of tests, we should be calling shutdown() explicitly
Another example of us doing this:
vllm/v1/engine/async_llm.py:235: def __del__(self):
vllm/v1/engine/async_llm.py-236- self.shutdown()
cad5ba5 to
a44e75b
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
a44e75b to
6336ae2
Compare
NickLucche
left a comment
There was a problem hiding this comment.
lgtm thanks @markmc , only left one comment.
Let's get ci green and merge
| events = dict(poller.poll(timeout=poll_timeout * 1000)) | ||
| if sock not in events: | ||
| continue |
There was a problem hiding this comment.
ok so this replaces the busy waiting mechanism without hanging on recv so we can quit, nice
Fixed by #26978, rebasing |
03de4e4 to
d8297b4
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
d8297b4 to
3dd1541
Compare
It turns out that we're not terminating the handshake listener thread during shutdown. I discovered this while adding a (now unneeded) abort-after-finished variant of the abort_timeout_on_prefiller test and saw a port-in-use hang. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
3dd1541 to
cee1246
Compare
|
Rebased to pick up #27262 |
…26404) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
…26404) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…26404) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…26404) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
…26404) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
…26404) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
It turns out that we're not terminating the handshake listener thread during shutdown.
I discovered this while adding a (now unneeded) abort-after-finished variant of the abort_timeout_on_prefiller test and saw a port-in-use hang.
xref #26351