[CI] Fix BackgroundResources double-cleanup crash by adding guard #36299
AndreasKaratzas wants to merge 2 commits into vllm-project:main
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request addresses a crash caused by double-cleanup in BackgroundResources by introducing an idempotency guard. It also correctly replaces del with None assignments to prevent AttributeError on subsequent cleanup attempts. The changes are logical and directly fix the described issue. I've added one suggestion to make the idempotency guard thread-safe.
if self._cleaned_up:
    return
self._cleaned_up = True
The current idempotency check is not thread-safe. A race condition can occur where two threads both check self._cleaned_up before it's set to True, leading to the cleanup logic running twice. Using a threading.Lock ensures that the check-and-set operation is atomic, preventing this race.
I'm suggesting a change that introduces a lock to the BackgroundResources class and uses it within __call__ to safely manage the cleanup state. You'll need to add from threading import Lock and from dataclasses import field at the top of the file to apply this suggestion.
- if self._cleaned_up:
-     return
- self._cleaned_up = True
+ with self._cleanup_lock:
+     if self._cleaned_up:
+         return
+     self._cleaned_up = True
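The locked check-and-set suggested in this comment can be sketched in isolation. This is a minimal illustration, not the vLLM code: `LockedCleanupSketch` and its `cleanup_runs` counter are hypothetical names introduced here, and the cleanup body is reduced to incrementing that counter.

```python
import threading


class LockedCleanupSketch:
    """Hypothetical stand-in for a cleanup callable guarded by a lock."""

    def __init__(self) -> None:
        self.cleanup_runs = 0  # how many times the cleanup body actually ran
        self._cleaned_up = False
        self._cleanup_lock = threading.Lock()

    def __call__(self) -> None:
        # Check and set the flag under the lock so only one caller proceeds.
        with self._cleanup_lock:
            if self._cleaned_up:
                return
            self._cleaned_up = True
        # Only the single caller that flipped the flag reaches this point.
        self.cleanup_runs += 1


resources = LockedCleanupSketch()
start = threading.Barrier(8)


def worker() -> None:
    start.wait()  # release all threads at roughly the same time
    resources()


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whether this extra lock is worth carrying depends on whether the cleanup body itself tolerates being run twice, which is the point debated in the replies.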
Python's GIL makes bool read/write atomic, and the underlying cleanup operations (close(linger=0), task.cancel(), setting attrs to None) are all individually idempotent. The _cleaned_up flag is an optimization to skip redundant work, not a correctness gate. Adding a Lock to a weakref finalizer target introduces complexity without practical benefit here.
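For context on the finalizer remark, the standard-library behaviour can be shown with a small sketch. `Owner` and `cleanup` are hypothetical names, and the assumption (not confirmed in this thread) is that the cleanup callable is reachable both through a `weakref.finalize` object and through a direct call.

```python
import weakref


class Owner:
    """Hypothetical object whose lifetime the finalizer is tied to."""


calls = []


def cleanup() -> None:
    # Stand-in for the real cleanup body; just records that it ran.
    calls.append("cleanup")


owner = Owner()
finalizer = weakref.finalize(owner, cleanup)

finalizer()  # runs cleanup and marks the finalizer dead
finalizer()  # no-op: a finalize object calls its callback at most once
cleanup()    # a direct call bypasses that once-only guarantee
```

The finalize object only protects calls made through itself, so a callable that can also be invoked directly needs to tolerate repeated calls on its own.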
engine_dead: bool = False

# Guard against double-cleanup
_cleaned_up: bool = False
To support the thread-safe idempotency guard in __call__, a lock should be added to the BackgroundResources dataclass. This requires importing Lock from threading and field from dataclasses.
- _cleaned_up: bool = False
+ _cleaned_up: bool = False
+ _cleanup_lock: "Lock" = field(default_factory=Lock, init=False, repr=False)
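The effect of declaring the lock with `field(default_factory=..., init=False, repr=False)` can be checked on a small dataclass. `GuardedSketch` is a hypothetical class used only for illustration; it mirrors the field names from the suggestion but is not the real `BackgroundResources`.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class GuardedSketch:
    """Hypothetical dataclass carrying a per-instance cleanup lock."""

    engine_dead: bool = False
    # Guard against double-cleanup
    _cleaned_up: bool = False
    # default_factory builds a fresh lock per instance;
    # init=False keeps it out of __init__, repr=False keeps it out of __repr__.
    _cleanup_lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False
    )


first = GuardedSketch()
second = GuardedSketch()

# Passing the lock to the constructor is rejected because init=False.
try:
    GuardedSketch(_cleanup_lock=threading.Lock())
    rejected = False
except TypeError:
    rejected = True
```

Using `default_factory` rather than a plain default matters here: a plain default would share one lock object across all instances.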
Adding the
Thanks @AndreasKaratzas, I think this may already be fixed by #36270.

@njhill Oh, let me test again without this then and let you know.

@njhill Indeed, closing.

Thanks @AndreasKaratzas 🙏
Fixes regression after: #34730

BackgroundResources.__call__() crashes with AttributeError: 'BackgroundResources' object has no attribute 'output_queue_task' when invoked more than once. This happens because the cleanup path uses del self.output_queue_task, which removes the attribute entirely, so a second call fails.

This is triggered in practice when the engine monitor thread detects a dead engine and calls shutdown(), followed by the caller (e.g. a test) also calling shutdown() explicitly. Both paths end up invoking self.resources().

- Add a _cleaned_up bool guard to __call__ so subsequent calls are a no-op.
- Replace del self.output_queue_task / del self.stats_update_task with = None assignments to clear references without removing the attribute.
- Clear engine_manager and coordinator after shutdown to prevent double-shutdown when MPClient.shutdown() calls engine_manager.shutdown(timeout=...) followed by self.resources().

cc @kenroche
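The guard-plus-None pattern described in this PR can be sketched as follows. This is a minimal illustration under stated assumptions, not the vLLM implementation: `ResourcesSketch` and `FakeTask` are hypothetical stand-ins, and the real class holds more resources than the single task shown here.

```python
from dataclasses import dataclass
from typing import Optional


class FakeTask:
    """Hypothetical stand-in for a background task; counts cancel() calls."""

    def __init__(self) -> None:
        self.cancel_calls = 0

    def cancel(self) -> None:
        self.cancel_calls += 1


@dataclass
class ResourcesSketch:
    """Hypothetical, simplified stand-in for BackgroundResources."""

    output_queue_task: Optional[FakeTask] = None
    # Guard against double-cleanup
    _cleaned_up: bool = False

    def __call__(self) -> None:
        if self._cleaned_up:
            return  # second and later calls are a no-op
        self._cleaned_up = True
        if self.output_queue_task is not None:
            self.output_queue_task.cancel()
        # Assign None instead of using `del`, so the attribute still exists.
        self.output_queue_task = None


task = FakeTask()
resources = ResourcesSketch(output_queue_task=task)
resources()  # e.g. triggered by the monitor thread's shutdown path
resources()  # e.g. triggered by an explicit shutdown() from the caller
```

With the guard in place the second call returns immediately, and because the attribute is set to None rather than deleted, later reads of `output_queue_task` do not fail.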