Skip to content

[CI] Fix BackgroundResources double-cleanup crash by adding guard#36299

Closed
AndreasKaratzas wants to merge 2 commits intovllm-project:mainfrom
ROCm:akaratza_fix_cleanup
Closed

[CI] Fix BackgroundResources double-cleanup crash by adding guard#36299
AndreasKaratzas wants to merge 2 commits intovllm-project:mainfrom
ROCm:akaratza_fix_cleanup

Conversation

@AndreasKaratzas
Copy link
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 7, 2026

Fixes regression after: #34730

BackgroundResources.__call__() crashes with AttributeError: 'BackgroundResources' object has no attribute 'output_queue_task' when invoked more than once. This happens because the cleanup path uses del self.output_queue_task which removes the attribute entirely, so a second call fails.

This is triggered in practice when the engine monitor thread detects a dead engine and calls shutdown(), followed by the caller (e.g. a test) also calling shutdown() explicitly. Both paths end up invoking self.resources().

  • Add a _cleaned_up bool guard so __call__ subsequent calls are a no-op.
  • Replace del self.output_queue_task / del self.stats_update_task with = None assignments to clear references without removing the attribute.
  • Nil out engine_manager and coordinator after shutdown to prevent double-shutdown when MPClient.shutdown() calls engine_manager.shutdown(timeout=...) followed by self.resources().

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a crash caused by double-cleanup in BackgroundResources by introducing an idempotency guard. It also correctly replaces del with None assignments to prevent AttributeError on subsequent cleanup attempts. The changes are logical and directly fix the described issue. I've added one suggestion to make the idempotency guard thread-safe.

Comment on lines +392 to +394
if self._cleaned_up:
return
self._cleaned_up = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current idempotency check is not thread-safe. A race condition can occur where two threads both check self._cleaned_up before it's set to True, leading to the cleanup logic running twice. Using a threading.Lock ensures that the check-and-set operation is atomic, preventing this race.

I'm suggesting a change that introduces a lock to the BackgroundResources class and uses it within __call__ to safely manage the cleanup state. You'll need to add from threading import Lock and from dataclasses import field at the top of the file to apply this suggestion.

Suggested change
if self._cleaned_up:
return
self._cleaned_up = True
with self._cleanup_lock:
if self._cleaned_up:
return
self._cleaned_up = True

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python's GIL makes bool read/write atomic, and the underlying cleanup operations (close(linger=0), task.cancel(), setting attrs to None) are all individually idempotent. The _cleaned_up flag is an optimization to skip redundant work, not a correctness gate. Adding a Lock to a weakref finalizer target introduces complexity without practical benefit here.

engine_dead: bool = False

# Guard against double-cleanup
_cleaned_up: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

To support the thread-safe idempotency guard in __call__, a lock should be added to the BackgroundResources dataclass. This requires importing Lock from threading and field from dataclasses.

Suggested change
_cleaned_up: bool = False
_cleaned_up: bool = False
_cleanup_lock: "Lock" = field(default_factory=Lock, init=False, repr=False)

@AndreasKaratzas
Copy link
Collaborator Author

cc @njhill @markmc

@AndreasKaratzas AndreasKaratzas added the rocm Related to AMD ROCm label Mar 7, 2026
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 7, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 7, 2026
@AndreasKaratzas
Copy link
Collaborator Author

Adding the ready label for tests to start.

@njhill
Copy link
Member

njhill commented Mar 7, 2026

Thanks @AndreasKaratzas, I think this may already be fixed by #36270

@AndreasKaratzas
Copy link
Collaborator Author

Thanks @AndreasKaratzas, I think this may already be fixed by #36270

@njhill Oh let me test again without this then and let you know.

@AndreasKaratzas
Copy link
Collaborator Author

@njhill Indeed, closing.

@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 7, 2026
@njhill
Copy link
Member

njhill commented Mar 7, 2026

Thanks @AndreasKaratzas 🙏

@AndreasKaratzas AndreasKaratzas deleted the akaratza_fix_cleanup branch March 7, 2026 02:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants