@RedGrey1993
Contributor

Description

This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers ClientObjectRef.__del__() while the DataClient lock is held.

When using Ray Client, a deadlock can occur in the following scenario:

  1. Main thread acquires DataClient.lock (e.g., in _async_send())
  2. Garbage collection is triggered on that thread while the lock is still held
  3. GC calls ClientObjectRef.__del__()
  4. __del__() attempts to call call_release() → _release_server() → DataClient.ReleaseObject()
  5. ReleaseObject() tries to acquire the same DataClient.lock
  6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds
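The sequence above can be reproduced in miniature. The sketch below is not Ray code: `FakeDataClient` and `FakeObjectRef` are hypothetical stand-ins, and it assumes CPython, where dropping the last reference runs `__del__` immediately on the current thread (the real issue is triggered by the garbage collector, but the locking behavior is the same). The release path uses `acquire(timeout=...)` instead of a blocking acquire so that the example reports the failure instead of hanging.

```python
import gc
import threading


class FakeDataClient:
    """Hypothetical stand-in for DataClient; only the locking matters here."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, as in the scenario above
        self.release_attempts = []    # (ref_id, lock_acquired) pairs

    def release_object(self, ref_id):
        # A plain blocking acquire() would hang forever when the calling
        # thread already holds the lock. The timeout lets the sketch finish.
        acquired = self.lock.acquire(timeout=0.1)
        self.release_attempts.append((ref_id, acquired))
        if acquired:
            self.lock.release()


class FakeObjectRef:
    """Hypothetical stand-in for ClientObjectRef with a releasing finalizer."""

    def __init__(self, client, ref_id):
        self._client = client
        self._id = ref_id

    def __del__(self):
        self._client.release_object(self._id)


def demo():
    client = FakeDataClient()

    ref = FakeObjectRef(client, "obj-1")
    with client.lock:
        # The finalizer runs on this thread while the lock is held,
        # so the release path cannot acquire it.
        del ref
        gc.collect()

    ref = FakeObjectRef(client, "obj-2")
    del ref  # lock is free now, so the release succeeds
    gc.collect()

    return client.release_attempts


if __name__ == "__main__":
    print(demo())
```

With a blocking acquire in `release_object`, the first `del ref` would never return, which is the deadlock described in step 6.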

Related issues

Fixes #59643

Additional information

This PR implements a deferred release pattern that completely avoids the deadlock:

  1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released
  2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously
  3. Non-blocking __del__: ClientObjectRef.__del__() now only puts IDs into the queue (no lock acquisition)
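As a rough illustration of the three points above, the following is a minimal sketch of a deferred release pattern. It is not the code merged in this PR: `DeferredReleaser`, `Ref`, `release_fn`, and the sentinel-based shutdown are assumptions made for the example. Only `queue.SimpleQueue` and the idea of a background release thread come from the description.

```python
import queue
import threading


class DeferredReleaser:
    """Hypothetical helper: finalizers enqueue IDs, a thread releases them."""

    _STOP = object()  # sentinel that tells the background thread to exit

    def __init__(self, release_fn):
        self._release_fn = release_fn
        self._release_queue = queue.SimpleQueue()
        self._release_thread = threading.Thread(target=self._run, daemon=True)
        self._release_thread.start()

    def enqueue(self, ref_id):
        # Intended to be callable from __del__: it never takes the client lock.
        self._release_queue.put(ref_id)

    def _run(self):
        while True:
            item = self._release_queue.get()
            if item is self._STOP:
                return
            self._release_fn(item)  # may take the client lock; not in __del__

    def close(self, timeout=5.0):
        self._release_queue.put(self._STOP)
        self._release_thread.join(timeout)
        return not self._release_thread.is_alive()


class Ref:
    """Hypothetical object reference whose finalizer only enqueues its ID."""

    def __init__(self, releaser, ref_id):
        self._releaser = releaser
        self._id = ref_id

    def __del__(self):
        self._releaser.enqueue(self._id)


def demo():
    lock = threading.Lock()  # plays the role of the non-reentrant client lock
    released = []

    def release(ref_id):
        with lock:
            released.append(ref_id)

    releaser = DeferredReleaser(release)
    ref = Ref(releaser, "obj-1")
    with lock:
        del ref  # finalizer runs while the lock is held, but only enqueues
    stopped = releaser.close()  # queue is FIFO, so obj-1 is handled first
    return released, stopped


if __name__ == "__main__":
    print(demo())
```

The finalizer no longer touches the lock, so it cannot block the thread that holds it. The background thread simply waits until the lock is free and then performs the release.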

@RedGrey1993 RedGrey1993 requested a review from a team as a code owner January 9, 2026 21:17
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a deferred release mechanism using a background thread to fix a critical deadlock during garbage collection. The approach is sound, and the new test case effectively reproduces the issue and validates the fix. I have identified a potential resource leak where object IDs in the release batch may not be flushed upon worker shutdown. Additionally, I've suggested an improvement to the close method to log a warning if the release thread doesn't terminate gracefully, which will help in debugging potential future issues.
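The two review points (flushing a partially filled batch on shutdown, and warning when the release thread does not stop) could look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: `BatchingReleaser`, `send_batch`, `batch_size`, and the timeout value are hypothetical names and defaults, and for brevity the worker sends a batch only when it is full or when shutdown is requested.

```python
import logging
import queue
import threading

logger = logging.getLogger(__name__)


class BatchingReleaser:
    """Hypothetical release worker that batches IDs and flushes on close."""

    _STOP = object()  # sentinel that requests shutdown

    def __init__(self, send_batch, batch_size=100):
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._queue = queue.SimpleQueue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue(self, ref_id):
        self._queue.put(ref_id)

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:
            # Flush whatever is left so pending IDs are not dropped on shutdown.
            self._send_batch(batch)

    def close(self, timeout=5.0):
        self._queue.put(self._STOP)
        self._thread.join(timeout)
        if self._thread.is_alive():
            logger.warning(
                "Release thread did not terminate within %.1f seconds", timeout
            )
            return False
        return True


def demo():
    sent = []
    releaser = BatchingReleaser(sent.append, batch_size=2)
    for ref_id in ("a", "b", "c"):
        releaser.enqueue(ref_id)
    stopped = releaser.close()
    return sent, stopped


if __name__ == "__main__":
    print(demo())
```

In this sketch the leftover ID `"c"` is sent as a final short batch when `close()` is called, and `close()` returns `False` after logging a warning if the thread is still alive after the timeout.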

@RedGrey1993 RedGrey1993 force-pushed the fix/dataclient_deadlock branch from f66a1cb to cb6e929 Compare January 9, 2026 21:52
@edoakes edoakes added the "go" label (add ONLY when ready to merge, run all tests) Jan 9, 2026
Collaborator

@edoakes edoakes left a comment

Thanks for working on this @RedGrey1993!

@ray-gardener ray-gardener bot added the "core" (Issues that should be addressed in Ray Core) and "community-contribution" (Contributed by the community) labels Jan 10, 2026
@RedGrey1993 RedGrey1993 force-pushed the fix/dataclient_deadlock branch from 755b4b0 to 5f4a5b5 Compare January 10, 2026 01:34
Signed-off-by: redgrey1993 <[email protected]>
@RedGrey1993 RedGrey1993 force-pushed the fix/dataclient_deadlock branch from 5f4a5b5 to 5f0e7bb Compare January 10, 2026 01:41
@RedGrey1993 RedGrey1993 requested a review from edoakes January 10, 2026 01:54
@RedGrey1993
Contributor Author

@edoakes Thanks for the review. I've updated the code according to your suggestions. Please review again at your convenience.

@edoakes edoakes merged commit f512139 into ray-project:master Jan 13, 2026
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
…ect#60014)

Signed-off-by: redgrey1993 <[email protected]>
Co-authored-by: redgrey1993 <[email protected]>
Signed-off-by: jasonwrwang <[email protected]>
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
…ect#60014)

Signed-off-by: redgrey1993 <[email protected]>
Co-authored-by: redgrey1993 <[email protected]>
@RedGrey1993 RedGrey1993 deleted the fix/dataclient_deadlock branch January 15, 2026 00:28

Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[Core] Deadlock in DataClient due to recursive lock acquisition during garbage collection (__del__)

2 participants