-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Core] Fix deadlock in garbage collection when holding lock #60014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Fix deadlock in garbage collection when holding lock #60014
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a deferred release mechanism using a background thread to fix a critical deadlock during garbage collection. The approach is sound, and the new test case effectively reproduces the issue and validates the fix. I have identified a potential resource leak where object IDs in the release batch may not be flushed upon worker shutdown. Additionally, I've suggested an improvement to the close method to log a warning if the release thread doesn't terminate gracefully, which will help in debugging potential future issues.
Signed-off-by: redgrey1993 <[email protected]>
f66a1cb to
cb6e929
Compare
edoakes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @RedGrey1993!
755b4b0 to
5f4a5b5
Compare
Signed-off-by: redgrey1993 <[email protected]>
5f4a5b5 to
5f0e7bb
Compare
|
@edoakes Thanks for the review. I've updated the code according to your suggestions. Please review again at your convenience. |
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <[email protected]> Co-authored-by: redgrey1993 <[email protected]> Signed-off-by: jasonwrwang <[email protected]>
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <[email protected]> Co-authored-by: redgrey1993 <[email protected]>
Description
This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers
ClientObjectRef.__del__()while the DataClient lock is held.When using Ray Client, a deadlock can occur in the following scenario:
ClientObjectRef.__del__()__del__()attempts to call call_release() → _release_server() → DataClient.ReleaseObject()Related issues
Additional information
This PR implements a deferred release pattern that completely avoids the deadlock:
__del__:ClientObjectRef.__del__()now only puts IDs into the queue (no lock acquisition)