Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout generic callbacks used in destructors and inflight request cancelation #123

Merged
merged 6 commits into from
Nov 20, 2023

Conversation

pentschev
Copy link
Member

Destructors may be called from Python's garbage collector, disallowing us to disable the GIL for that operation. When that occurs, there might be a deadlock as other threads have the GIL, such as the progress thread when a Python's listener callback is invoked and executed in that thread.

Destructors may be called from Python's garbage collector, disallowing
us to disable the GIL for that operation. When that occurs, there might
be a deadlock as other threads have the GIL, such as the progress thread
when a Python's listener callback is invoked and executed in that
thread.
@pentschev pentschev requested a review from a team as a code owner November 10, 2023 14:02
@pentschev pentschev changed the title Timeout ucxx::Endpoint and ucxx::Listener destructors Timeout generic callbacks used in destructor and inflight request cancelation Nov 15, 2023
@pentschev pentschev changed the title Timeout generic callbacks used in destructor and inflight request cancelation Timeout generic callbacks used in destructors and inflight request cancelation Nov 15, 2023
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor cleanups (to avoid excessive looping when maxAttempts > 1 and a suggestion of how to implement a timeout for the spinlock case.

I was trying to think about how we could drop the gil (say) when destructing in python, but I am not sure if it is possible.

cpp/src/endpoint.cpp Outdated Show resolved Hide resolved
cpp/src/endpoint.cpp Outdated Show resolved Hide resolved
cpp/src/utils/callback_notifier.cpp Show resolved Hide resolved
cpp/src/worker.cpp Outdated Show resolved Hide resolved
@wence-
Copy link
Contributor

wence- commented Nov 15, 2023

I wonder if it is also a good idea to explicitly release the gil when dropping the C++ shared_ptr attributes of UCXX python cdef classes. I think something like:

diff --git a/python/ucxx/_lib/libucxx.pyx b/python/ucxx/_lib/libucxx.pyx
index 3e26bfb..ef8b4be 100644
--- a/python/ucxx/_lib/libucxx.pyx
+++ b/python/ucxx/_lib/libucxx.pyx
@@ -862,6 +862,10 @@ cdef class UCXEndpoint():
         bint _enable_python_future
         dict _close_cb_data
 
+    def __dealloc__(self):
+        with nogil:
+            self._endpoint = <shared_ptr[Endpoint]>nullptr;
+
     def __init__(
             self,
             uintptr_t shared_ptr_endpoint,

(For example) Makes sure that (if this is the last reference to the Endpoint) ~Endpoint runs with the GIL released.

WDYT?

I think this change is worthwhile anyway so I can propose in a separate PR.

@pentschev
Copy link
Member Author

I wonder if it is also a good idea to explicitly release the gil when dropping the C++ shared_ptr attributes of UCXX python cdef classes. I think something like:

diff --git a/python/ucxx/_lib/libucxx.pyx b/python/ucxx/_lib/libucxx.pyx
index 3e26bfb..ef8b4be 100644
--- a/python/ucxx/_lib/libucxx.pyx
+++ b/python/ucxx/_lib/libucxx.pyx
@@ -862,6 +862,10 @@ cdef class UCXEndpoint():
         bint _enable_python_future
         dict _close_cb_data
 
+    def __dealloc__(self):
+        with nogil:
+            self._endpoint = <shared_ptr[Endpoint]>nullptr;
+
     def __init__(
             self,
             uintptr_t shared_ptr_endpoint,

(For example) Makes sure that (if this is the last reference to the Endpoint) ~Endpoint runs with the GIL released.

WDYT?

I think this change is worthwhile anyway so I can propose in a separate PR.

Sorry, I missed this reply earlier. I think this is a great idea, it might actually do what we were attempting of releasing the GIL at destruction time, and __dealloc__ didn't occur to me before so thanks for thinking of that! I agree a separate PR is probably good, if you want to do it please feel free to submit it, otherwise I can work on it next week as well. 🙂

@wence-
Copy link
Contributor

wence- commented Nov 17, 2023

I agree a separate PR is probably good, if you want to do it please feel free to submit it, otherwise I can work on it next week as well.

#132

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Peter

@pentschev
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit b4957da into rapidsai:branch-0.35 Nov 20, 2023
19 checks passed
@pentschev
Copy link
Member Author

Thanks for the review @wence- !

@pentschev pentschev deleted the timeout-destructors branch November 27, 2023 11:16
rapids-bot bot pushed a commit that referenced this pull request Nov 28, 2023
#123 introduced timeouts to the generic callbacks, preventing failure to acquire lock due to GIL competition. However, those were not exposed to Python and at least one of the reasons it still timeouts is because of that, notice how the default `period=0` (never unblock) is used:

```cpp
Thread 1 (Thread 0x7f36d675f740 (LWP 155586) "pytest"):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fff45058e58) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fff45058e08, cond=0x7fff45058e30) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7fff45058e30, mutex=0x7fff45058e08) at pthread_cond_wait.c:647
#3  0x00007f36d43634d4 in std::condition_variable::wait<ucxx::utils::CallbackNotifier::wait(uint64_t)::<lambda()> > (__p=..., __lock=..., this=0x7fff45058e30) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/condition_variable:103
#4  ucxx::utils::CallbackNotifier::wait (this=this@entry=0x7fff45058e00, period=period@entry=0) at /datasets/pentschev/src/ucxx-deadlock/cpp/src/utils/callback_notifier.cpp:66
#5  0x00007f36d43470e1 in ucxx::Endpoint::close (this=0x7f369c701a90, period=0, maxAttempts=1) at /datasets/pentschev/src/ucxx-deadlock/cpp/src/endpoint.cpp:171
#6  0x00007f36d4753381 in __pyx_pw_4ucxx_4_lib_7libucxx_11UCXEndpoint_9close(_object*, _object* const*, long, _object*) () from /opt/conda/envs/test/lib/python3.10/site-packages/ucxx/_lib/libucxx.cpython-310-x86_64-linux-gnu.so
```

This PR exposes those arguments to Python and specify a default for Python async API `Endpoint.abort()` to prevent such deadlocks from occurring.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #136
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants