Deadlock from simple repeated @spawn and wait #36699
Comments
I can reproduce this quite robustly on my laptop with 4 physical cores using
I just added it to the 1.5 milestone. Feel free to remove it if it's not reproducible.
Yep, I'm seeing this on
julia> versioninfo()
Julia Version 1.5.0-rc1.0
Commit 24f033c951* (2020-06-26 20:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 3
for
Thanks! So it looks like robustly reproducing this requires that
I see this too, with 6 cores. When it runs it takes 5s or so. Sometimes it pauses, and can be re-started by pressing enter, and then takes 8s or 20s depending when I react. And sometimes it needs to be interrupted:
On 1.4, no problem. Also takes about 5s. (I used
On 1.6.0-DEV.305, it instead runs very slowly: 90s every time, with no pauses and no stops. And on a 2-core machine, no problems.
Looks like this is probably 65b8e7e.
This seems to fix it:
--- a/src/partr.c
+++ b/src/partr.c
@@ -367,8 +367,7 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
// something added to the sticky-queue: notify that thread
wake_thread(tid);
// check if we need to notify uv_run too
- unsigned long system_tid = jl_all_tls_states[tid]->system_id;
- if (uvlock != system_self && jl_atomic_load(&jl_uv_mutex.owner) == system_tid)
+ if (uvlock != system_self)
wake_libuv();
}
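To make the effect of that one-line change easier to see, the two versions of the condition can be written as pure predicates. This is only a sketch and not code from the Julia runtime: `wake_cond_before`, `wake_cond_after`, `check_after_is_superset`, and the parameter `uv_owner_now` are hypothetical names, with the parameters standing in for `uvlock`, `system_self`, the value loaded from `jl_uv_mutex.owner`, and `system_tid` in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition before the patch: uvlock differs from system_self AND the value
   currently loaded from the uv mutex owner equals the target's system id. */
static bool wake_cond_before(unsigned long uvlock, unsigned long system_self,
                             unsigned long uv_owner_now, unsigned long system_tid)
{
    return uvlock != system_self && uv_owner_now == system_tid;
}

/* Condition after the patch: the owner comparison is dropped. */
static bool wake_cond_after(unsigned long uvlock, unsigned long system_self)
{
    return uvlock != system_self;
}

/* Sanity check: every input that triggered wake_libuv() before still does. */
static void check_after_is_superset(unsigned long uvlock, unsigned long system_self,
                                    unsigned long uv_owner_now, unsigned long system_tid)
{
    if (wake_cond_before(uvlock, system_self, uv_owner_now, system_tid))
        assert(wake_cond_after(uvlock, system_self));
}
```

Under these assumptions the patched condition fires in a strict superset of cases: it still wakes libuv whenever the old one did, and additionally when the owner load does not match the target thread at the moment of the check.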
I tried the patch and it fixes the issue for me. I tried a 10x longer outer loop and it finishes without a deadlock.
If that is sufficient, then I think this should also have been sufficient (but it's not):
diff --git a/src/partr.c b/src/partr.c
index 61e28814c2..fdf25b3312 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -327,7 +327,7 @@ static int sleep_check_after_threshold(uint64_t *start_cycles)
}
-static void wake_thread(int16_t tid)
+static int wake_thread(int16_t tid)
{
jl_ptls_t other = jl_all_tls_states[tid];
if (jl_atomic_load(&other->sleep_check_state) != not_sleeping) {
@@ -336,8 +336,10 @@ static void wake_thread(int16_t tid)
uv_mutex_lock(&other->sleep_lock);
uv_cond_signal(&other->wake_signal);
uv_mutex_unlock(&other->sleep_lock);
+ return 1;
}
}
+ return 0;
}
@@ -365,10 +367,8 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
}
else {
// something added to the sticky-queue: notify that thread
- wake_thread(tid);
// check if we need to notify uv_run too
- unsigned long system_tid = jl_all_tls_states[tid]->system_id;
- if (uvlock != system_self && jl_atomic_load(&jl_uv_mutex.owner) == system_tid)
+ if (wake_thread(tid) && uvlock != system_self && tid == 0)
wake_libuv();
}
// check if the other threads might be sleeping
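The idea in this second diff, having the wake function report whether it actually signalled a sleeping thread, can be sketched in isolation. The following is a hedged illustration, not the runtime's implementation: it uses pthreads and C11 atomics in place of the `uv_mutex_*`/`uv_cond_*` calls, `sleeper_t`, `sleeper_init`, and `wake_sleeper` are hypothetical names, and the `atomic_exchange` step is an assumption, since the corresponding lines are not shown in the diff.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { NOT_SLEEPING = 0, SLEEPING = 1 };

/* Hypothetical per-thread sleep state, loosely modelled on the fields
   named in the diff (sleep_check_state, sleep_lock, wake_signal). */
typedef struct {
    atomic_int state;
    pthread_mutex_t sleep_lock;
    pthread_cond_t wake_signal;
} sleeper_t;

static void sleeper_init(sleeper_t *s)
{
    atomic_init(&s->state, NOT_SLEEPING);
    pthread_mutex_init(&s->sleep_lock, NULL);
    pthread_cond_init(&s->wake_signal, NULL);
}

/* Returns 1 if the target was marked as sleeping and a signal was sent,
   0 otherwise, so the caller can decide whether further work is needed. */
static int wake_sleeper(sleeper_t *s)
{
    assert(s != NULL);
    if (atomic_load(&s->state) != NOT_SLEEPING) {
        /* Assumption: claim the wake-up by flipping the state atomically. */
        if (atomic_exchange(&s->state, NOT_SLEEPING) == SLEEPING) {
            pthread_mutex_lock(&s->sleep_lock);
            pthread_cond_signal(&s->wake_signal);
            pthread_mutex_unlock(&s->sleep_lock);
            return 1;
        }
    }
    return 0;
}
```

A caller can then gate follow-up work on the return value, in the spirit of `if (wake_thread(tid) && ...)` in the diff. As the comment above notes, gating `wake_libuv()` this way was not sufficient to fix the deadlock.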
Fixes #36699, I believe. I think I know why this fixes it, but I need to validate that I got it correct before writing it up. Nevertheless, since I think this fixes the issue and the issue is release blocking, here's the fix ahead of the writeup.
See if #36785 helps.
Simply repeating @spawn and wait causes a deadlock:
In the above session, I saw '.'s printed pretty rapidly until it hung. Then I terminated the evaluation with Ctrl-C.
I see this in 1.5.0-rc1.0 but not in 1.4.
(I tried
JULIA_RR_RECORD_ARGS='--chaos --num-cores=4' JULIA_NUM_THREADS=4 julia --startup-file=no --bug-report=rr
etc. to get a deadlock in rr, but so far I couldn't reproduce it under rr.)