Darwin/ARM64: Julia freezes on nested @threads loops #41820
Since I also have an ARM-based computer (PinebookPro/Rockchip), I tried out this MWE on it. The script made it all the way to 1000 with no issue (I use scripts with `#!/usr/bin/env julia`). The only thing I noticed was that it ran surprisingly slowly and maxed out one core. Looping in the main loop `for i in 1:50` with `@threads`: … Looping in the main loop `for i in 1:50` with `@threads` removed: …
Did you start Julia with more than one thread? Maybe run your script with …
I can confirm the issue on the M1. CPU usage in Activity Monitor goes to 0; it looks like a deadlock. @brett-mahar I think you're only seeing the overhead of `@threads`.
You are both right, I didn't know about the `-t` command-line flag. Using `#!/home/brett/.julia/julia-2021-08-07/bin/julia -t 4` as the shebang line, htop confirms multiple cores are in use. My comment was unclear: it wasn't the overhead of `@threads` that I found surprisingly slow, I was referring to both versions, with and without `@threads` (we only have a `rand(100)` and a `println(i)` inside the loops, which is why I thought it would have been faster :-) ). Anyway, I used … Here is the MWE in `@code_native`:

julia> @code_native binary=true debuginfo=:default include("/home/brett/.JULIA-LANGUAGE/threads-MWE.jl")
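As an aside for anyone reproducing this, here is a minimal sketch of how to make sure a script really runs multi-threaded. The `-t`/`--threads` flag and the `JULIA_NUM_THREADS` environment variable are the two standard knobs; the script name below is hypothetical.

```julia
# Launch with several threads, e.g.:
#   julia -t 4 threads-MWE.jl
# or:
#   JULIA_NUM_THREADS=4 julia threads-MWE.jl
using Base.Threads

# Guard against accidentally benchmarking a single-threaded run.
@assert nthreads() > 1 "started with one thread; @threads will run serially"

@threads for i in 1:8
    println("iteration $i ran on thread $(threadid())")
end
```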
Here is the `@code_native` output:

julia> function foobar()
for i in 1:1000
println(i)
for j in 1:10000
foo()
end
end
end
foobar (generic function with 1 method)
julia> @code_native binary=true debuginfo=:default foobar()
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 11, 0
.section __TEXT,__literal16,16byte_literals
.p2align 4 ; -- Begin function julia_foobar_249
lCPI0_0:
.quad 1 ; 0x1
.quad 10 ; 0xa
.section __TEXT,__text,regular,pure_instructions
.globl _julia_foobar_249
.p2align 2
_julia_foobar_249: ; @julia_foobar_249
; ┌ @ REPL[34]:1 within `foobar`
.cfi_startproc
; %bb.0: ; %top
sub sp, sp, #160 ; encoding: [0xff,0x83,0x02,0xd1]
; =160
stp x28, x27, [sp, #64] ; 16-byte Folded Spill
; encoding: [0xfc,0x6f,0x04,0xa9]
stp x26, x25, [sp, #80] ; 16-byte Folded Spill
; encoding: [0xfa,0x67,0x05,0xa9]
stp x24, x23, [sp, #96] ; 16-byte Folded Spill
; encoding: [0xf8,0x5f,0x06,0xa9]
stp x22, x21, [sp, #112] ; 16-byte Folded Spill
; encoding: [0xf6,0x57,0x07,0xa9]
stp x20, x19, [sp, #128] ; 16-byte Folded Spill
; encoding: [0xf4,0x4f,0x08,0xa9]
stp x29, x30, [sp, #144] ; 16-byte Folded Spill
; encoding: [0xfd,0x7b,0x09,0xa9]
.cfi_def_cfa_offset 160
.cfi_offset w30, -8
.cfi_offset w29, -16
.cfi_offset w19, -24
.cfi_offset w20, -32
.cfi_offset w21, -40
.cfi_offset w22, -48
.cfi_offset w23, -56
.cfi_offset w24, -64
.cfi_offset w25, -72
.cfi_offset w26, -80
.cfi_offset w27, -88
.cfi_offset w28, -96
stp xzr, xzr, [sp, #16] ; encoding: [0xff,0x7f,0x01,0xa9]
str xzr, [sp, #32] ; encoding: [0xff,0x13,0x00,0xf9]
mov x8, #42580 ; encoding: [0x88,0xca,0x94,0xd2]
movk x8, #34658, lsl #16 ; encoding: [0x48,0xec,0xb0,0xf2]
movk x8, #1, lsl #32 ; encoding: [0x28,0x00,0xc0,0xf2]
mov w0, #260 ; encoding: [0x80,0x20,0x80,0x52]
blr x8 ; encoding: [0x00,0x01,0x3f,0xd6]
mov x19, x0 ; encoding: [0xf3,0x03,0x00,0xaa]
mov w8, #4 ; encoding: [0x88,0x00,0x80,0x52]
str x8, [sp, #16] ; encoding: [0xe8,0x0b,0x00,0xf9]
ldr x8, [x0] ; encoding: [0x08,0x00,0x40,0xf9]
str x8, [sp, #24] ; encoding: [0xe8,0x0f,0x00,0xf9]
add x8, sp, #16 ; encoding: [0xe8,0x43,0x00,0x91]
; =16
str x8, [x0] ; encoding: [0x08,0x00,0x00,0xf9]
mov w20, #1 ; encoding: [0x34,0x00,0x80,0x52]
mov x22, #2656 ; encoding: [0x16,0x4c,0x81,0xd2]
movk x22, #199, lsl #16 ; encoding: [0xf6,0x18,0xa0,0xf2]
movk x22, #1, lsl #32 ; encoding: [0x36,0x00,0xc0,0xf2]
Lloh0:
adrp x23, _jl_gc_pool_alloc@GOTPAGE ; encoding: [0x17'A',A,A,0x90'A']
; fixup A - offset: 0, value: _jl_gc_pool_alloc@GOTPAGE, kind: fixup_aarch64_pcrel_adrp_imm21
Lloh1:
ldr x23, [x23, _jl_gc_pool_alloc@GOTPAGEOFF] ; encoding: [0xf7,0bAAAAAA10,0b01AAAAAA,0xf9]
; fixup A - offset: 0, value: _jl_gc_pool_alloc@GOTPAGEOFF, kind: fixup_aarch64_ldst_imm12_scale8
mov x24, #64688 ; encoding: [0x18,0x96,0x9f,0xd2]
movk x24, #1440, lsl #16 ; encoding: [0x18,0xb4,0xa0,0xf2]
movk x24, #1, lsl #32 ; encoding: [0x38,0x00,0xc0,0xf2]
Lloh2:
adrp x8, lCPI0_0@GOTPAGE ; encoding: [0x08'A',A,A,0x90'A']
; fixup A - offset: 0, value: lCPI0_0@GOTPAGE, kind: fixup_aarch64_pcrel_adrp_imm21
Lloh3:
ldr x8, [x8, lCPI0_0@GOTPAGEOFF] ; encoding: [0x08,0bAAAAAA01,0b01AAAAAA,0xf9]
; fixup A - offset: 0, value: lCPI0_0@GOTPAGEOFF, kind: fixup_aarch64_ldst_imm12_scale8
ldr q0, [x8] ; encoding: [0x00,0x01,0xc0,0x3d]
str q0, [sp] ; 16-byte Folded Spill
; encoding: [0xe0,0x03,0x80,0x3d]
Lloh4:
adrp x25, _j1_threading_run_252@GOTPAGE ; encoding: [0x19'A',A,A,0x90'A']
; fixup A - offset: 0, value: _j1_threading_run_252@GOTPAGE, kind: fixup_aarch64_pcrel_adrp_imm21
Lloh5:
ldr x25, [x25, _j1_threading_run_252@GOTPAGEOFF] ; encoding: [0x39,0bAAAAAA11,0b01AAAAAA,0xf9]
; fixup A - offset: 0, value: _j1_threading_run_252@GOTPAGEOFF, kind: fixup_aarch64_ldst_imm12_scale8
mov x26, #62768 ; encoding: [0x1a,0xa6,0x9e,0xd2]
movk x26, #1449, lsl #16 ; encoding: [0x3a,0xb5,0xa0,0xf2]
movk x26, #1, lsl #32 ; encoding: [0x3a,0x00,0xc0,0xf2]
mov x27, #58400 ; encoding: [0x1b,0x84,0x9c,0xd2]
movk x27, #2573, lsl #16 ; encoding: [0xbb,0x41,0xa1,0xf2]
movk x27, #1, lsl #32 ; encoding: [0x3b,0x00,0xc0,0xf2]
Lloh6:
adrp x28, _jl_f__call_latest@GOTPAGE ; encoding: [0x1c'A',A,A,0x90'A']
; fixup A - offset: 0, value: _jl_f__call_latest@GOTPAGE, kind: fixup_aarch64_pcrel_adrp_imm21
Lloh7:
ldr x28, [x28, _jl_f__call_latest@GOTPAGEOFF] ; encoding: [0x9c,0bAAAAAA11,0b01AAAAAA,0xf9]
; fixup A - offset: 0, value: _jl_f__call_latest@GOTPAGEOFF, kind: fixup_aarch64_ldst_imm12_scale8
b LBB0_2 ; encoding: [A,A,A,0b000101AA]
; fixup A - offset: 0, value: LBB0_2, kind: fixup_aarch64_pcrel_branch26
LBB0_1: ; %L37
; in Loop: Header=BB0_2 Depth=1
; │ @ REPL[34]:5 within `foobar`
; │┌ @ range.jl:832 within `iterate`
add x8, x20, #1 ; encoding: [0x88,0x06,0x00,0x91]
; =1
; │└
cmp x20, #1000 ; encoding: [0x9f,0xa2,0x0f,0xf1]
; =1000
mov x20, x8 ; encoding: [0xf4,0x03,0x08,0xaa]
b.eq LBB0_7 ; encoding: [0bAAA00000,A,A,0x54]
; fixup A - offset: 0, value: LBB0_7, kind: fixup_aarch64_pcrel_branch19
LBB0_2: ; %L2
; =>This Loop Header: Depth=1
; Child Loop BB0_4 Depth 2
; │ @ REPL[34]:3 within `foobar`
mov x0, x20 ; encoding: [0xe0,0x03,0x14,0xaa]
Lloh8:
adrp x8, _j_println_251@GOTPAGE ; encoding: [0x08'A',A,A,0x90'A']
; fixup A - offset: 0, value: _j_println_251@GOTPAGE, kind: fixup_aarch64_pcrel_adrp_imm21
Lloh9:
ldr x8, [x8, _j_println_251@GOTPAGEOFF] ; encoding: [0x08,0bAAAAAA01,0b01AAAAAA,0xf9]
; fixup A - offset: 0, value: _j_println_251@GOTPAGEOFF, kind: fixup_aarch64_ldst_imm12_scale8
blr x8 ; encoding: [0x00,0x01,0x3f,0xd6]
mov w21, #10000 ; encoding: [0x15,0xe2,0x84,0x52]
b LBB0_4 ; encoding: [A,A,A,0b000101AA]
; fixup A - offset: 0, value: LBB0_4, kind: fixup_aarch64_pcrel_branch26
LBB0_3: ; %L23
; in Loop: Header=BB0_4 Depth=2
; │ @ REPL[34]:5 within `foobar`
; │┌ @ REPL[2]:2 within `foo`
; ││┌ @ threadingconstructs.jl:90 within `macro expansion`
; │││┌ @ essentials.jl:716 within `invokelatest`
; ││││┌ @ essentials.jl:718 within `#invokelatest#2`
stp x26, x27, [sp, #48] ; encoding: [0xfa,0x6f,0x03,0xa9]
add x1, sp, #48 ; encoding: [0xe1,0xc3,0x00,0x91]
; =48
mov x0, #0 ; encoding: [0x00,0x00,0x80,0xd2]
mov w2, #2 ; encoding: [0x42,0x00,0x80,0x52]
blr x28 ; encoding: [0x80,0x03,0x3f,0xd6]
; │└└└└
; │┌ @ range.jl:832 within `iterate`
; ││┌ @ promotion.jl:429 within `==`
subs x21, x21, #1 ; encoding: [0xb5,0x06,0x00,0xf1]
; =1
; │└└
b.eq LBB0_1 ; encoding: [0bAAA00000,A,A,0x54]
; fixup A - offset: 0, value: LBB0_1, kind: fixup_aarch64_pcrel_branch19
LBB0_4: ; %L6
; Parent Loop BB0_2 Depth=1
; => This Inner Loop Header: Depth=2
; │┌ @ REPL[2]:2 within `foo`
; ││┌ @ threadingconstructs.jl:89 within `macro expansion`
; │││┌ @ threadingconstructs.jl:10 within `threadid`
ldurh w8, [x19, #-4] ; encoding: [0x68,0xc2,0x5f,0x78]
; │││└
cbnz w8, LBB0_3 ; encoding: [0bAAA01000,A,A,0x35]
; fixup A - offset: 0, value: LBB0_3, kind: fixup_aarch64_pcrel_branch19
; %bb.5: ; %L15
; in Loop: Header=BB0_4 Depth=2
blr x22 ; encoding: [0xc0,0x02,0x3f,0xd6]
cbnz w0, LBB0_3 ; encoding: [0bAAA00000,A,A,0x35]
; fixup A - offset: 0, value: LBB0_3, kind: fixup_aarch64_pcrel_branch19
; %bb.6: ; %L21
; in Loop: Header=BB0_4 Depth=2
; ││└
; ││┌ @ threadingconstructs.jl:97 within `macro expansion`
ldr x0, [x19, #16] ; encoding: [0x60,0x0a,0x40,0xf9]
mov w1, #1440 ; encoding: [0x01,0xb4,0x80,0x52]
mov w2, #32 ; encoding: [0x02,0x04,0x80,0x52]
blr x23 ; encoding: [0xe0,0x02,0x3f,0xd6]
stur x24, [x0, #-8] ; encoding: [0x18,0x80,0x1f,0xf8]
ldr q0, [sp] ; 16-byte Folded Reload
; encoding: [0xe0,0x03,0xc0,0x3d]
str q0, [x0] ; encoding: [0x00,0x00,0x80,0x3d]
str x0, [sp, #32] ; encoding: [0xe0,0x13,0x00,0xf9]
str x0, [sp, #48] ; encoding: [0xe0,0x1b,0x00,0xf9]
add x1, sp, #48 ; encoding: [0xe1,0xc3,0x00,0x91]
; =48
mov x0, #54704 ; encoding: [0x00,0xb6,0x9a,0xd2]
movk x0, #3500, lsl #16 ; encoding: [0x80,0xb5,0xa1,0xf2]
movk x0, #1, lsl #32 ; encoding: [0x20,0x00,0xc0,0xf2]
mov w2, #1 ; encoding: [0x22,0x00,0x80,0x52]
blr x25 ; encoding: [0x20,0x03,0x3f,0xd6]
; │└└
; │┌ @ range.jl:832 within `iterate`
; ││┌ @ promotion.jl:429 within `==`
subs x21, x21, #1 ; encoding: [0xb5,0x06,0x00,0xf1]
; =1
; │└└
b.ne LBB0_4 ; encoding: [0bAAA00001,A,A,0x54]
; fixup A - offset: 0, value: LBB0_4, kind: fixup_aarch64_pcrel_branch19
b LBB0_1 ; encoding: [A,A,A,0b000101AA]
; fixup A - offset: 0, value: LBB0_1, kind: fixup_aarch64_pcrel_branch26
LBB0_7: ; %L49
ldr x8, [sp, #24] ; encoding: [0xe8,0x0f,0x40,0xf9]
str x8, [x19] ; encoding: [0x68,0x02,0x00,0xf9]
ldp x29, x30, [sp, #144] ; 16-byte Folded Reload
; encoding: [0xfd,0x7b,0x49,0xa9]
ldp x20, x19, [sp, #128] ; 16-byte Folded Reload
; encoding: [0xf4,0x4f,0x48,0xa9]
ldp x22, x21, [sp, #112] ; 16-byte Folded Reload
; encoding: [0xf6,0x57,0x47,0xa9]
ldp x24, x23, [sp, #96] ; 16-byte Folded Reload
; encoding: [0xf8,0x5f,0x46,0xa9]
ldp x26, x25, [sp, #80] ; 16-byte Folded Reload
; encoding: [0xfa,0x67,0x45,0xa9]
ldp x28, x27, [sp, #64] ; 16-byte Folded Reload
; encoding: [0xfc,0x6f,0x44,0xa9]
add sp, sp, #160 ; encoding: [0xff,0x83,0x02,0x91]
; =160
ret ; encoding: [0xc0,0x03,0x5f,0xd6]
.loh AdrpLdrGot Lloh6, Lloh7
.loh AdrpLdrGot Lloh4, Lloh5
.loh AdrpLdrGot Lloh2, Lloh3
.loh AdrpLdrGot Lloh0, Lloh1
.loh AdrpLdrGot Lloh8, Lloh9
.cfi_endproc
; └
; -- End function
	.subsections_via_symbols
A bit more debugging shows that the nesting of the `@threads` loop inside the outer loops is what matters:

julia> function foo()
Threads.@threads for i in 1:10
rand()
end
end
foo (generic function with 1 method)
julia> for i in 1:1000
println(i)
for j in 1:10000
foo()
end
       end
FWIW, on an M1 Max the above MWE freezes for me on Julia 1.6.4 or Julia 1.7.0-rc3 using Rosetta with >=8 threads, and on native 1.7.0-rc3 with >=4 threads, but works with a smaller number of threads.
Can confirm Denis-Titov's experience on another M1 Max. My program (and gbaraldi's MWE) runs fine under Rosetta with <=7 threads, but deadlocks with >=8.
Just ran into this issue in 1.7 on an M1 Max when trying to train an …
Doing a spindump shows that 3 threads are waiting on …

Arm spindump

Date/Time:        2021-12-06 17:43:54.778 -0300
End time:         2021-12-06 17:44:04.781 -0300
OS Version:       macOS 12.0.1 (Build 21A559)
Architecture:     arm64e
Report Version:   35
Share With Devs:  Yes

Data Source:      Stackshots
Command:          julia
Duration:         10.00s
Hardware model:   MacBookPro17,1
Time Awake Since Boot: 31204s
Fan speed:        0 rpm
Heavy format:     stacks are sorted by count

Process: julia [10993]
Thread 0x373d6  DispatchQueue 8875881432  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373d7  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373e7  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373e8  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373e9  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373ea  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373eb  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373ec  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x373ed  1001 samples (1-1001)  priority 31 (base 31)
Binary Images: …

Rosetta spindump

Date/Time:        2021-12-06 17:41:17.057 -0300
End time:         2021-12-06 17:41:27.062 -0300
OS Version:       macOS 12.0.1 (Build 21A559)
Architecture:     arm64e
Report Version:   35
Share With Devs:  Yes

Data Source:      Stackshots
Command:          julia
Duration:         10.00s
Hardware model:   MacBookPro17,1
Time Awake Since Boot: 31046s
Fan speed:        0 rpm
Heavy format:     stacks are sorted by count

Process: julia [10934]
Thread 0x367cd  DispatchQueue 140704661143000  1001 samples (1-1001)  priority 31 (base 31)  cpu time 9.987s (31.3G cycles, 91.5G instructions, 0.34c/i)
Thread 0x367d0  Thread name "com.apple.rosetta.exceptionserver"  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367d1  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367da  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367db  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367dc  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367dd  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367de  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367df  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367e0  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367e1  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367e2  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367e3  1001 samples (1-1001)  priority 31 (base 31)
Thread 0x367e4  1001 samples (1-1001)  priority 31 (base 31)
Binary Images: …
As is par for the course for such issues, it goes away when enabling …
TSAN finds some races for the above test case:
(More output elided.)

On x86_64-linux, TSAN is completely quiet for this test case.
The test case appears to run without ever deadlocking on AWS Graviton2 (aarch64-linux) with 32 threads.
The deadlock issue reappears when commenting out this: Line 329 in 2714cb9.
No TSan violations on aarch64-linux/AWS Graviton2 either (apart from in the Ctrl-C signal handler) – very mysterious.
Can you build a TSan version under Rosetta? Just for the extra data point?
I'm not sure I'll get around to compiling stuff in Rosetta – everything is quite slow with TSan, and I'm not sure what I'd learn from a Rosetta trace. I did, however, re-run things on aarch64 with libuv also instrumented:
(More output elided.)
The first one at first glance looks like a false positive – libuv appears to use volatile reads and GCC sync builtins instead of C11 memory model primitives, and the read in question is an opportunistic "relaxed" read, which is fine in practice, although probably undefined behaviour in principle in the C++ memory model – and the others are like before. I'm still not sure whether there is an actual issue here, or whether this is just an instrumentation issue.
My idea with a Rosetta trace is to find out what the common factor is between running under Rosetta and running natively on ARM that triggers the deadlock. Though I have my suspicions that it's related to the big.LITTLE design and work landing on the efficiency cores; there were some other issues that were fixed by limiting the thread count to the number of performance cores.
The issue already occurs with JULIA_NUM_THREADS=2 on my M1 Max machine. As for commonalities, perhaps you could start by looking at the Rosetta version in a debugger?
I will try to build Julia for x86 so I can do a debug build.
As it turns out, it's quite simple to build Julia for x86 on the M1: just open a terminal via Rosetta and it will lie to `make`, producing an x86 binary.
The TSan-reported races don't disappear with …
I don't think this is any better on Linux?
I think it's not related to feature detection, since it also happens under Rosetta: running Julia with TSan under Rosetta/x86 I get https://gist.github.com/gbaraldi/ba9f20b83f12c5b8f194fd76eccbb77a, which looks the same as on arm64.
This appears to "fix" the issue, i.e. gets the above test case and … running:

--- a/src/partr.c
+++ b/src/partr.c
@@ -458,6 +460,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             jl_atomic_store(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
             continue;
         }
+        __sync_synchronize();
         task = get_next_task(trypoptask, q); // WARNING: this should not yield
         if (ptls != ct->ptls)
             continue; // oops, get_next_task did yield--start over

Of course, this isn't an actually correct fix, but just papers over the issue by sprinkling a full barrier (`__sync_synchronize()`) into a relevant-looking spot.
Okay, I think I'll have to leave this issue here for a while, as I don't have the bandwidth to trace through the whole runtime stack. There are definitely some relaxed loads around where issues could be hiding (e.g. the usual trap of relaxed loads/stores being reordered past SC atomics on ARM), but I haven't succeeded in shrinking the area to look at much. Just to add some final detail, the …
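To make the failure mode concrete: the sleep/wake handshake the scheduler relies on is a store-buffering (Dekker-style) pattern, which is exactly the shape that breaks when accesses reorder past the intended synchronization on a weakly ordered machine. The sketch below is an illustration only, not the actual runtime logic: the names `queue_nonempty` and `sleeping` are invented, and Julia's `Threads.Atomic` operations are already sequentially consistent, so the explicit fences merely mark where the C runtime needs its barriers.

```julia
using Base.Threads

# Hypothetical stand-ins for the scheduler's state (illustrative only).
const queue_nonempty = Atomic{Bool}(false)
const sleeping       = Atomic{Bool}(false)

function enqueue_side()
    queue_nonempty[] = true   # publish work...
    atomic_fence()            # ...and make the store visible...
    if sleeping[]             # ...before deciding whether a wake-up is needed
        # wake the sleeping thread (e.g. notify a condition variable)
    end
end

function sleep_side()
    sleeping[] = true         # announce the intention to sleep...
    atomic_fence()            # ...before re-checking the queue
    if !queue_nonempty[]
        # actually block; without both barriers each side can read a stale
        # value, nobody issues a wake-up, and the program hangs
    end
end
```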
Given that we use off-thread signals on Darwin to implement safe points (and call …
Most of those seem likely (mostly) benign, though we continue to work towards annotating them correctly. (If you are interested in fixing libuv, we would be happy to merge a PR there that switches everything to C11 atomics.) But this one looks bad, as the GC is prohibited from running here: …
But perhaps this is an annotation issue, as we were about to destroy that thread, so possibly the compiler did not realize our fibers are eventually synchronized for GC on a per-thread basis?
I am fairly sure that fence is needed slightly earlier, at 6bba175#diff-12cc999d04343402aae8088cc67f9b6680cc9403d967e400e5235415e0040f47R473-R474, but also slightly later, at … On many platforms, doing seqcst on the operation or as a fence emits the same instructions, but not on AArch64 (https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), where seqcst on the operation is translated to merely an acquire/release on that operation and does not also emit the required full fence.
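To see that distinction with the same `@code_native` tooling used earlier in this thread, here is a small sketch. The helper names are made up, and the expected lowering is an assumption based on the cited mapping table: an explicit seq_cst fence should appear as a `dmb ish`, while a seq_cst store on the operation itself is expected to lower to a `stlr`, i.e. ordering attached to that single access.

```julia
using Base.Threads

const flag = Atomic{Int}(0)

# Ordering expressed as a standalone fence before the access:
fence_then_read() = (atomic_fence(); flag[])

# Ordering attached to the operation itself (Threads.Atomic stores are seq_cst):
store_value(x) = (flag[] = x; nothing)

# Compare the generated AArch64 code of the two:
# @code_native debuginfo=:none fence_then_read()
# @code_native debuginfo=:none store_value(1)
```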
I built the latest …
Can confirm: it does not work on master, but works on master + cherry-picked 38620a (which is the same commit after the rebase).
Now on master: 6df6105.
The arm64 native 1.7-beta3 sometimes hangs on nested threaded loops.
MWE:
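The MWE itself did not survive the page extraction. Judging from the reproducers quoted above (a `@threads` loop over `rand` nested inside plain loops that print a counter), it was presumably along these lines (a reconstruction, not the verbatim original):

```julia
using Base.Threads

function foo()
    @threads for i in 1:10
        rand(100)
    end
end

for i in 1:1000
    println(i)          # reportedly hangs after ~20-500 printed numbers
    for j in 1:10000
        foo()
    end
end
```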
Hangs for me after ~20-500 printed numbers. The non-minimal code was ^C-interruptible with the following stacktrace: …
ref #36617 (comment)