-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
no-copy stacks #13099
no-copy stacks #13099
Conversation
A nice discussion of these possibilities, as implemented in the GNU Pth library, is described in the following paper www.gnu.org/s/pth/rse-pmt.ps |
9388766
to
30cd4b1
Compare
src/task.c
Outdated
@@ -1,9 +1,22 @@ | |||
// This file is a part of Julia. License is MIT: http://julialang.org/license | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whitespace is significant to the add_license_headers script
22ddc90
to
e473194
Compare
1a8676b
to
4f063d8
Compare
a created a simple benchmark for involving this change in set of related expensive operations. this is intended to be a model of the client julia> for i = 1:6
let io = Pipe(), bigdata = tuple(collect(big(1):1000)), stk=0
Base.link_pipe(io)
@time @sync begin
schedule(Base.sync_add(Task( () -> begin
for i = 1:10
serialize(io, bigdata)
end
close(io.in)
end, stk)))
schedule(Base.sync_add(Task( () -> let cnt = 0
while !eof(io)
r = deserialize(io)
#@assert r == bigdata
cnt += 1
end
#@assert cnt == 10
end, stk)))
end
close(io)
end
end stk = 0
3.585497 seconds (4.26 M allocations: 131.958 MB, 3.11% gc time)
2.583458 seconds (3.52 M allocations: 101.017 MB, 1.27% gc time)
2.539079 seconds (3.52 M allocations: 101.017 MB, 0.84% gc time)
2.535257 seconds (3.52 M allocations: 101.017 MB, 1.17% gc time)
2.545820 seconds (3.52 M allocations: 101.017 MB, 1.04% gc time)
2.550698 seconds (3.52 M allocations: 101.017 MB, 0.88% gc time) stk = 1 MB
3.647399 seconds (4.26 M allocations: 131.958 MB, 3.30% gc time)
2.448537 seconds (3.52 M allocations: 101.017 MB, 1.43% gc time)
2.439833 seconds (3.52 M allocations: 101.017 MB, 0.88% gc time)
2.446417 seconds (3.52 M allocations: 101.017 MB, 1.21% gc time)
2.452492 seconds (3.52 M allocations: 101.017 MB, 1.13% gc time)
2.422734 seconds (3.52 M allocations: 101.017 MB, 0.89% gc time) this represents an approximate 4.5% speedup. one more benefit of this not mentioned above, is that these tasks are not pinned to a thread (unlike the usual copy_stack tasks), so it is possible they could be restarted on a different thread than the one that created them. plus, the obligatory raw numbers for just a ping-pong context switch:
(numbers computed on an Intel(R) Xeon(R) CPU E7- 8850 @ 2.00GHz) |
before or after 0.4.0? is there a test and/or failure case that's fixing? |
it should fix |
7c005a6
to
22e4d96
Compare
status update: this now appears to be stable on all platforms (the travis failure was due to 35c5234, not this PR), and it better aligns Julia's behavior with how many other systems implement this same functionality (for example: the D runtime, win32, ucontext, GNU Pth). however, while the win32 Fiber API is quite convenient, internally, it is has a large number of implementation flaws. The D language runtime (Boost licensed) has an implementation that we can borrow from to deal with this shortcoming: https://github.com/D-Programming-Language/druntime/blob/1f957372e5dadb92ab1d621d68232dbf8a2dbccf/src/core/thread.d#L4482-L4503 |
118f644
to
0aec198
Compare
0aec198
to
394d044
Compare
I've squashed the history, so if you want to see the individual commits, look at 0aec198. |
this will close #7824 |
When is this likely to be merged? JuliaGraphics/Gtk.jl#161 is an issue for me, and I'd like to know how much effort to invest in rearranging my program to work around it. |
+1 on JuliaGraphics/Gtk.jl#161 being a problem and wanting this merged if it fixes it. |
Bump (want re JuliaGraphics/Gtk.jl#161). I'm not familiar with the task infrastructure, so unfortunately I can't provide a useful review. Anyone, anyone? Bueller? |
@vtjnash: What is the status of this? Using |
I've been very low on time to write the custom windows support code to work-around the buggy native version on that platform. the |
Well its a little bit annoying to put |
394d044
to
f1878c2
Compare
@@ -101,10 +145,11 @@ struct _jl_tls_states_t { | |||
struct _jl_module_t *current_module; | |||
struct _jl_task_t *volatile current_task; | |||
struct _jl_task_t *root_task; | |||
//#ifdef COPY_STACKS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conditional can't be used in public facing code, but if you do delete COPY_STACKS, then this would get deleted to.
ui/repl.c
Outdated
// pwine_get_version = GetProcAddress(hntdll, "wine_get_version"); | ||
// return pwine_get_version != 0; | ||
//} | ||
//#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete?
jl_ucontext_t ctx; // saved thread state | ||
void *stkbuf; // malloc'd memory (either copybuf or stack) | ||
size_t bufsz; // actual sizeof stkbuf | ||
unsigned int copy_stack:31; // sizeof stack for copybuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be called something like copy_amount
? This name sounds really similar to the boolean COPY_STACKS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it probably could. I use this as a boolean often enough though too (as the runtime version of COPY_STACKS) that I'm being intentionally a bit ambiguous.
e57a9fc
to
c661962
Compare
never copy over the root stack: this is a hybrid approach to COPY_STACK where the root task is never moved or copied, and all other task stacks are layered into the same memory area (ptls->basestack + basesize) several strategies exist for making new stacks: ucontext_t (where it is available, aka linux) unw_context_t (as an alternative to ucontext_t that avoids a syscall on task-switch) makecontext (as a posix standard implemention) setjmp/longjmp-based implementation (for systems where this is sufficient) Windows Fibers (implemented here, since we can be more efficient and reliable than the official Fibers API) also, uses an alternate stack for use in collecting stack-overflow backtraces like posix, but managed manually
c661962
to
8e04328
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to have some more devdocs, but LGTM!
(I think I have read through it 5 times now, so I start to see what is happening here)
static void memcpy_a16(uint64_t *to, uint64_t *from, size_t nb) | ||
{ | ||
memcpy((char*)jl_assume_aligned(to, 16), (char*)jl_assume_aligned(from, 16), nb); | ||
//uint64_t *end = (uint64_t*)((char*)from + nb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this here just as a explanation of the above?
// select an implementation of stack switching. | ||
// currently only COPY_STACKS is recommended. | ||
#ifndef COPY_STACKS | ||
// select whether to enable the COPY_STACKS stack switching optimization | ||
#define COPY_STACKS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to have an explanation here or in the devdocs of the high-level strategy for task-switching.
I also wonder if we need this flag (and what the actual behavior change is that is toggled by it),
should I have the expectation that it generally works?
make -C test read
fails and I managed to run out of memory during testall
From worker 5: fatal: error thrown and no exception handler available.
From worker 5: OutOfMemoryError()
From worker 5: rec_backtrace at /home/vchuravy/src/julia/src/stackwalk.c:94
From worker 5: record_backtrace at /home/vchuravy/src/julia/src/task.c:185
From worker 5: jl_throw at /home/vchuravy/src/julia/src/task.c:374
From worker 5: jl_malloc_stack at /home/vchuravy/src/julia/src/gc-stacks.c:138
From worker 5: ctx_switch at /home/vchuravy/src/julia/src/task.c:645 [inlined]
From worker 5: jl_switchto at /home/vchuravy/src/julia/src/task.c:322
From worker 5: unknown function (ip: 0x7f1a41ac0f78)
From worker 5: unknown function (ip: 0x7f1a41ac1204)
From worker 5: unknown function (ip: 0x7f1a41b2d052)
From worker 5: jl_apply_generic at /home/vchuravy/src/julia/src/gf.c:2198
From worker 5: jl_apply at /home/vchuravy/src/julia/src/julia.h:1562 [inlined]
From worker 5: finish_task at /home/vchuravy/src/julia/src/task.c:172
From worker 5: start_task at /home/vchuravy/src/julia/src/task.c:512
From worker 5: unknown function (ip: 0xffffffffffffffff)
I am going to merge this so that we can shake out bugs (if any) and that we can base PARTR off this |
Very exciting! |
@vtjnash @vchuravy: Is this expected to solve |
I don’t think so... it just changes the way stack switching is implemented. You still can’t run Julia code not on the main thread. |
We only have one thread so that is not the issue. Once upon a time everything was working but with the introduction of COPY_STACKS Gtk broke (see |
We're still using COPY_STACKS by default; you can try turning it off in src/options.h. |
@JeffBezanson Many thanks Jeff. It is now working without a crash! This is absolutely fantastic. Can we please please make this default and rush this into a release? Julia deserves high quality Gtk bindings that are not crashing. I know that this is a feature, but at the same time it is a bugfix for JuliaGraphics/Gtk.jl#161 (from 2015) which still leads to major limitations. @vtjnash: Thanks so much! |
We are working towards that. |
Is there something missing which needs to be done other than undef COPY_STACKS? I could make a PR that toggles that if you like. |
See #13099 (comment) |
ok thanks. Then I stay patient and wait for it. |
Working on this now (we're doing a sprint on #22631, so it came up). During the read test we start 65536 tasks, so allocating a stack for each exceeds the kernel's maximum number of allowed mappings per process. We can fix this by falling back to stack-copying if we're running out of mappings. The user can request that a Task have a dedicated stack using Things to try in the future:
|
Running inference on its own large stack combined with static analysis of how much stack space a task could potentially need could potentially allow allocating very small "stack" objects. I'm not sure how common tasks where the maximum stack size is statically knowable would actually turn out to be. |
this is a experimental exploration of the ramifications of the restoration of the!COPY_STACKS
makefile option, for discussion and analysis.this changes thread state to be a unw_context_t instead of a jmp_buf. this would allow unwinding a moved stack (for temporary inspection / remote backtrace of an arbitrary task). however, this increases the size of a jl_task_t object, so I'm not sure this is worthwhile. plus, while it still shouldn't need to do a sigprocmask syscall, it still requires more code to do the context switch. it might be better to figure out how to just make due with the currentthis change removes all of the inline assembly code from this file, which is good for portability.this change also makes it possible / makes use of the possibility of having more than 1 stack, but fewer than N (where N is the number of Tasks that have ever been created). This removes the need to guess jl_stackbase in julia_init and would makes it possible to create additional tasks with a "permanent" stack (for example, for better C-interop with the Gtk event loop to solve JuliaGraphics/Gtk.jl#161), without needing to reserve a new stack for every Task. I believe this change may be beneficial for embedding julia into other contexts (not just Gtk), where modifying the primary stack may be undesirable. plus, since it allows julia_init to be called from anywhere, it may be easier to call it via an FFI (which might not be particularly near the top of the stack) or from the static initializer for a dll.
i have compiled this, with and without !COPY_STACKS, on Ubuntu, Apple, and Windows (all x86_64). I don't have any more exotic machines like FreeBSD to test on, nor have I tried it on ARM or PPC yet.
ping @JeffBezanson @carnaval