addToStore(): Evaluate on the main stack#11152
Conversation
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2024-07-22-nix-team-meeting-minutes-163/49544/1 |
This hopefully avoids the need for all our Boehm GC coroutine workarounds, since the GC roots will be on the main stack of the thread. Fixes NixOS#11141.
Since we're not doing this anymore.
431377e to
b8684eb
Compare
|
Team discussion:
|
There was a problem hiding this comment.
# started on Sat Aug 17 00:36:50 2024
Performance counter stats for './run-result.sh' (4 runs):
4,747.09 msec task-clock:u # 0.998 CPUs utilized ( +- 0.11% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
190,199 page-faults:u # 40.066 K/sec ( +- 0.00% )
12,047,705,665 cycles:u # 2.538 GHz ( +- 0.16% )
22,541,252,252 instructions:u # 1.87 insn per cycle ( +- 0.02% )
4,413,428,462 branches:u # 929.712 M/sec ( +- 0.02% )
74,194,689 branch-misses:u # 1.68% of all branches ( +- 0.12% )
TopdownL1 # 16.2 % tma_backend_bound
# 41.4 % tma_bad_speculation
# 15.8 % tma_frontend_bound
# 26.6 % tma_retiring ( +- 0.20% )
# Table of individual measurements:
4.75286 (-0.00515) #
4.74741 (-0.01060) #
4.76610 (+0.00809) #
4.76568 (+0.00767) #
# Final result:
4.75801 +- 0.00468 seconds time elapsed ( +- 0.10% )
# started on Sat Aug 17 00:36:05 2024
Performance counter stats for './run-stable.sh' (4 runs):
5,041.44 msec task-clock:u # 0.998 CPUs utilized ( +- 0.44% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
189,840 page-faults:u # 37.656 K/sec ( +- 0.01% )
12,857,410,437 cycles:u # 2.550 GHz ( +- 0.38% )
25,654,027,530 instructions:u # 2.00 insn per cycle ( +- 0.04% )
4,689,079,121 branches:u # 930.106 M/sec ( +- 0.03% )
79,691,381 branch-misses:u # 1.70% of all branches ( +- 0.23% )
TopdownL1 # 17.2 % tma_backend_bound
# 37.0 % tma_bad_speculation
# 14.9 % tma_frontend_bound
# 30.9 % tma_retiring ( +- 0.14% )
# Table of individual measurements:
5.1002 (+0.0494) #
5.0718 (+0.0210) #
5.0440 (-0.0069) #
4.9874 (-0.0635) #
# Final result:
5.0508 +- 0.0241 seconds time elapsed ( +- 0.48% )
|
This does make evaluation with GC slightly worse, but greatly simplifies implementation. Performance cost seems low. I could not update this branch, but have a rebase at: https://github.com/tomberek/nix/tree/flip-coroutines |
|
I also measured and got a slight (not statistically significant) speedup: |
|
Backport failed because this pull request contains merge commits. You can either backport this pull request manually, or configure the action to skip merge commits. |
|
Does this mean we can get rid of 5740924 |
This was removed in NixOS#11152. However, we need it for the multi-threaded evaluator, because otherwise Boehm GC will crash while scanning the thread stack: #0 GC_push_all_eager (bottom=<optimized out>, top=<optimized out>) at extra/../mark.c:1488 #1 0x00007ffff74691d5 in GC_push_all_stack_sections (lo=<optimized out>, hi=<optimized out>, traced_stack_sect=0x0) at extra/../mark_rts.c:704 #2 GC_push_all_stacks () at extra/../pthread_stop_world.c:876 #3 GC_default_push_other_roots () at extra/../os_dep.c:2893 #4 0x00007ffff746235c in GC_mark_some (cold_gc_frame=0x7ffee8ecaa50 "`\304G\367\377\177") at extra/../mark.c:374 #5 0x00007ffff7465a8d in GC_stopped_mark (stop_func=stop_func@entry=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:875 #6 0x00007ffff7466724 in GC_try_to_collect_inner (stop_func=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:624 #7 0x00007ffff7466a22 in GC_collect_or_expand (needed_blocks=needed_blocks@entry=1, ignore_off_page=ignore_off_page@entry=0, retry=retry@entry=0) at extra/../alloc.c:1688 #8 0x00007ffff746878f in GC_allocobj (gran=<optimized out>, kind=<optimized out>) at extra/../alloc.c:1798 #9 GC_generic_malloc_inner (lb=<optimized out>, k=k@entry=1) at extra/../malloc.c:193 #10 0x00007ffff746cd40 in GC_generic_malloc_many (lb=<optimized out>, k=<optimized out>, result=<optimized out>) at extra/../mallocx.c:477 #11 0x00007ffff746cf35 in GC_malloc_kind (bytes=120, kind=1) at extra/../thread_local_alloc.c:187 #12 0x00007ffff796ede5 in nix::allocBytes (n=<optimized out>, n=<optimized out>) at ../src/libexpr/include/nix/expr/eval-inline.hh:19 This is because it will use the stack pointer of the coroutine, so it will scan a region of memory that doesn't exist, e.g. Stack for thread 0x7ffea4ff96c0 is [0x7ffe80197af0w,0x7ffea4ffa000) (where 0x7ffe80197af0w is the sp of the coroutine and 0x7ffea4ffa000 is the base of the thread stack). We don't scan coroutine stacks, because currently they don't have GC roots (there is no evaluation happening in coroutines). So there is currently no need to restore the other parts of the original patch, such as BoehmGCStackAllocator.
This was removed in #11152. However, we need it for the multi-threaded evaluator, because otherwise Boehm GC will crash while scanning the thread stack: #0 GC_push_all_eager (bottom=<optimized out>, top=<optimized out>) at extra/../mark.c:1488 #1 0x00007ffff74691d5 in GC_push_all_stack_sections (lo=<optimized out>, hi=<optimized out>, traced_stack_sect=0x0) at extra/../mark_rts.c:704 #2 GC_push_all_stacks () at extra/../pthread_stop_world.c:876 #3 GC_default_push_other_roots () at extra/../os_dep.c:2893 #4 0x00007ffff746235c in GC_mark_some (cold_gc_frame=0x7ffee8ecaa50 "`\304G\367\377\177") at extra/../mark.c:374 #5 0x00007ffff7465a8d in GC_stopped_mark (stop_func=stop_func@entry=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:875 #6 0x00007ffff7466724 in GC_try_to_collect_inner (stop_func=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:624 #7 0x00007ffff7466a22 in GC_collect_or_expand (needed_blocks=needed_blocks@entry=1, ignore_off_page=ignore_off_page@entry=0, retry=retry@entry=0) at extra/../alloc.c:1688 #8 0x00007ffff746878f in GC_allocobj (gran=<optimized out>, kind=<optimized out>) at extra/../alloc.c:1798 #9 GC_generic_malloc_inner (lb=<optimized out>, k=k@entry=1) at extra/../malloc.c:193 #10 0x00007ffff746cd40 in GC_generic_malloc_many (lb=<optimized out>, k=<optimized out>, result=<optimized out>) at extra/../mallocx.c:477 #11 0x00007ffff746cf35 in GC_malloc_kind (bytes=120, kind=1) at extra/../thread_local_alloc.c:187 #12 0x00007ffff796ede5 in nix::allocBytes (n=<optimized out>, n=<optimized out>) at ../src/libexpr/include/nix/expr/eval-inline.hh:19 This is because it will use the stack pointer of the coroutine, so it will scan a region of memory that doesn't exist, e.g. Stack for thread 0x7ffea4ff96c0 is [0x7ffe80197af0w,0x7ffea4ffa000) (where 0x7ffe80197af0w is the sp of the coroutine and 0x7ffea4ffa000 is the base of the thread stack). We don't scan coroutine stacks, because currently they don't have GC roots (there is no evaluation happening in coroutines). So there is currently no need to restore the other parts of the original patch, such as BoehmGCStackAllocator.
Motivation
Switching from
sinkToSource()tosourceToSink()causes the path filter evaluation to happen on the thread's main stack rather than on the coroutine stack. This should avoid the need for our Boehm GC coroutine workarounds, since the GC roots will be on the main stack of the thread.Fixes #11141.
Context
Priorities and Process
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.