Skip to content

addToStore(): Evaluate on the main stack#11152

Merged
edolstra merged 5 commits intoNixOS:masterfrom
DeterminateSystems:flip-coroutines
Aug 20, 2024
Merged

addToStore(): Evaluate on the main stack#11152
edolstra merged 5 commits intoNixOS:masterfrom
DeterminateSystems:flip-coroutines

Conversation

@edolstra
Copy link
Member

@edolstra edolstra commented Jul 22, 2024

Motivation

Switching from sinkToSource() to sourceToSink() causes the path filter evaluation to happen on the thread's main stack rather than on the coroutine stack. This should avoid the need for our Boehm GC coroutine workarounds, since the GC roots will be on the main stack of the thread.

Fixes #11141.

Context

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions bot added the store Issues and pull requests concerning the Nix store label Jul 22, 2024
@edolstra edolstra marked this pull request as ready for review July 22, 2024 13:19
@edolstra edolstra requested a review from Ericson2314 as a code owner July 22, 2024 13:19
@roberth roberth changed the title addToStore(): Do evaluation on the main stack addToStore(): Evaluate on the main stack Jul 24, 2024
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-07-22-nix-team-meeting-minutes-163/49544/1

edolstra added 4 commits July 24, 2024 16:24
This hopefully avoids the need for all our Boehm GC coroutine
workarounds, since the GC roots will be on the main stack of the
thread.

Fixes NixOS#11141.
Since we're not doing this anymore.
@edolstra
Copy link
Member Author

Team discussion:

  • Need to benchmark that this doesn't introduce a performance regression.
  • Otherwise looks good.

Copy link
Contributor

@tomberek tomberek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# started on Sat Aug 17 00:36:50 2024


 Performance counter stats for './run-result.sh' (4 runs):

          4,747.09 msec task-clock:u                     #    0.998 CPUs utilized               ( +-  0.11% )
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           190,199      page-faults:u                    #   40.066 K/sec                       ( +-  0.00% )
    12,047,705,665      cycles:u                         #    2.538 GHz                         ( +-  0.16% )
    22,541,252,252      instructions:u                   #    1.87  insn per cycle              ( +-  0.02% )
     4,413,428,462      branches:u                       #  929.712 M/sec                       ( +-  0.02% )
        74,194,689      branch-misses:u                  #    1.68% of all branches             ( +-  0.12% )
                        TopdownL1                 #     16.2 %  tma_backend_bound      
                                                  #     41.4 %  tma_bad_speculation    
                                                  #     15.8 %  tma_frontend_bound     
                                                  #     26.6 %  tma_retiring             ( +-  0.20% )

           # Table of individual measurements:
           4.75286 (-0.00515) #
           4.74741 (-0.01060) #
           4.76610 (+0.00809) #
           4.76568 (+0.00767) #

           # Final result:
           4.75801 +- 0.00468 seconds time elapsed  ( +-  0.10% )

# started on Sat Aug 17 00:36:05 2024


 Performance counter stats for './run-stable.sh' (4 runs):

          5,041.44 msec task-clock:u                     #    0.998 CPUs utilized               ( +-  0.44% )
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           189,840      page-faults:u                    #   37.656 K/sec                       ( +-  0.01% )
    12,857,410,437      cycles:u                         #    2.550 GHz                         ( +-  0.38% )
    25,654,027,530      instructions:u                   #    2.00  insn per cycle              ( +-  0.04% )
     4,689,079,121      branches:u                       #  930.106 M/sec                       ( +-  0.03% )
        79,691,381      branch-misses:u                  #    1.70% of all branches             ( +-  0.23% )
                        TopdownL1                 #     17.2 %  tma_backend_bound      
                                                  #     37.0 %  tma_bad_speculation    
                                                  #     14.9 %  tma_frontend_bound     
                                                  #     30.9 %  tma_retiring             ( +-  0.14% )

            # Table of individual measurements:
            5.1002 (+0.0494) #
            5.0718 (+0.0210) #
            5.0440 (-0.0069) #
            4.9874 (-0.0635) #

            # Final result:
            5.0508 +- 0.0241 seconds time elapsed  ( +-  0.48% )

@tomberek
Copy link
Contributor

This does make evaluation with GC slightly worse, but greatly simplifies implementation. Performance cost seems low.

I could not update this branch, but have a rebase at: https://github.com/tomberek/nix/tree/flip-coroutines

@edolstra
Copy link
Member Author

I also measured and got a slight (not statistically significant) speedup:

elapsed time:       median =     18.8708  mean =     18.8354  stddev =      0.0964  min =     18.6845  max =     18.9424  [not rejected, p=0.10362, Δ=-0.10630±0.20030]

@edolstra edolstra enabled auto-merge August 20, 2024 15:21
@edolstra edolstra merged commit e1d1eac into NixOS:master Aug 20, 2024
@edolstra edolstra deleted the flip-coroutines branch August 20, 2024 16:19
@github-actions
Copy link

Backport failed because this pull request contains merge commits. You can either backport this pull request manually, or configure the action to skip merge commits.

edolstra added a commit to DeterminateSystems/nix-src that referenced this pull request Aug 23, 2024
@Mic92
Copy link
Member

Mic92 commented Sep 22, 2024

Does this mean we can get rid of 5740924 enableLargeConfig = true?

edolstra added a commit to DeterminateSystems/nix-src that referenced this pull request Aug 6, 2025
This was removed in NixOS#11152. However,
we need it for the multi-threaded evaluator, because otherwise Boehm
GC will crash while scanning the thread stack:

  #0  GC_push_all_eager (bottom=<optimized out>, top=<optimized out>) at extra/../mark.c:1488
  #1  0x00007ffff74691d5 in GC_push_all_stack_sections (lo=<optimized out>, hi=<optimized out>, traced_stack_sect=0x0) at extra/../mark_rts.c:704
  #2  GC_push_all_stacks () at extra/../pthread_stop_world.c:876
  #3  GC_default_push_other_roots () at extra/../os_dep.c:2893
  #4  0x00007ffff746235c in GC_mark_some (cold_gc_frame=0x7ffee8ecaa50 "`\304G\367\377\177") at extra/../mark.c:374
  #5  0x00007ffff7465a8d in GC_stopped_mark (stop_func=stop_func@entry=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:875
  #6  0x00007ffff7466724 in GC_try_to_collect_inner (stop_func=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:624
  #7  0x00007ffff7466a22 in GC_collect_or_expand (needed_blocks=needed_blocks@entry=1, ignore_off_page=ignore_off_page@entry=0, retry=retry@entry=0) at extra/../alloc.c:1688
  #8  0x00007ffff746878f in GC_allocobj (gran=<optimized out>, kind=<optimized out>) at extra/../alloc.c:1798
  #9  GC_generic_malloc_inner (lb=<optimized out>, k=k@entry=1) at extra/../malloc.c:193
  #10 0x00007ffff746cd40 in GC_generic_malloc_many (lb=<optimized out>, k=<optimized out>, result=<optimized out>) at extra/../mallocx.c:477
  #11 0x00007ffff746cf35 in GC_malloc_kind (bytes=120, kind=1) at extra/../thread_local_alloc.c:187
  #12 0x00007ffff796ede5 in nix::allocBytes (n=<optimized out>, n=<optimized out>) at ../src/libexpr/include/nix/expr/eval-inline.hh:19

This is because it will use the stack pointer of the coroutine, so it
will scan a region of memory that doesn't exist, e.g.

  Stack for thread 0x7ffea4ff96c0 is [0x7ffe80197af0w,0x7ffea4ffa000)

(where 0x7ffe80197af0w is the sp of the coroutine and 0x7ffea4ffa000
is the base of the thread stack).

We don't scan coroutine stacks, because currently they don't have GC
roots (there is no evaluation happening in coroutines). So there is
currently no need to restore the other parts of the original patch,
such as BoehmGCStackAllocator.
edolstra added a commit that referenced this pull request Aug 7, 2025
This was removed in #11152. However,
we need it for the multi-threaded evaluator, because otherwise Boehm
GC will crash while scanning the thread stack:

  #0  GC_push_all_eager (bottom=<optimized out>, top=<optimized out>) at extra/../mark.c:1488
  #1  0x00007ffff74691d5 in GC_push_all_stack_sections (lo=<optimized out>, hi=<optimized out>, traced_stack_sect=0x0) at extra/../mark_rts.c:704
  #2  GC_push_all_stacks () at extra/../pthread_stop_world.c:876
  #3  GC_default_push_other_roots () at extra/../os_dep.c:2893
  #4  0x00007ffff746235c in GC_mark_some (cold_gc_frame=0x7ffee8ecaa50 "`\304G\367\377\177") at extra/../mark.c:374
  #5  0x00007ffff7465a8d in GC_stopped_mark (stop_func=stop_func@entry=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:875
  #6  0x00007ffff7466724 in GC_try_to_collect_inner (stop_func=0x7ffff7453c80 <GC_never_stop_func>) at extra/../alloc.c:624
  #7  0x00007ffff7466a22 in GC_collect_or_expand (needed_blocks=needed_blocks@entry=1, ignore_off_page=ignore_off_page@entry=0, retry=retry@entry=0) at extra/../alloc.c:1688
  #8  0x00007ffff746878f in GC_allocobj (gran=<optimized out>, kind=<optimized out>) at extra/../alloc.c:1798
  #9  GC_generic_malloc_inner (lb=<optimized out>, k=k@entry=1) at extra/../malloc.c:193
  #10 0x00007ffff746cd40 in GC_generic_malloc_many (lb=<optimized out>, k=<optimized out>, result=<optimized out>) at extra/../mallocx.c:477
  #11 0x00007ffff746cf35 in GC_malloc_kind (bytes=120, kind=1) at extra/../thread_local_alloc.c:187
  #12 0x00007ffff796ede5 in nix::allocBytes (n=<optimized out>, n=<optimized out>) at ../src/libexpr/include/nix/expr/eval-inline.hh:19

This is because it will use the stack pointer of the coroutine, so it
will scan a region of memory that doesn't exist, e.g.

  Stack for thread 0x7ffea4ff96c0 is [0x7ffe80197af0w,0x7ffea4ffa000)

(where 0x7ffe80197af0w is the sp of the coroutine and 0x7ffea4ffa000
is the base of the thread stack).

We don't scan coroutine stacks, because currently they don't have GC
roots (there is no evaluation happening in coroutines). So there is
currently no need to restore the other parts of the original patch,
such as BoehmGCStackAllocator.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation store Issues and pull requests concerning the Nix store

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Evaluation error in flake regression test suite

4 participants