
Introduce at-dispose to replace do-block constructors. #309

Merged: 1 commit merged on May 25, 2022

Conversation

maleadt (Owner) commented on May 25, 2022

I noticed, once again, that our use of do-block constructors (which we use for scoped resource clean-up) resulted in poorly optimized code, presumably because of the closures they introduce (JuliaLang/julia#15276). GPUCompiler already had a bunch of explicit calls to LLVM.dispose to avoid creating closures in hot code, but that approach doesn't scale and looks bad (a minimal sketch of both patterns follows the list below). So I explored two possibilities:

  1. make the GC manage resources
  2. replace do-block syntax with something that doesn't require closures
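For context, here is a minimal sketch of the two existing patterns; it is illustrative only, and the exact constructor signatures vary across LLVM.jl versions:

using LLVM

# Pattern 1: do-block constructor. Convenient and exception-safe, but the
# do block is a closure, which can pessimize surrounding hot code.
Context() do ctx
    # ... use ctx ...
end

# Pattern 2: explicit clean-up via LLVM.dispose, as GPUCompiler resorted to
# in hot code; shown here with try/finally for exception safety.
ctx = Context()
try
    # ... use ctx ...
finally
    LLVM.dispose(ctx)
end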

I started out with option 1; a WIP branch can be found at https://github.com/maleadt/LLVM.jl/tree/tb/finalizers. Although initial results were promising, I ditched the effort after running into two major issues:

  • objects referring to the same memory can be constructed separately, e.g., an LLVM.Module object can also be created by calling parent(::BasicBlock). That would then require handle-based refcounting, or an object factory like CUDA.jl's unique Context constructors (which suffer from the same problem, since you can look up the current context using API calls), both of which are messy (see the sketch after this list).
  • objects can be constructed without a reference to their owning parent. For example, LLVM.Function needs to keep its parent Module alive, but you can also look up a function from a CallInst. Our generic infrastructure currently assumes that all Values can be created equally, whereas in the new design Function objects (which are a subtype of Value) would also need a Module argument. Worse, it's not always possible to look that up by calling parent, as instructions can be deleted from their parent BasicBlock without their memory being reclaimed. In that scenario, we'd end up with a Function object that doesn't root its parent Module, potentially resulting in early frees.
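To make the first problem concrete, here is a hypothetical sketch in plain Julia (not LLVM.jl code) of why independently constructed wrappers around the same raw handle defeat per-object finalizers:

# Hypothetical sketch: two wrapper objects around the same raw handle.
# With one finalizer per wrapper, the underlying resource would be freed
# twice (or too early) unless handles are refcounted or wrappers are
# deduplicated through some factory.
mutable struct ModuleWrapper
    handle::Ptr{Cvoid}
    function ModuleWrapper(handle)
        obj = new(handle)
        finalizer(obj) do w
            # free_module(w.handle)  # hypothetical; would double-free when handles alias
        end
        return obj
    end
end

raw = Ptr{Cvoid}(UInt(0x1234))   # stand-in for a raw LLVMModuleRef
a = ModuleWrapper(raw)           # e.g. the Module you constructed yourself
b = ModuleWrapper(raw)           # e.g. the Module recovered via parent(::BasicBlock)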

Given this complexity, I decided to go with option 2 and leave the responsibility of tying object lifetimes to their respective owners with the user. This has also been working fine so far, in part because we dispose early, which means bugs get discovered relatively quickly, whereas bugs in a finalizer-based approach may lurk for a long time.

The result is a @dispose macro that replaces the do-block syntax. The effect is pretty dramatic: on the LLVM.jl test suite, perf reports that the total executed instruction count drops by 20%!
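Roughly, usage looks like the following sketch (illustrative, not an exact excerpt from this PR; the macro takes name=constructor pairs followed by a body):

using LLVM

# Equivalent in spirit to the explicit try/finally pattern sketched above,
# but the clean-up is generated by the macro, so there is no closure and no
# dispose call to forget.
@dispose ctx=Context() begin
    # ... use ctx ...
end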

Before:

$ perf stat -B /home/tim/.cache/jl/installs/bin/linux/x64/1.8/julia-1.8-latest-linux-x86_64/bin/julia --project test/runtests.jl
┌ Warning: It is recommended to run the LLVM.jl test suite with -g2
└ @ Main ~/Julia/pkg/LLVM/test/runtests.jl:5
JIT session error: Symbols not found: [ mysum ]
Test Summary: | Pass  Total   Time
LLVM          | 1686   1686  19.3s

 Performance counter stats for '/home/tim/.cache/jl/installs/bin/linux/x64/1.8/julia-1.8-latest-linux-x86_64/bin/julia --project test/runtests.jl':

         24,574.63 msec task-clock                #    1.250 CPUs utilized
               982      context-switches          #   39.960 /sec
                85      cpu-migrations            #    3.459 /sec
           135,062      page-faults               #    5.496 K/sec
   110,672,199,590      cycles                    #    4.504 GHz                      (82.48%)
       445,623,166      stalled-cycles-frontend   #    0.40% frontend cycles idle     (81.61%)
     2,037,997,644      stalled-cycles-backend    #    1.84% backend cycles idle      (83.41%)
   166,889,246,973      instructions              #    1.51  insn per cycle
                                                  #    0.01  stalled cycles per insn  (84.16%)
    33,943,204,016      branches                  #    1.381 G/sec                    (84.21%)
     1,016,486,605      branch-misses             #    2.99% of all branches          (84.20%)

      19.664647623 seconds time elapsed

      20.904811000 seconds user
       3.653577000 seconds sys
After:

Test Summary: | Pass  Total   Time
LLVM          | 1676   1676  15.2s

 Performance counter stats for '/home/tim/.cache/jl/installs/bin/linux/x64/1.8/julia-1.8-latest-linux-x86_64/bin/julia --project test/runtests.jl':

         20,417.89 msec task-clock                #    1.316 CPUs utilized
             1,011      context-switches          #   49.515 /sec
                85      cpu-migrations            #    4.163 /sec
           126,260      page-faults               #    6.184 K/sec
    94,119,265,188      cycles                    #    4.610 GHz                      (82.71%)
       377,228,483      stalled-cycles-frontend   #    0.40% frontend cycles idle     (81.26%)
     1,948,056,334      stalled-cycles-backend    #    2.07% backend cycles idle      (83.07%)
   136,612,268,983      instructions              #    1.45  insn per cycle
                                                  #    0.01  stalled cycles per insn  (84.33%)
    27,600,681,223      branches                  #    1.352 G/sec                    (84.37%)
       874,709,734      branch-misses             #    3.17% of all branches          (84.36%)

      15.513353511 seconds time elapsed

      16.665442000 seconds user
       3.742267000 seconds sys

The effect on GPUCompiler is also impressive:

Before:

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 IR generation                  218    6.54s   53.8%  30.0ms   1.68GiB   72.3%  7.87MiB
   rewrite                      218    3.39s   27.9%  15.6ms    259MiB   10.9%  1.19MiB
     lower throw                174   68.3ms    0.6%   393μs   14.6MiB    0.6%  86.1KiB
     hide trap                   61   2.07ms    0.0%  33.9μs   30.5KiB    0.0%     512B
     hide unreachable            78   1.72ms    0.0%  22.1μs    149KiB    0.0%  1.91KiB
       find                      78   1.02ms    0.0%  13.1μs   17.6KiB    0.0%     231B
       predecessors              78    266μs    0.0%  3.41μs   19.6KiB    0.0%     257B
       replace                   78   42.7μs    0.0%   548ns   1.88KiB    0.0%    24.6B
     lower throw (extra)         43    327μs    0.0%  7.61μs   11.3KiB    0.0%     270B
   emission                     218    1.63s   13.4%  7.47ms   1.00GiB   43.1%  4.69MiB
   clean-up                     218   1.05ms    0.0%  4.80μs   62.2KiB    0.0%     292B
 IR post-processing             218    3.82s   31.4%  17.5ms    300MiB   12.6%  1.38MiB
   optimization                  96    3.72s   30.6%  38.8ms    297MiB   12.5%  3.09MiB
     nvvmreflect                223   32.9μs    0.0%   147ns     0.00B    0.0%    0.00B
   clean-up                     218   8.93ms    0.1%  41.0μs   10.2KiB    0.0%    48.0B
 Validation                      16    1.26s   10.4%  78.7ms    239MiB   10.0%  14.9MiB
 lower byval                     17    224ms    1.8%  13.2ms   37.7MiB    1.6%  2.22MiB
 LLVM back-end                   38    197ms    1.6%  5.18ms   18.4MiB    0.8%   495KiB
   machine-code generation       38    167ms    1.4%  4.40ms   10.4MiB    0.4%   281KiB
     remove freeze                1   13.9ms    0.1%  13.9ms   2.19MiB    0.1%  2.19MiB
     remove trap                  1    667ns    0.0%   667ns     0.00B    0.0%    0.00B
   preparation                   38   29.7ms    0.2%   782μs   7.94MiB    0.3%   214KiB
 validation                     221    104ms    0.9%   470μs   63.0MiB    2.7%   292KiB
 Library linking                218   11.3ms    0.1%  51.7μs   65.7KiB    0.0%     308B
   runtime library               21   10.7ms    0.1%   512μs   2.95KiB    0.0%     144B
   target libraries             102   42.0μs    0.0%   412ns     0.00B    0.0%    0.00B
 Julia front-end                219   1.26ms    0.0%  5.77μs   94.7KiB    0.0%     443B
 Debug info removal              37   43.0μs    0.0%  1.16μs     0.00B    0.0%    0.00B
 ──────────────────────────────────────────────────────────────────────────────────────

Test Summary: | Pass  Broken  Total   Time
GPUCompiler   |  134       4    138  38.6s

After:

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 IR generation                  218    3.88s   49.4%  17.8ms   1.52GiB   71.2%  7.14MiB
   emission                     218    1.60s   20.4%  7.35ms   1.00GiB   46.7%  4.68MiB
   rewrite                      218    822ms   10.5%  3.77ms    103MiB    4.7%   485KiB
     lower throw                174    122ms    1.6%   701μs   14.6MiB    0.7%  86.1KiB
     hide trap                   61   2.06ms    0.0%  33.7μs   30.5KiB    0.0%     512B
     hide unreachable            78   1.65ms    0.0%  21.2μs    149KiB    0.0%  1.91KiB
       find                      78   1.01ms    0.0%  12.9μs   17.6KiB    0.0%     231B
       predecessors              78    249μs    0.0%  3.19μs   19.6KiB    0.0%     257B
       replace                   78   44.8μs    0.0%   575ns   1.88KiB    0.0%    24.6B
     lower throw (extra)         43    328μs    0.0%  7.63μs   11.3KiB    0.0%     268B
   clean-up                     218   1.03ms    0.0%  4.71μs   61.9KiB    0.0%     291B
 IR post-processing             218    2.27s   28.8%  10.4ms    275MiB   12.6%  1.26MiB
   optimization                  96    2.17s   27.6%  22.6ms    271MiB   12.4%  2.83MiB
     nvvmreflect                223   31.9μs    0.0%   143ns     0.00B    0.0%    0.00B
   clean-up                     218   8.45ms    0.1%  38.8μs   10.2KiB    0.0%    48.0B
 Validation                      16    1.26s   16.0%  78.6ms    239MiB   10.9%  14.9MiB
 LLVM back-end                   38    183ms    2.3%  4.80ms   18.3MiB    0.8%   494KiB
   machine-code generation       38    159ms    2.0%  4.18ms   10.4MiB    0.5%   280KiB
     remove freeze                1   5.85ms    0.1%  5.85ms   2.19MiB    0.1%  2.19MiB
     remove trap                  1    583ns    0.0%   583ns     0.00B    0.0%    0.00B
   preparation                   38   23.5ms    0.3%   619μs   7.94MiB    0.4%   214KiB
 lower byval                     17    162ms    2.1%  9.53ms   34.9MiB    1.6%  2.05MiB
 validation                     221   97.8ms    1.2%   443μs   62.7MiB    2.9%   290KiB
 Library linking                218   11.5ms    0.1%  52.8μs   65.7KiB    0.0%     308B
   runtime library               21   11.0ms    0.1%   522μs   2.95KiB    0.0%     144B
   target libraries             102   45.6μs    0.0%   447ns     0.00B    0.0%    0.00B
 Julia front-end                219   1.23ms    0.0%  5.59μs   94.7KiB    0.0%     443B
 Debug info removal              37   40.3μs    0.0%  1.09μs     0.00B    0.0%    0.00B
 ──────────────────────────────────────────────────────────────────────────────────────

Test Summary: | Pass  Broken  Total   Time
GPUCompiler   |  134       4    138  34.5s

So a 10% improvement on fairly realistic use of LLVM.jl.
