
RFC: Create local gc frame for throw #11508

Closed
wants to merge 6 commits into from

Conversation

yuyichao
Contributor

This PR implements the idea proposed by @carnaval in #11284 (comment) to create a local GC frame for throwing errors.

TL;DR
I think this idea can be slightly generalized and combined with #11284 to get the best performance for (not) throwing errors.

Benchmark:
The script used to benchmark can be found here.

Timing output on current master (full output with LLVM IR)

f0
   1.428 seconds
f1
   1.462 seconds
f2
   1.592 seconds
getindex
   1.747 seconds

Timing output with this PR (full output with LLVM IR)

f0
   1.388 seconds
f1
   1.466 seconds
f2
   1.535 seconds
getindex
   1.478 seconds

(The relevant numbers are the timings for f2 and getindex.)

As shown in the benchmark, creating a local GC frame (if there isn't a global one yet) can boost the performance of complex functions, but a #11284-style transformation can still be useful, especially for simple functions.

Currently this PR only has a special case for throw, which is easy because the control flow terminates there and no rooted variables can be leaked. It is probably possible to generalize this to other blocks that don't have too complicated control flow (or at least when calling functions that do not return). In that case, it can probably be used together with #11284 for the best performance.
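To make the targeted pattern concrete, here is a minimal sketch (function and helper names are hypothetical, not from this PR) of a function whose only allocations happen on the error path; with this PR, the GC frame push/pop can be confined to the `if` block instead of the function prologue:

```julia
# Hypothetical example: only the throw branch allocates (the string
# interpolation boxes values), so only that branch needs GC roots.
@noinline function checked_div(a::Int, b::Int)
    if b == 0
        throw(ArgumentError(string("division by zero: ", a)))
    end
    div(a, b)  # fast path: no heap allocation, no roots needed
end

checked_div(7, 2)
```

The fast path through `checked_div` never needs to touch `jl_pgcstack`, which is exactly the savings the benchmarks below are measuring.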

@yuyichao
Contributor Author

As for the sysimg size, this PR increases it by 176 bytes compared to current master after stripping.

Probably a combination of exceptions with arguments rarely being used and the GC frame not needing many instructions.

@yuyichao
Contributor Author

Merging GC frames should be working now.

As shown below, the initialization of the sub GC frame elements is not removed yet (the first three lines in the if block for k); that shouldn't be too hard.

I'm pretty sure the debugging info for the GC roots of the subframes will be broken after the merge (or maybe LLVM does some magic to preserve the debugging info when an instruction is replaced?) @Keno

So far the logic is all quite simple. But before I go much further, what's the plan for using LLVM's GC support features? Is that going to deprecate small optimizations like this? @carnaval

julia> @noinline g(a::ANY, b::ANY, c::ANY) = BoundsError(a, b)
g (generic function with 1 method)

julia> function f(a, b, c)
       if (a == 0)
       throw(g(a, b, c))
       end
       a
       end
f (generic function with 1 method)

julia> function k(a, b, c)
       if (a == 0)
       throw(g(a, b, c))
       end
       g(a, a, a)
       end
k (generic function with 1 method)

julia> @code_llvm f(1, 2, 3)

define i64 @julia_f_20959(i64, i64, i64) {
top:
  %3 = icmp eq i64 %0, 0
  br i1 %3, label %if, label %L

if:                                               ; preds = %top
  %4 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = bitcast [5 x %jl_value_t*]* %4 to i64*
  store i64 6, i64* %6, align 8
  %7 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 1
  %8 = bitcast %jl_value_t** %7 to %jl_value_t***
  %9 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %9, %jl_value_t*** %8, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %10 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = call %jl_value_t* @jl_box_int64(i64 signext 0)
  store %jl_value_t* %12, %jl_value_t** %5, align 8
  %13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %13, %jl_value_t** %10, align 8
  %14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
  store %jl_value_t* %14, %jl_value_t** %11, align 8
  %15 = call %jl_value_t* @julia_g_20960(%jl_value_t* inttoptr (i64 140728708276624 to %jl_value_t*), %jl_value_t** %5, i32 3)
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 3)
  %16 = load %jl_value_t*** %8, align 8
  store %jl_value_t** %16, %jl_value_t*** @jl_pgcstack, align 8
  br label %L

L:                                                ; preds = %if, %top
  ret i64 %0
}

julia> @code_llvm k(1, 2, 3)

define %jl_value_t* @julia_k_20967(i64, i64, i64) {
top:
  %3 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 2
  %5 = bitcast [5 x %jl_value_t*]* %3 to i64*
  store i64 6, i64* %5, align 8
  %6 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 1
  %7 = bitcast %jl_value_t** %6 to %jl_value_t***
  %8 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %8, %jl_value_t*** %7, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %9 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = icmp eq i64 %0, 0
  br i1 %11, label %if, label %L

if:                                               ; preds = %top
  store %jl_value_t* null, %jl_value_t** %4, align 8
  store %jl_value_t* null, %jl_value_t** %9, align 8
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %12 = call %jl_value_t* @jl_box_int64(i64 signext 0)
  store %jl_value_t* %12, %jl_value_t** %4, align 8
  %13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %13, %jl_value_t** %9, align 8
  %14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
  store %jl_value_t* %14, %jl_value_t** %10, align 8
  %15 = call %jl_value_t* @julia_g_20960(%jl_value_t* inttoptr (i64 140728708276624 to %jl_value_t*), %jl_value_t** %4, i32 3)
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 3)
  br label %L

L:                                                ; preds = %if, %top
  %16 = call %jl_value_t* @jl_box_int64(i64 signext %0)
  store %jl_value_t* %16, %jl_value_t** %4, align 8
  %17 = call %jl_value_t* @jl_box_int64(i64 signext %0)
  store %jl_value_t* %17, %jl_value_t** %9, align 8
  %18 = call %jl_value_t* @jl_box_int64(i64 signext %0)
  store %jl_value_t* %18, %jl_value_t** %10, align 8
  %19 = call %jl_value_t* @julia_g_20960(%jl_value_t* inttoptr (i64 140728708276624 to %jl_value_t*), %jl_value_t** %4, i32 3)
  %20 = load %jl_value_t*** %7, align 8
  store %jl_value_t** %20, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %19
}

@yuyichao
Contributor Author

The logic used here should be the same as adjusting/restoring argDepth in other parts of codegen (and even more conservative). However, is it guaranteed that any variable that is going to be used in a block will be rooted (in the codegen sense)? I can see that all local variables are rooted at the beginning of the function, so I guess the question should be: are local variables the only values that can be shared between different (Julia) blocks?

Example: Can the following happen?

V # somehow not a local variable and not rooted
L:
# Some code that uses `V`, and we restore `argDepth` after codegen of this part.
V = .... # Assigning to V needs boxing and we root V here since argDepth < maxDepth
# The slot we assign to can be overwritten by the line above, causing a missing GC root.
goto L

OK, I know in this case V is a local variable and it will be rooted before block L is generated. And that's basically my question: does this have to be the case? Can block L use a value from another block that is not a local variable?
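For illustration, here is a surface-level Julia sketch (the allocating helper is hypothetical) of the safe case described above: a heap value survives a control-flow edge precisely because it is bound to a local variable, which codegen roots up front:

```julia
@noinline make_boxed(x) = Ref(x)  # hypothetical allocating helper

function survives_edge(a)
    v = make_boxed(a)  # `v` is a local variable, so codegen roots its slot
    while a > 0        # control-flow edge that `v` must stay live across
        a -= 1
    end
    v[]                # still reachable through the rooted local slot
end
```

If some non-local temporary could cross the edge instead, the argDepth restore described above could reuse its root slot; that is the scenario the question asks about.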

@yuyichao
Contributor Author

The unneeded NULL assignments are removed (see below).

This change actually causes a 24k decrease in the final sysimg size (compared to the current master)... I guess it's because LLVM has more freedom to optimize the code?

julia> @noinline g(a::ANY, b::ANY, c::ANY) = BoundsError(a, b)
g (generic function with 1 method)                                                

julia> function k(a, b, c)
       if (a == 0)
       throw(g(a, b, c))
       end
       g(a, a, a)
       end
k (generic function with 1 method)                                                

julia> function f(a, b, c)
       if (a == 0)
       throw(g(a, b, c))
       end
       a
       end
f (generic function with 1 method)                                                

julia> @code_llvm f(1, 2, 3)

define i64 @julia_f_20969(i64, i64, i64) {                                        
top:                                                                              
  %3 = icmp eq i64 %0, 0                                                          
  br i1 %3, label %if, label %L                                                   

if:                                               ; preds = %top
  %4 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = bitcast [5 x %jl_value_t*]* %4 to i64*
  store i64 6, i64* %6, align 8
  %7 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 1
  %8 = bitcast %jl_value_t** %7 to %jl_value_t***
  %9 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %9, %jl_value_t*** %8, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %10 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = call %jl_value_t* @jl_box_int64(i64 signext 0)
  store %jl_value_t* %12, %jl_value_t** %5, align 8
  %13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %13, %jl_value_t** %10, align 8
  %14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
  store %jl_value_t* %14, %jl_value_t** %11, align 8
  %15 = call %jl_value_t* @julia_g_20970(%jl_value_t* inttoptr (i64 140728706842768 to %jl_value_t*), %jl_value_t** %5, i32 3)
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 3)
  %16 = load %jl_value_t*** %8, align 8
  store %jl_value_t** %16, %jl_value_t*** @jl_pgcstack, align 8
  br label %L

L:                                                ; preds = %if, %top
  ret i64 %0
}

julia> @code_llvm k(1, 2, 3)

define %jl_value_t* @julia_k_20982(i64, i64, i64) {
top:
  %3 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 2
  %5 = bitcast [5 x %jl_value_t*]* %3 to i64*
  store i64 6, i64* %5, align 8
  %6 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 1
  %7 = bitcast %jl_value_t** %6 to %jl_value_t***
  %8 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %8, %jl_value_t*** %7, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %9 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = icmp eq i64 %0, 0
  br i1 %11, label %if, label %L

if:                                               ; preds = %top
  %12 = call %jl_value_t* @jl_box_int64(i64 signext 0)
  store %jl_value_t* %12, %jl_value_t** %4, align 8
  %13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %13, %jl_value_t** %9, align 8
  %14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
  store %jl_value_t* %14, %jl_value_t** %10, align 8
  %15 = call %jl_value_t* @julia_g_20970(%jl_value_t* inttoptr (i64 140728706842768 to %jl_value_t*), %jl_value_t** %4, i32 3)
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 3)
  br label %L

L:                                                ; preds = %if, %top
  %16 = call %jl_value_t* @jl_box_int64(i64 signext %0)
  store %jl_value_t* %16, %jl_value_t** %4, align 8
  %17 = call %jl_value_t* @jl_box_int64(i64 signext %0)
  store %jl_value_t* %17, %jl_value_t** %9, align 8
  %18 = call %jl_value_t* @jl_box_int64(i64 signext %0)
  store %jl_value_t* %18, %jl_value_t** %10, align 8
  %19 = call %jl_value_t* @julia_g_20970(%jl_value_t* inttoptr (i64 140728706842768 to %jl_value_t*), %jl_value_t** %4, i32 3)
  %20 = load %jl_value_t*** %7, align 8
  store %jl_value_t** %20, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %19
}

@carnaval
Contributor

carnaval commented Jun 2, 2015

So about your question: no, only local variables (including gensyms) survive control-flow edges (in fact it is stronger than that: it is the only way for a value to survive a statement boundary).

I like the fact that your implementation is "minimal disturbance" but as you said it's not very general since it only handles what happens inside the argument list of throw. It would eventually go away if we switch to the statepoint based infrastructure but we are nowhere near that. I'm on the fence about merging this since it clutters the code a bit (but not much) and we do not desperately need it either.

I don't think it would be very hard to make this construct happen on the minimal dominator of the throw call, just a bit more bookkeeping. It would make it more resilient to code transforms such as putting a throw argument in a local.

@yuyichao
Contributor Author

yuyichao commented Jun 2, 2015

So about your question: no, only local variables (including gensyms) survive control-flow edges (in fact it is stronger than that: it is the only way for a value to survive a statement boundary).

Good. That makes this kind of optimization much easier....

I like the fact that your implementation is "minimal disturbance" but as you said it's not very general since it only handles what happens inside the argument list of throw. It would eventually go away if we switch to the statepoint based infrastructure but we are nowhere near that. I'm on the fence about merging this since it clutters the code a bit (but not much) and we do not desperately need it either.

By code clutter, do you mean something like these: 1 2 3? There are some cleanup commits (including the ones above). I didn't squash all commits this time because I think it might be better to split those commits out into another PR. Should I do that to make what this PR is doing clearer?

I don't think it would be very hard to make this construct happen on the minimal dominator of the throw call, just a bit more bookkeeping. It would make it more resilient to code transforms such as putting a throw argument in a local.

I'm experimenting with that.

@yuyichao
Contributor Author

yuyichao commented Jun 2, 2015

Actually, I've tried to create a local GC frame for all functions with Bottom return type, but it throws an error during bootstrap....

@yuyichao
Contributor Author

yuyichao commented Jun 4, 2015

@carnaval The cleanup part of this PR has been split out to #11569 (which is still included in this PR), and only the last 3 commits in this PR are about setting up local GC frames. See the links below for the real diff. (The commits should also be relatively clear about what they are doing, I hope.)

yuyichao/julia@gcframe-cleanup...yuyichao:throw-local-gc
yuyichao/julia@yuyichao:94ca799...yuyichao:e469ab5

@yuyichao
Contributor Author

yuyichao commented Jun 5, 2015

This PR is no longer limited to codegen of the local GC frame.

In the last two commits, I've added two additional optimizations.

  1. throw and its arguments are not inlined anymore, which saves a lot of instructions in the caller.
  2. There was a missing unreachable when emitting throw. 90fd1b3#diff-6d4d21428a67320600faf5a1a9f3a16aR2235 . Fixing this also saves some instructions and some unnecessary work.

Benchmarks to follow soon.

@yuyichao
Contributor Author

yuyichao commented Jun 5, 2015

It turns out that my old benchmarks were mainly measuring the jlcall wrapper and similar overhead. I've updated my code to do the C call directly.

Benchmark script

#!/usr/bin/julia -f

macro throw_interp(typ, arg1, arg2)
    :(throw(typ(string("Random format string, ", $arg1, ", ", $arg2))))
end

@noinline throw1(typ::ANY, arg1, arg2) = throw(typ(arg1, arg2))
@noinline throw_interp(typ::Any, arg1, arg2) =
    @throw_interp(typ, arg1, arg2)

function f0(a, b)
    if b === 0
        a
    end
    a
end

function f1(a, b)
    if b === 0
        throw1(BoundsError, a, b)
    end
    a
end

function f2(a, b)
    if b === 0
        throw(BoundsError(a, b))
    end
    a
end

function f3(a, b)
    if b === 0
        throw_interp(ArgumentError, a, b)
    end
    a
end

function f4(a, b)
    if b === 0
        @throw_interp(ArgumentError, a, b)
    end
    a
end

macro time_func(f, args...)
    args = eval(current_module(), Expr(:tuple, args...))::Tuple
    argnames = Symbol[gensym() for i in 1:length(args)]
    types = map(typeof, args)
    quote
        function wrapper($(argnames...))
            $(Expr(:meta, :noinline))
            $f($(argnames...))
        end
        function timing_wrapper()
            println($f, $types)
            wrapper($(args...))
            gc()
            @time for i in 1:1000000000
                wrapper($(args...))
            end
            gc()
        end
        timing_wrapper()
    end
end

@code_llvm f0(0, 1)
@code_llvm f1(0, 1)
@code_llvm f2(0, 1)
@code_llvm f3(0, 1)
@code_llvm f4(0, 1)

v = Base.svec(1, 2)
@code_llvm getindex(v, 1)

@time_func(f0, 0, 1)
@time_func(f1, 0, 1)
@time_func(f2, 0, 1)
@time_func(f3, 0, 1)
@time_func(f4, 0, 1)
@time_func(getindex, v, 1)

This now covers the majority (if not all) of the use cases in Base.

Benchmark result in next comment.

Edit: Script update

@yuyichao
Contributor Author

yuyichao commented Jun 5, 2015

The benchmarks are done on

Result (Only numbers are pasted here, for full output including the LLVM IR, see links to the corresponding GIST):

  1. master / #11569

    f0(Int64,Int64)
       1.593 seconds     
    f1(Int64,Int64)
       2.230 seconds     
    f2(Int64,Int64)
       3.867 seconds     
    f3(Int64,Int64)
       2.250 seconds     
    f4(Int64,Int64)
       4.145 seconds     
    getindex(SimpleVector,Int64)
       4.796 seconds     
  2. #11284 Helper function

    f0(Int64,Int64)
       1.603 seconds     
    f1(Int64,Int64)
       2.230 seconds     
    f2(Int64,Int64)
       1.918 seconds     
    f3(Int64,Int64)
       2.287 seconds     
    f4(Int64,Int64)
       4.149 seconds     
    getindex(SimpleVector,Int64)
       5.266 seconds     
  3. This PR

    f0(Int64,Int64)
       1.605 seconds     
    f1(Int64,Int64)
       2.256 seconds     
    f2(Int64,Int64)
       2.238 seconds     
    f3(Int64,Int64)
       2.230 seconds     
    f4(Int64,Int64)
       2.552 seconds     
    getindex(SimpleVector,Int64)
       3.524 seconds     
  • Analysis and conclusion
    1. Throwing an error slows down the non-error path by 2x for simple functions.
    2. Using a non-inlined thrower is probably the best one can get.
    3. The helper function approach only works for simple cases (it should work for getindex but does not, due to jlcall being generated for functions with only jl_value_t* arguments, #11304).
    4. The mixed approach in this PR can almost match the performance of a custom thrower. Even for the string interpolation case (which is more complex than most error-throwing code) the difference is very small.

This PR might not have a big impact on the performance of current Base, mainly because people have been pretty careful about this already. Nonetheless, we can see from the getindex function that it can improve the performance of a (relatively) complicated function. It's still a little hard to tell how much impact creating a GC frame has in a bigger function, but these simple benchmarks should show that throwing an error is almost free for the normal path now.

@yuyichao
Contributor Author

yuyichao commented Jun 5, 2015

It turns out that I was still benchmarking the wrapper (although not the dynamic dispatch) for the last function. Benchmark script and results updated.

@ScottPJones
Contributor

👍 💯 This would have saved me a LOT of grief, if it had already been in Julia... please get this merged ASAP 😀 I love the thorough micro benchmarking!

@yuyichao
Contributor Author

yuyichao commented Jun 5, 2015

The commit that turns off Julia-level inlining for "throw expressions" doesn't have a very big impact on the string interpolation case but brings the number for the simple exception constructor case down to the noinline-thrower case. (Edit: and it was comparable to the string interpolation case (~10% slower) without noinline.)

The drawbacks of this change are:

  1. sys.so after stripping increases by 0.6%.
  2. The singleton constructor is also not inlined (which probably also contributes a little to the sysimg size increase). I could try increasing the inlining threshold instead of disabling inlining, but I'm not sure it is worth doing...
  3. This also suffers a little from #11304 (jlcall generated for functions with only jl_value_t* arguments), but not by much, and only on the error path.

@mbauman
Sponsor Member

mbauman commented Jun 5, 2015

👍 I'm not qualified to review this code, but I really like the outcomes here. It will be very nice to not need to worry about throw(…) affecting unexceptional hot-loops anymore.

@yuyichao
Contributor Author

yuyichao commented Jun 9, 2015

Interestingly, this time AppVeyor didn't freeze...

@yuyichao yuyichao mentioned this pull request Jun 10, 2015
…alid instructions in finalize_gc_frame", which seems to cause windows CI hanging."

This reverts commit 7b1895a.
@yuyichao
Contributor Author

Update for 190afa6

@yuyichao yuyichao added compiler:codegen Generation of LLVM IR and native code GC Garbage collector labels Jun 13, 2015
yuyichao referenced this pull request Jun 22, 2015
this allows the lazy emission & addition of static gc roots

plus, this eliminates the need to delete the gc frame (since llvm can
trivially do so) and makes emit_gcpop implicit at all ret instructions,
rather than explicit
@yuyichao yuyichao added the performance Must go faster label Jun 30, 2015
@yuyichao
Contributor Author

Closed in favor of @vtjnash's jn/codegen_rewrite2 branch. The approach there is more general and should also fit better with other changes we need in codegen. There have been so many changes on master that this would need to be rewritten anyway...

@yuyichao yuyichao closed this Dec 16, 2015
@yuyichao yuyichao deleted the throw-local-gc branch December 16, 2015 03:15
@hayd
Member

hayd commented Jan 8, 2016

xlink #14543

Will that fix the perf for the other issues linked here e.g. #12152?

This pull request was closed.