Very inefficient GC frame generation #15369
Example from #15717:

```julia
function f_simd(n::Integer)
    a = zeros(Float32, n)
    @inbounds @simd for i in eachindex(a)
        a[i] += 1
    end
    a
end
```

Vectorization works at the normal opt level (on LLVM 3.8 at least), but the function still generates 8 roots instead of 1 (for …).
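A hedged sketch of how one might count these roots mechanically. The `gc_frame_slots` helper and the `alloca [N x ...]` pattern are my assumptions about how the GC frame shows up in the printed IR on this Julia version, not part of the report; the exact shape varies across Julia/LLVM releases.

```julia
# Count GC-frame slots by scanning the printed LLVM IR for the frame alloca.
# Assumption (hypothetical helper): the GC frame appears as a single
# `alloca [N x ...]`; on builds where it doesn't, this returns 0.
using InteractiveUtils  # for code_llvm

function gc_frame_slots(f, argtypes)
    ir = sprint(io -> code_llvm(io, f, argtypes))
    m = match(r"alloca \[(\d+) x", ir)
    m === nothing ? 0 : parse(Int, m.captures[1])
end

function f_simd(n::Integer)
    a = zeros(Float32, n)
    @inbounds @simd for i in eachindex(a)
        a[i] += 1
    end
    a
end

gc_frame_slots(f_simd, (Int,))  # inspect the slot count on your build
```

Counting from text rather than from the IR data structures is crude, but it is version-agnostic enough to serve as a regression check.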
Code from #15457:

```julia
julia> f2() = ccall(:jl_get_cpu_name, Ref{String}, ())
f2 (generic function with 1 method)
```

```llvm
define %jl_value_t* @julia_f2_49667() #0 !dbg !6 {
top:
  %0 = call %jl_value_t*** @jl_get_ptls_states()
  %1 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %1, i64 0, i64 0
  %2 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %1, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %3 = bitcast [3 x %jl_value_t*]* %1 to i64*
  store i64 2, i64* %3, align 8
  %4 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %1, i64 0, i64 1
  %5 = bitcast %jl_value_t*** %0 to i64*
  %6 = load i64, i64* %5, align 8
  %7 = bitcast %jl_value_t** %4 to i64*
  store i64 %6, i64* %7, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %0, align 8
  %8 = call %jl_value_t* inttoptr (i64 140532991692000 to %jl_value_t* ()*)()
  store %jl_value_t* %8, %jl_value_t** %2, align 8
  %9 = load i64, i64* %7, align 8
  store i64 %9, i64* %5, align 8
  ret %jl_value_t* %8
}
```

Should have no GC root instead of 1.
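For reference, the snippet is runnable as-is; a quick sanity check follows from the `Ref{String}` return type in the report (the expectation that the result names the host CPU is an assumption about `jl_get_cpu_name`'s behavior, which is an internal libjulia function):

```julia
# ccall signature taken verbatim from the report above.
f2() = ccall(:jl_get_cpu_name, Ref{String}, ())

cpu = f2()
cpu isa String  # true; the value depends on the host, e.g. "haswell"
```

The `Ref{String}` return type is what forces codegen to root the returned value, which is why this tiny function is a minimal reproducer for the extra GC frame.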
Moved from #15402:

```julia
immutable Wrap{T}
    x::T
end

function f(a::Wrap)
    s = 0.0
    for i = 1:length(a.x)
        @inbounds s += a.x[i]
    end
    s
end
```

```llvm
define double @julia_f_49649(%jl_value_t*) #0 !dbg !6 {
top:
  %1 = call %jl_value_t*** @jl_get_ptls_states()
  %2 = alloca [4 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %4 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %5 = bitcast [4 x %jl_value_t*]* %2 to i64*
  store i64 4, i64* %5, align 8
  %6 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 1
  %7 = bitcast %jl_value_t*** %1 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast %jl_value_t** %6 to i64*
  store i64 %8, i64* %9, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %1, align 8
  %10 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 0, i32 0
  %11 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %11, %jl_value_t** %3, align 8
  %12 = getelementptr inbounds %jl_value_t, %jl_value_t* %11, i64 1
  %13 = bitcast %jl_value_t* %12 to i64*
  %14 = load i64, i64* %13, align 8
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L2, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %16 = bitcast %jl_value_t* %11 to double**
  %17 = load double*, double** %16, align 8
  br label %if

L.L2_crit_edge:                                   ; preds = %if
  store %jl_value_t* %11, %jl_value_t** %4, align 8
  br label %L2

L2:                                               ; preds = %L.L2_crit_edge, %top
  %s.0.lcssa = phi double [ %23, %L.L2_crit_edge ], [ 0.000000e+00, %top ]
  %18 = load i64, i64* %9, align 8
  store i64 %18, i64* %7, align 8
  ret double %s.0.lcssa

if:                                               ; preds = %if.lr.ph, %if
  %s.04 = phi double [ 0.000000e+00, %if.lr.ph ], [ %23, %if ]
  %"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %19, %if ]
  %19 = add i64 %"#temp#.03", 1
  %20 = add i64 %"#temp#.03", -1
  %21 = getelementptr double, double* %17, i64 %20
  %22 = load double, double* %21, align 8
  %23 = fadd double %s.04, %22
  %24 = icmp eq i64 %"#temp#.03", %14
  br i1 %24, label %L.L2_crit_edge, label %if
}
```

The inner loop is not affected anymore (and if the type is changed to integer or adding …).
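As an aside for readers on current Julia, `immutable` became `struct` in 1.0; a directly runnable version of the snippet above, with a small usage check (the concrete `Vector{Float64}` element type is my choice for illustration, matching the `double` loads in the IR):

```julia
# Modern-syntax version of the Wrap example; the GC-frame question is
# about codegen, not semantics, so the result is the plain sum.
struct Wrap{T}
    x::T
end

function f(a::Wrap)
    s = 0.0
    for i = 1:length(a.x)
        @inbounds s += a.x[i]
    end
    s
end

f(Wrap([1.0, 2.0, 3.0]))  # 6.0
```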
Will these all likely have the same eventual fix?
It depends. Once someone starts to work on this, we can make a checklist of the cases in this issue or open new ones if necessary.
Given that @Keno's work should be a fix-all solution, we can stick to the plan of using this issue to track all inefficient test cases. So I'm copying the case from #20981 over and closing that one.
Is the idea that his PR should fix all of these in one fell swoop, so we just collect them here to make sure they're all actually fixed?
It's a pretty general solution. All remaining problems should have their own issues.
Can we regression-test these somehow?
After codegen_rewrite2, codegen is no longer able to reuse GC frame slots within the same basic block, for both jlcall buffers and (especially) temporaries.

Testing with the following simple functions:

On 2bb94d6 (the juliabox version, before jn/codegen_rewrite2 and after jb/functions, so that the jlcall convention is the same as on current master), `f1()` and `f2()` both generate only 2 GC frame slots (note that the temporary is emitted directly into the jlcall frame in `f2()`, and the jlcall frames are always reused). On current master they generate 4 (4 temporaries) and 8 (4 temporaries and 2 × 2 jlcall frames without any reuse...).

Adding a useless branch that can be constant-folded by LLVM (but not by type inference) causes the jlcall buffer to be reused but not the temporaries.