Very inefficient GC frame generation #15369

Closed
yuyichao opened this issue Mar 5, 2016 · 9 comments · Fixed by #21888
Labels
compiler:codegen Generation of LLVM IR and native code GC Garbage collector performance Must go faster regression Regression in behavior compared to a previous version


yuyichao commented Mar 5, 2016

After codegen_rewrite2, codegen can no longer reuse GC frame slots within the same basic block, either for jlcall buffers or (especially) for temporaries.

Testing with the following simple functions:

julia> @noinline g1(a, b) = nothing
g1 (generic function with 1 method)

julia> @noinline g2(a...) = nothing
g2 (generic function with 1 method)

julia> function f1()
           g1(Ref(1), Ref(2))
           g1(Ref(1), Ref(2))
       end
f1 (generic function with 1 method)

julia> function f2()
           g2(Ref(1), Ref(2))
           g2(Ref(1), Ref(2))
       end
f2 (generic function with 1 method)

On 2bb94d6 (the juliabox version, which is before jn/codegen_rewrite2 but after jb/functions, so that the jlcall convention is the same as on current master), f1() and f2() both generate only 2 GC frame slots (note that in f2() the temporaries are emitted directly into the jlcall frame, and the jlcall frames are always reused):

define void @julia_f1_23552() #0 {
top:
  %0 = alloca [4 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = bitcast [4 x %jl_value_t*]* %0 to i64*
  store i64 4, i64* %2, align 8
  %3 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = load i64, i64* bitcast ([80040 x i8*]* @jl_tls_states to i64*), align 8
  %5 = bitcast %jl_value_t** %3 to i64*
  store i64 %4, i64* %5, align 8
  store %jl_value_t** %.sub, %jl_value_t*** bitcast ([80040 x i8*]* @jl_tls_states to %jl_value_t***), align 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %6 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = call %jl_value_t* @jl_gc_alloc_1w()
  %8 = getelementptr inbounds %jl_value_t, %jl_value_t* %7, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %8, align 8
  store %jl_value_t* %7, %jl_value_t** %1, align 8
  %9 = load i64, i64* inttoptr (i64 139617449926784 to i64*), align 128
  %10 = bitcast %jl_value_t* %7 to i64*
  store i64 %9, i64* %10, align 16
  store %jl_value_t* %7, %jl_value_t** %1, align 8
  %11 = call %jl_value_t* @jl_gc_alloc_1w()
  %12 = getelementptr inbounds %jl_value_t, %jl_value_t* %11, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %12, align 8
  store %jl_value_t* %11, %jl_value_t** %6, align 8
  %13 = load i64, i64* inttoptr (i64 139617449926832 to i64*), align 16
  %14 = bitcast %jl_value_t* %11 to i64*
  store i64 %13, i64* %14, align 16
  store %jl_value_t* %11, %jl_value_t** %6, align 8
  call void @julia_g1_23553(%jl_value_t* %7, %jl_value_t* %11) #0
  %15 = call %jl_value_t* @jl_gc_alloc_1w()
  %16 = getelementptr inbounds %jl_value_t, %jl_value_t* %15, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %16, align 8
  store %jl_value_t* %15, %jl_value_t** %1, align 8
  %17 = load i64, i64* inttoptr (i64 139617449926784 to i64*), align 128
  %18 = bitcast %jl_value_t* %15 to i64*
  store i64 %17, i64* %18, align 16
  store %jl_value_t* %15, %jl_value_t** %1, align 8
  %19 = call %jl_value_t* @jl_gc_alloc_1w()
  %20 = getelementptr inbounds %jl_value_t, %jl_value_t* %19, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %20, align 8
  store %jl_value_t* %19, %jl_value_t** %6, align 8
  %21 = load i64, i64* inttoptr (i64 139617449926832 to i64*), align 16
  %22 = bitcast %jl_value_t* %19 to i64*
  store i64 %21, i64* %22, align 16
  store %jl_value_t* %19, %jl_value_t** %6, align 8
  call void @julia_g1_23553(%jl_value_t* %15, %jl_value_t* %19) #0
  %23 = load i64, i64* %5, align 8
  store i64 %23, i64* bitcast ([80040 x i8*]* @jl_tls_states to i64*), align 8
  ret void
}

define void @julia_f2_23557() #0 {
top:
  %0 = alloca [4 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = bitcast [4 x %jl_value_t*]* %0 to i64*
  store i64 4, i64* %2, align 8
  %3 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = load i64, i64* bitcast ([80040 x i8*]* @jl_tls_states to i64*), align 8
  %5 = bitcast %jl_value_t** %3 to i64*
  store i64 %4, i64* %5, align 8
  store %jl_value_t** %.sub, %jl_value_t*** bitcast ([80040 x i8*]* @jl_tls_states to %jl_value_t***), align 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %6 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = call %jl_value_t* @jl_gc_alloc_1w()
  %8 = getelementptr inbounds %jl_value_t, %jl_value_t* %7, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %8, align 8
  store %jl_value_t* %7, %jl_value_t** %1, align 8
  %9 = load i64, i64* inttoptr (i64 139617449926784 to i64*), align 128
  %10 = bitcast %jl_value_t* %7 to i64*
  store i64 %9, i64* %10, align 16
  store %jl_value_t* %7, %jl_value_t** %1, align 8
  %11 = call %jl_value_t* @jl_gc_alloc_1w()
  %12 = getelementptr inbounds %jl_value_t, %jl_value_t* %11, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %12, align 8
  store %jl_value_t* %11, %jl_value_t** %6, align 8
  %13 = load i64, i64* inttoptr (i64 139617449926832 to i64*), align 16
  %14 = bitcast %jl_value_t* %11 to i64*
  store i64 %13, i64* %14, align 16
  store %jl_value_t* %11, %jl_value_t** %6, align 8
  %15 = call %jl_value_t* @julia_g2_23558(%jl_value_t* inttoptr (i64 139617449836992 to %jl_value_t*), %jl_value_t** %1, i32 2)
  %16 = call %jl_value_t* @jl_gc_alloc_1w()
  %17 = getelementptr inbounds %jl_value_t, %jl_value_t* %16, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %17, align 8
  store %jl_value_t* %16, %jl_value_t** %1, align 8
  %18 = load i64, i64* inttoptr (i64 139617449926784 to i64*), align 128
  %19 = bitcast %jl_value_t* %16 to i64*
  store i64 %18, i64* %19, align 16
  store %jl_value_t* %16, %jl_value_t** %1, align 8
  %20 = call %jl_value_t* @jl_gc_alloc_1w()
  %21 = getelementptr inbounds %jl_value_t, %jl_value_t* %20, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139617450557648 to %jl_value_t*), %jl_value_t** %21, align 8
  store %jl_value_t* %20, %jl_value_t** %6, align 8
  %22 = load i64, i64* inttoptr (i64 139617449926832 to i64*), align 16
  %23 = bitcast %jl_value_t* %20 to i64*
  store i64 %22, i64* %23, align 16
  store %jl_value_t* %20, %jl_value_t** %6, align 8
  %24 = call %jl_value_t* @julia_g2_23558(%jl_value_t* inttoptr (i64 139617449836992 to %jl_value_t*), %jl_value_t** %1, i32 2)
  %25 = load i64, i64* %5, align 8
  store i64 %25, i64* bitcast ([80040 x i8*]* @jl_tls_states to i64*), align 8
  ret void
}

On current master, f1() generates 4 slots (4 temporaries) and f2() generates 8 (4 temporaries plus two 2-slot jlcall frames), without any reuse:

define void @julia_f1_22963() #0 {
top:
  %0 = alloca [6 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [6 x %jl_value_t*], [6 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %0, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %2 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %3 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %4 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %5 = bitcast [6 x %jl_value_t*]* %0 to i64*
  store i64 8, i64* %5, align 8
  %6 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %0, i64 0, i64 1
  %7 = load i64, i64* bitcast (%jl_value_t*** @jl_tls_states to i64*), align 8
  %8 = bitcast %jl_value_t** %6 to i64*
  store i64 %7, i64* %8, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_tls_states, align 8
  %9 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %9, %jl_value_t** %1, align 8
  %10 = getelementptr inbounds %jl_value_t, %jl_value_t* %9, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %10, align 8
  %11 = bitcast %jl_value_t* %9 to i64*
  store i64 1, i64* %11, align 16
  %12 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %12, %jl_value_t** %2, align 8
  %13 = getelementptr inbounds %jl_value_t, %jl_value_t* %12, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %13, align 8
  %14 = bitcast %jl_value_t* %12 to i64*
  store i64 2, i64* %14, align 16
  call void @julia_g1_22964(%jl_value_t* %9, %jl_value_t* %12) #0
  %15 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %15, %jl_value_t** %3, align 8
  %16 = getelementptr inbounds %jl_value_t, %jl_value_t* %15, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %16, align 8
  %17 = bitcast %jl_value_t* %15 to i64*
  store i64 1, i64* %17, align 16
  %18 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %18, %jl_value_t** %4, align 8
  %19 = getelementptr inbounds %jl_value_t, %jl_value_t* %18, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %19, align 8
  %20 = bitcast %jl_value_t* %18 to i64*
  store i64 2, i64* %20, align 16
  call void @julia_g1_22964(%jl_value_t* %15, %jl_value_t* %18) #0
  %21 = load i64, i64* %8, align 8
  store i64 %21, i64* bitcast (%jl_value_t*** @jl_tls_states to i64*), align 8
  ret void
}

julia> @code_llvm f2()

define void @julia_f2_22968() #0 {
top:
  %0 = alloca [10 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 6
  %3 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %4 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %5 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %6 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %6, align 8
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %7 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %7, align 8
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %8 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 9
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = bitcast [10 x %jl_value_t*]* %0 to i64*
  store i64 16, i64* %9, align 8
  %10 = getelementptr [10 x %jl_value_t*], [10 x %jl_value_t*]* %0, i64 0, i64 1
  %11 = load i64, i64* bitcast (%jl_value_t*** @jl_tls_states to i64*), align 8
  %12 = bitcast %jl_value_t** %10 to i64*
  store i64 %11, i64* %12, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_tls_states, align 8
  %13 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %13, %jl_value_t** %1, align 8
  %14 = getelementptr inbounds %jl_value_t, %jl_value_t* %13, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %14, align 8
  %15 = bitcast %jl_value_t* %13 to i64*
  store i64 1, i64* %15, align 16
  %16 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %16, %jl_value_t** %4, align 8
  %17 = getelementptr inbounds %jl_value_t, %jl_value_t* %16, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %17, align 8
  %18 = bitcast %jl_value_t* %16 to i64*
  store i64 2, i64* %18, align 16
  store %jl_value_t* %13, %jl_value_t** %3, align 8
  store %jl_value_t* %16, %jl_value_t** %8, align 8
  %19 = call %jl_value_t* @julia_g2_22969(%jl_value_t* inttoptr (i64 139780292051200 to %jl_value_t*), %jl_value_t** %3, i32 2)
  %20 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %20, %jl_value_t** %5, align 8
  %21 = getelementptr inbounds %jl_value_t, %jl_value_t* %20, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %21, align 8
  %22 = bitcast %jl_value_t* %20 to i64*
  store i64 1, i64* %22, align 16
  %23 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %23, %jl_value_t** %6, align 8
  %24 = getelementptr inbounds %jl_value_t, %jl_value_t* %23, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %24, align 8
  %25 = bitcast %jl_value_t* %23 to i64*
  store i64 2, i64* %25, align 16
  store %jl_value_t* %20, %jl_value_t** %2, align 8
  store %jl_value_t* %23, %jl_value_t** %7, align 8
  %26 = call %jl_value_t* @julia_g2_22969(%jl_value_t* inttoptr (i64 139780292051200 to %jl_value_t*), %jl_value_t** %2, i32 2)
  %27 = load i64, i64* %12, align 8
  store i64 %27, i64* bitcast (%jl_value_t*** @jl_tls_states to i64*), align 8
  ret void
}

Adding a useless branch that can be constant folded by LLVM (but not by type inference) causes the jlcall buffer to be reused, but not the temporaries:

julia> function f3()
           g2(Ref(1), Ref(2))
           1 == 1 ? 1 : 1
           g2(Ref(1), Ref(2))
       end
f3 (generic function with 1 method)

define void @julia_f3_22983() #0 {
top:
  %0 = alloca [8 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 6
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %3 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %4 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %5 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %5, align 8
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %6 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = bitcast [8 x %jl_value_t*]* %0 to i64*
  store i64 12, i64* %7, align 8
  %8 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %0, i64 0, i64 1
  %9 = load i64, i64* bitcast (%jl_value_t*** @jl_tls_states to i64*), align 8
  %10 = bitcast %jl_value_t** %8 to i64*
  store i64 %9, i64* %10, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_tls_states, align 8
  %11 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %11, %jl_value_t** %1, align 8
  %12 = getelementptr inbounds %jl_value_t, %jl_value_t* %11, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %12, align 8
  %13 = bitcast %jl_value_t* %11 to i64*
  store i64 1, i64* %13, align 16
  %14 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %14, %jl_value_t** %3, align 8
  %15 = getelementptr inbounds %jl_value_t, %jl_value_t* %14, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %15, align 8
  %16 = bitcast %jl_value_t* %14 to i64*
  store i64 2, i64* %16, align 16
  store %jl_value_t* %11, %jl_value_t** %2, align 8
  store %jl_value_t* %14, %jl_value_t** %6, align 8
  %17 = call %jl_value_t* @julia_g2_22969(%jl_value_t* inttoptr (i64 139780292051200 to %jl_value_t*), %jl_value_t** %2, i32 2)
  %18 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %18, %jl_value_t** %4, align 8
  %19 = getelementptr inbounds %jl_value_t, %jl_value_t* %18, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %19, align 8
  %20 = bitcast %jl_value_t* %18 to i64*
  store i64 1, i64* %20, align 16
  %21 = call %jl_value_t* @jl_gc_alloc_1w()
  store %jl_value_t* %21, %jl_value_t** %5, align 8
  %22 = getelementptr inbounds %jl_value_t, %jl_value_t* %21, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139780294998144 to %jl_value_t*), %jl_value_t** %22, align 8
  %23 = bitcast %jl_value_t* %21 to i64*
  store i64 2, i64* %23, align 16
  store %jl_value_t* %18, %jl_value_t** %2, align 8
  store %jl_value_t* %21, %jl_value_t** %6, align 8
  %24 = call %jl_value_t* @julia_g2_22969(%jl_value_t* inttoptr (i64 139780292051200 to %jl_value_t*), %jl_value_t** %2, i32 2)
  %25 = load i64, i64* %10, align 8
  store i64 %25, i64* bitcast (%jl_value_t*** @jl_tls_states to i64*), align 8
  ret void
}
@yuyichao yuyichao added regression Regression in behavior compared to a previous version compiler:codegen Generation of LLVM IR and native code labels Mar 5, 2016
@StefanKarpinski StefanKarpinski added performance Must go faster GC Garbage collector labels Mar 6, 2016

yuyichao commented Jun 9, 2016

Example from #15717

function f_simd(n::Integer)
    a = zeros(Float32, n)
    @inbounds @simd for i in eachindex(a)
        a[i] += 1
    end
    a
end

Vectorization works at the normal optimization level (on LLVM 3.8 at least), but the function still generates 8 GC roots instead of the 1 needed for a.
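
A quick way to check the slot count (a sketch; it assumes the GC frame still shows up as a single %jl_value_t* alloca at the top of the function, so the regex would need adjusting if the IR layout changes):

io = IOBuffer()
code_llvm(io, f_simd, Tuple{Int})
ir = String(take!(io))
# The GC frame is the leading alloca [N x %jl_value_t*]; N counts 2 header
# slots (frame size and previous-frame pointer) plus the root slots.
m = match(r"alloca \[(\d+) x %jl_value_t\*\]", ir)
m === nothing || println("root slots: ", parse(Int, m.captures[1]) - 2)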


yuyichao commented Jun 9, 2016

Code from #15457

julia> f2() = ccall(:jl_get_cpu_name, Ref{String}, ())
f2 (generic function with 1 method)

define %jl_value_t* @julia_f2_49667() #0 !dbg !6 {
top:
  %0 = call %jl_value_t*** @jl_get_ptls_states()
  %1 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %1, i64 0, i64 0
  %2 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %1, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %3 = bitcast [3 x %jl_value_t*]* %1 to i64*
  store i64 2, i64* %3, align 8
  %4 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %1, i64 0, i64 1
  %5 = bitcast %jl_value_t*** %0 to i64*
  %6 = load i64, i64* %5, align 8
  %7 = bitcast %jl_value_t** %4 to i64*
  store i64 %6, i64* %7, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %0, align 8
  %8 = call %jl_value_t* inttoptr (i64 140532991692000 to %jl_value_t* ()*)()
  store %jl_value_t* %8, %jl_value_t** %2, align 8
  %9 = load i64, i64* %7, align 8
  store i64 %9, i64* %5, align 8
  ret %jl_value_t* %8
}

This should need no GC roots instead of 1.


yuyichao commented Jun 9, 2016

Moved from #15402

immutable Wrap{T}
  x::T
end

function f(a::Wrap)
  s = 0.0
  for i = 1:length(a.x)
    @inbounds s += a.x[i]
  end
  s
end

define double @julia_f_49649(%jl_value_t*) #0 !dbg !6 {
top:
  %1 = call %jl_value_t*** @jl_get_ptls_states()
  %2 = alloca [4 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %4 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %5 = bitcast [4 x %jl_value_t*]* %2 to i64*
  store i64 4, i64* %5, align 8
  %6 = getelementptr [4 x %jl_value_t*], [4 x %jl_value_t*]* %2, i64 0, i64 1
  %7 = bitcast %jl_value_t*** %1 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast %jl_value_t** %6 to i64*
  store i64 %8, i64* %9, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %1, align 8
  %10 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 0, i32 0
  %11 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %11, %jl_value_t** %3, align 8
  %12 = getelementptr inbounds %jl_value_t, %jl_value_t* %11, i64 1
  %13 = bitcast %jl_value_t* %12 to i64*
  %14 = load i64, i64* %13, align 8
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L2, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %16 = bitcast %jl_value_t* %11 to double**
  %17 = load double*, double** %16, align 8
  br label %if

L.L2_crit_edge:                                   ; preds = %if
  store %jl_value_t* %11, %jl_value_t** %4, align 8
  br label %L2

L2:                                               ; preds = %L.L2_crit_edge, %top
  %s.0.lcssa = phi double [ %23, %L.L2_crit_edge ], [ 0.000000e+00, %top ]
  %18 = load i64, i64* %9, align 8
  store i64 %18, i64* %7, align 8
  ret double %s.0.lcssa

if:                                               ; preds = %if.lr.ph, %if
  %s.04 = phi double [ 0.000000e+00, %if.lr.ph ], [ %23, %if ]
  %"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %19, %if ]
  %19 = add i64 %"#temp#.03", 1
  %20 = add i64 %"#temp#.03", -1
  %21 = getelementptr double, double* %17, i64 %20
  %22 = load double, double* %21, align 8
  %23 = fadd double %s.04, %22
  %24 = icmp eq i64 %"#temp#.03", %14
  br i1 %24, label %L.L2_crit_edge, label %if
}

The inner loop is no longer affected (and if the element type is changed to an integer, or @simd is added, vectorization works too), but the function still generates a GC root where it is unnecessary.
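
For reference, a sketch of the @simd variant mentioned above (f_simd_wrap is a hypothetical name): as noted, it vectorizes, though the unnecessary rooting remains.

function f_simd_wrap(a::Wrap)
    s = 0.0
    # @simd lets LLVM vectorize the reduction; @inbounds skips bounds checks.
    @inbounds @simd for i = 1:length(a.x)
        s += a.x[i]
    end
    s
end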


tkelman commented Jun 9, 2016

Will these all likely have the same eventual fix?


yuyichao commented Jun 9, 2016

It depends. Once someone starts to work on this, we can make a checklist of the cases in this issue, or open new ones if necessary.

@yuyichao

Given that @Keno's work should be a fix-all solution, we can stick to the plan of using this issue to track all inefficient test cases. So I am copying the case from #20981 over and closing that one.

julia> type foo
           a::Vector{Int64}
       end

julia> f(x::foo) = (a = x.a; for i = 1:100; @inbounds a[i] += 1; end)
f (generic function with 1 method)

julia> code_llvm(f, Tuple{foo})

define void @julia_f_60823(i8** dereferenceable(8)) #0 !dbg !5 {
top:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-10928, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #3
  %ptls = bitcast i8* %ptls_i8 to i8****
  %1 = alloca [5 x i8**], align 8
  %.sub = getelementptr inbounds [5 x i8**], [5 x i8**]* %1, i64 0, i64 0
  %2 = getelementptr inbounds [5 x i8**], [5 x i8**]* %1, i64 0, i64 2
  %3 = getelementptr inbounds [5 x i8**], [5 x i8**]* %1, i64 0, i64 3
  %4 = getelementptr inbounds [5 x i8**], [5 x i8**]* %1, i64 0, i64 4
  %5 = bitcast [5 x i8**]* %1 to i64*
  %6 = bitcast i8*** %2 to i8*
  call void @llvm.memset.p0i8.i64(i8* %6, i8 0, i64 24, i32 8, i1 false)
  store i64 6, i64* %5, align 8
  %7 = getelementptr inbounds [5 x i8**], [5 x i8**]* %1, i64 0, i64 1
  %8 = bitcast i8* %ptls_i8 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = bitcast i8*** %7 to i64*
  store i64 %9, i64* %10, align 8
  store i8*** %.sub, i8**** %ptls, align 8
  %11 = bitcast i8** %0 to i8***
  %12 = load i8**, i8*** %11, align 8
  store i8** %12, i8*** %2, align 8
  %13 = bitcast i8** %12 to i64**
  %14 = load i64*, i64** %13, align 8
  br i1 false, label %scalar.ph, label %min.iters.checked

min.iters.checked:                                ; preds = %top
  br i1 false, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %min.iters.checked
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = or i64 %index, 1
  %15 = add nsw i64 %offset.idx, -1
  %16 = getelementptr i64, i64* %14, i64 %15
  %17 = bitcast i64* %16 to <4 x i64>*
  %wide.load = load <4 x i64>, <4 x i64>* %17, align 8
  %18 = add <4 x i64> %wide.load, <i64 1, i64 1, i64 1, i64 1>
  %19 = bitcast i64* %16 to <4 x i64>*
  store <4 x i64> %18, <4 x i64>* %19, align 8
  %index.next = add i64 %index, 4
  %20 = icmp eq i64 %index.next, 100
  br i1 %20, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  br i1 true, label %L17, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %min.iters.checked, %top
  br label %if

if:                                               ; preds = %scalar.ph, %if
  br i1 undef, label %L17, label %if

L17:                                              ; preds = %middle.block, %if
  store i8** %12, i8*** %3, align 8
  store i8** %12, i8*** %4, align 8
  %21 = load i64, i64* %10, align 8
  store i64 %21, i64* %8, align 8
  ret void
}

@StefanKarpinski

> Given that @Keno's work should be a fix-all solution, we can stick to the plan of using this issue to track all inefficient test cases.

Is the idea that his PR should fix all of these in one fell swoop, so we just collect them here to make sure they're all actually fixed?


Keno commented Jun 17, 2017

It's a pretty general solution. All remaining problems should have their own issues.


tkelman commented Jun 17, 2017

Can we regression test these somehow?
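
One possible shape for such a test (a sketch only: gc_frame_slots is a hypothetical helper built on the same IR-scraping idea as the snippet earlier in this thread, and it assumes the pre-fix IR layout where the GC frame is a single %jl_value_t* alloca):

using Test

# Hypothetical helper: count root slots by scraping the emitted IR.
function gc_frame_slots(f, types)
    ir = sprint(code_llvm, f, types)
    m = match(r"alloca \[(\d+) x %jl_value_t\*\]", ir)
    m === nothing ? 0 : parse(Int, m.captures[1]) - 2  # subtract 2 header slots
end

# f1 from the original report keeps at most 2 values live at once,
# so 2 root slots should suffice once slots are reused.
@test gc_frame_slots(f1, Tuple{}) <= 2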
