Segfault when creating a Vec{7, Float64} #1

KristofferC · 2016-02-03T08:22:40Z

First, please say if you don't think this package is ready for any issues.

I always get a segfault when loading a length 7 Vec from a Float64 array.

using SIMD
a = rand(10)

vload(Vec{4, Float64}, a, 1)
#4-element SIMD.Vec{4,Float64}:
#Float64<0.786751,0.266775,0.411642,0.732678>

vload(Vec{5, Float64}, a, 1)
#5-element SIMD.Vec{5,Float64}:
#Float64<0.786751,0.266775,0.411642,0.732678,0.17212>

vload(Vec{6, Float64}, a, 1)
#6-element SIMD.Vec{6,Float64}:
#Float64<0.786751,0.266775,0.411642,0.732678,0.17212,0.524186>

vload(Vec{7, Float64}, a, 1)

#signal (11): Segmentation fault
#while loading no file, in expression starting on line 0
#unknown function (ip: 0x7f91e836f5f9)
#unknown function (ip: 0x32de7c0)
#Segmentation fault (core dumped)

Also, is ArchRobinsons recent patch to LLVM needed for this package to properly vectorize?

The text was updated successfully, but these errors were encountered:

eschnett · 2016-02-04T23:06:43Z

@KristofferC The package is certainly ready for issues. I'm just not sure how much I can do in this case.

I can reproduce the problem locally. This seems to be a segfault in LLVM (although technically, it might also be Julia's fault). The input I'm giving to LLVM is a string, containing instructions in the LLVM language http://llvm.org/docs/LangRef.html. I would expect that LLVM either reports an error, or knows how to interpret the instruction. Here, you seem to have come across a bug in LLVM.

Most CPU architectures support native vector widths that are a power of two. Thus, 7 is a strange number, and that might explain why this hasn't been corrected yet. LLVM should (depending on your CPU) translate this into 4 vectors of size 2, or into 2 vectors of size 4, where one vector lane remains unused.

I don't know yet what a good work-around would be. Can you restrict your code to vector sizes that are powers of 2? What if you introduce "unused" values? Or maybe the SIMD package should do this.

I would also like to submit a bug report to LLVM. For this I'm going to need the LLVM code that Julia actually passes to LLVM (which is different from what SIMD passes to Julia). I'll have to investigate how to get this.

No, ArchRobinson's patch is not necessary to generate vectorized code. It does, however, improve code quality (and presumably performance) in certain cases. I plan to submit a PR to Julia with this patch.

KristofferC · 2016-02-05T11:01:48Z

I don't know if it helps but just calling @code_native vload(Vec{7, Float64}, a, 1) segfaults which I guess means that it isn't the execution of the code that is causing the segfault.

KristofferC · 2016-02-05T11:04:23Z

This is from GDB...

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6ecca29 in llvm::DAGTypeLegalizer::GenWidenVectorLoads(llvm::SmallVectorImpl<llvm::SDValue>&, llvm::LoadSDNode*) () from /home/kristoffer/julia/usr/bin/../lib/libjulia.so

KristofferC · 2016-02-05T13:40:24Z

Just fyi, I don't get the example to vectorize on my julia installation.

The generated LLVM is: https://gist.github.com/KristofferC/3561adc06fd328812293

Looks like one problem is the type stability of vload

julia> @code_warntype vadd!(a, b)
Variables:
  #self#::#vadd!
  xs::Array{Float32,1}
  ys::Array{Float32,1}
  N::Int64
  #s22::Int64
  i::Int64
  yv::Any
  xv::Any

eschnett · 2016-02-05T18:24:09Z

The segfaults happens in LLVM, as it is translating the LLVM code to machine code. This points to a bug in LLVM. The function name GenWidenVectorLoads indicates that it tries to widen the vector width, presumably from 7 to 8.

In your statement @code_llvm vadd!(a, b), how are a and b defined? If they don't have the right types, then @code_llvm won't be able to choose the correct function. You are probably seeing a very generic function that first checks the types of a and b, and then dispatches. This has nothing to do with vload.

Here is how you use @code_llvm (or @code_native, which I would prefer in this case):

julia> using SIMD

julia> function vadd!{N,T}(xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}})
           @assert length(ys) == length(xs)
           @assert length(xs) % N == 0
           @inbounds for i in 1:N:length(xs)
               xv = vload(Vec{N,T}, xs, i)
               yv = vload(Vec{N,T}, ys, i)
               xv += yv
               vstore(xv, xs, i)
           end
       end
vadd! (generic function with 1 method)

julia> @code_llvm vadd!(Vector{Float64}(), Vector{Float64}(), Vec{4,Float64})

julia> @code_native vadd!(Vector{Float64}(), Vector{Float64}(), Vec{4,Float64})

In the output of @code_llvm, you will see a basic block such as

if30:                                             ; preds = %L10
  %54 = add i64 %"#s308.0", %41
  %55 = load i64, i64* %50, align 8
  %56 = shl i64 %"#s308.0", 3
  %57 = add i64 %56, -8
  %58 = add i64 %57, %55
  %ptr.i = inttoptr i64 %58 to <4 x double>*
  %res.i = load <4 x double>, <4 x double>* %ptr.i, align 8
  %59 = load i64, i64* %51, align 8
  %60 = add i64 %57, %59
  %ptr.i.39 = inttoptr i64 %60 to <4 x double>*
  %res.i.40 = load <4 x double>, <4 x double>* %ptr.i.39, align 8
  %res.i.49 = fadd <4 x double> %res.i, %res.i.40
  store <4 x double> %res.i.49, <4 x double>* %ptr.i, align 8
  %61 = icmp eq i64 %54, %47
  br i1 %61, label %L16.loopexit, label %L4

where the <4 x double> indicate vector instructions.

Similarly, in the output of @code_native, you might see code such as

Source line: 1185
L208:
    cmpq    $1, %rsi
    jl  L296
    movq    8(%rbx), %rdx
    addq    $-3, %rdx
    cmpq    %rdx, %rsi
    jg  L296
Source line: 27
    addq    %r9, %rsi
Source line: 458
    movq    (%rbx), %rdx
Source line: 1090
    vmovupd (%rax,%rdx), %ymm0
Source line: 458
    movq    (%r12), %rcx
Source line: 565
    vaddpd  (%rax,%rcx), %ymm0, %ymm0
Source line: 1176
    vmovupd %ymm0, (%rax,%rdx)
    addq    %rdi, %rax
    cmpq    %rsi, %r8
    jne L208

where the vaddpd is the translation of the fadd <4 x double> above.

KristofferC · 2016-02-05T18:38:55Z

Sorry, my a and b was both Vector{Float32} and this is correctly inferred (as can be seen from @code_warntype:

  xs::Array{Float32,1}
  ys::Array{Float32,1}

)

I tried with Vector{Float64} as well and I still did not see any vector blocks or vector instructions in the LLVM code nor any packed instructions in the assembly.

I however only tried with the exact function from the README. I will try your version where you specify the Type of the Vec in the function argument tomorrow.

eschnett · 2016-02-05T20:29:28Z

I see. I have some local changes in my repository -- maybe they corrected an error that I didn't see.

KristofferC · 2016-02-06T20:06:11Z

With the function you just posted I get the vectorized instructions. With the one in the README, I don't (because of some type instability indicated by @code_warntype I presume).

KristofferC · 2016-02-06T20:54:05Z

If you are interesting, I did some battle testing with SIMD.jl: JuliaDiff/ForwardDiff.jl#98 (comment)

The generated code contains packed instructions but is unfortunately a factor 2x slower than the old NTuple code. It might have to do with my implementations at JuliaDiff/ForwardDiff.jl@3fcd796#diff-8e9d15484d2c994e12d78a9a7c2c8201R122

It seems that multiplication of a Vec with a scalar returns an Array which is why I used the SIMD.create functions.

julia> Vec{3,Float32}((1.0,2.0,2.0))*3
3-element Array{Float32,1}:
 3.0
 6.0
 6.0

eschnett · 2016-02-06T21:01:35Z

Thanks for pointing me to the README; I was looking at the test suite, which contains a similar function. Yes, the function in the README is too clever for Julia's type inference mechanism. I have corrected it.

eschnett · 2016-02-06T21:02:57Z

Mixed scalar/vector operations are not yet implemented. It is currently necessary to explicitly convert the scalar to a vector first. In your case, Julia apparently realizes that a Vec is iterable, and then iterates over its values, multiplying by a scalar, and collecting the result in an array. Surprising, and not useful.

I will add the missing type promotion rules.

eschnett · 2016-02-06T21:05:52Z

Also: Kudos for moving into benchmarking. I'm still at the "getting things to work" stage, albeit with masked vector load/store operations that are probably not relevant to you.

KristofferC · 2016-02-07T11:24:04Z

I looked a bit closer at the generated code for the function:

function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end

The T here is either of a type similar to:

immutable Ttuple{N, T}
    val::T
    partials::NTuple{N, T}
end

or

immutable TVec{N, T}
    val::T
    partials::Vec{N, T}
end

where + has been defined similar to

+(a::TVec, b::Tvec) = TVec(a.val + b.val, a.partials + b.partials)

I ran the code with N = 4, T = Float64. For the NTuple version there is simply 5 adds:

vaddsd  8(%rcx), %xmm2, %xmm2
vaddsd  16(%rcx), %xmm0, %xmm0
vaddsd  24(%rcx), %xmm4, %xmm4
vaddsd  32(%rcx), %xmm3, %xmm3
vaddsd  (%rcx), %xmm1, %xmm1

while the Vec version does

    vunpcklpd   %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
    vunpcklpd   %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm3, %ymm1, %ymm1
    vmovupd 8(%rcx), %xmm2
    vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
    vaddpd  %ymm2, %ymm1, %ymm1
    vpermilpd   $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
    vextractf128    $1, %ymm1, %xmm3
    vpermilpd   $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
    vaddsd  (%rcx), %xmm0, %xmm0

I am not good at all at assembly but it seems to arrange the data into the AVX registers ymm1 and ymm2 and then does a packed add on these values and then finish off with one single add for the val field.

Unfortunately in this case it seems to be about 2 * slower than the NTuple version.

eschnett · 2016-02-07T15:03:48Z

@KristofferC The kind of code you show improved very much for me when using Arch Robison's second patch. Without the patch, the packing and unpacking you see makes vectorization useless. Are you using the patch?

KristofferC · 2016-02-07T15:17:00Z

No, I am not. I will try with it! Thanks for being so responsive :)

@ArchRobison

Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.

@ArchRobison

Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.

KristofferC · 2016-02-07T16:42:22Z

Hm, I did a new clone of julia, checked out your recent PR and built everything and I still get the packing / unpacking into the scalar registers.

If you want to reproduce:

function simd_sum{T}(x::Vector{T})
   s = SIMD.create(T, 0.0)
   @simd for i in eachindex(x)
       @inbounds s = s + x[i]
   end
   return s
end

vec = [Vec{4, Float64}((rand(), rand(), rand(), rand())) for i in 1:500]

@code_native simd_sum(vec)

Edit: Changed the repo so that it only uses SIMD.jl and not ForwardDiff.

eschnett · 2016-02-07T17:37:44Z

The LLVM code for the loop kernel is:

L:                                                ; preds = %L.preheader, %L
  %s.sroa.16.0 = phi double [ %res_va_elem3.i, %L ], [ 0.000000e+00, %L.preheader ]
  %s.sroa.12.0 = phi double [ %res_va_elem2.i, %L ], [ 0.000000e+00, %L.preheader ]
  %s.sroa.8.0 = phi double [ %res_va_elem1.i, %L ], [ 0.000000e+00, %L.preheader ]
  %s.sroa.4.0 = phi double [ %res_va_elem0.i, %L ], [ 0.000000e+00, %L.preheader ]
  %s.sroa.0.0 = phi double [ %17, %L ], [ 0.000000e+00, %L.preheader ]
  %"#s5.0" = phi i64 [ %8, %L ], [ 1, %L.preheader ]
  %8 = add i64 %"#s5.0", 1
  %9 = add i64 %"#s5.0", -1
  %10 = getelementptr %GradientNumber, %GradientNumber* %7, i64 %9
  %11 = load %GradientNumber, %GradientNumber* %10, align 8
  %12 = extractvalue %GradientNumber %11, 0
  %13 = extractvalue %GradientNumber %11, 1, 0, 0, 0
  %14 = extractvalue %GradientNumber %11, 1, 0, 0, 1
  %15 = extractvalue %GradientNumber %11, 1, 0, 0, 2
  %16 = extractvalue %GradientNumber %11, 1, 0, 0, 3
  %arg1_iter0.i = insertelement <4 x double> undef, double %s.sroa.4.0, i32 0
  %arg1_iter1.i = insertelement <4 x double> %arg1_iter0.i, double %s.sroa.8.0, i32 1
  %arg1_iter2.i = insertelement <4 x double> %arg1_iter1.i, double %s.sroa.12.0, i32 2
  %arg1.i = insertelement <4 x double> %arg1_iter2.i, double %s.sroa.16.0, i32 3
  %arg2_iter0.i = insertelement <4 x double> undef, double %13, i32 0
  %arg2_iter1.i = insertelement <4 x double> %arg2_iter0.i, double %14, i32 1
  %arg2_iter2.i = insertelement <4 x double> %arg2_iter1.i, double %15, i32 2
  %arg2.i = insertelement <4 x double> %arg2_iter2.i, double %16, i32 3
  %res.i = fadd <4 x double> %arg1.i, %arg2.i
  %res_va_elem0.i = extractelement <4 x double> %res.i, i32 0
  %res_va_elem1.i = extractelement <4 x double> %res.i, i32 1
  %res_va_elem2.i = extractelement <4 x double> %res.i, i32 2
  %res_va_elem3.i = extractelement <4 x double> %res.i, i32 3
  %17 = fadd double %s.sroa.0.0, %12
  %18 = icmp eq i64 %"#s5.0", %4
  br i1 %18, label %L2.loopexit, label %L

The code looks reasonable, apart from the usual vector/array conversions that Arch Robison's patch should delete during code generation. Compared to other loops, the vector is here nested inside another structure, hence the multi-layer extractvalue statements.

I believe we need a stronger version of @ArchRobison's D14260 patch.

An alternative (I don't know how difficult to implement, though) would be to explicitly add LLVM's vector types to Julia, e.g. as a new version of NTuple or bitstype.

KristofferC · 2016-02-07T17:45:44Z

I edited my post above to only include a raw vector of Vecs. and I still get the

vunpcklpd   %xmm3, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm3[0]
vunpcklpd   %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
vinsertf128 $1, %xmm2, %ymm0, %ymm0
vmovupd (%rcx), %xmm1
vinsertf128 $1, 16(%rcx), %ymm1, %ymm1
vaddpd  %ymm1, %ymm0, %ymm0
vpermilpd   $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vextractf128    $1, %ymm0, %xmm2
vpermilpd   $1, %xmm2, %xmm3 # xmm3 = xmm2[1,0]

Do you get the same?

Maybe it is the same thing since the Vec is inside a Vector.

eschnett · 2016-02-08T02:45:18Z

For your edited post I get

Source line: 574
L64:
    vaddpd  (%rcx), %ymm0, %ymm0
Source line: 75
    addq    $32, %rcx
    addq    $-1, %rax
    jne L64

In the future, can you just create a new post instead of editing an old one? This makes it possible for me to look at your old example again, which is now kind of difficult for me since I didn't save it.

eschnett · 2016-02-08T02:46:06Z

To reiterate: I have D14260 applied to my LLVM tree.

@ArchRobison

Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.

KristofferC · 2016-02-08T08:24:10Z

In the future, can you just create a new post instead of editing an old one?

Yes, of course. My original example was:

Pkg.checkout("ForwardDiff", "kc/simd")
using ForwardDiff
using SIMD

function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end

vec = [ForwardDiff.GradientNumber{4, Float64, Vec{4, Float64}}(1.0, ForwardDiff.Partials(Vec{4, Float64}((rand(), rand(), rand(), rand())))) for i in 1:500]

@code_native simd_sum(vec)

Maybe the patch didn't get applied for me. I will look into it further,

KristofferC · 2016-02-08T10:06:00Z

Yes, I failed somehow and the patch didn't apply at my last test. Now it does and for the Vec array I get beautifully generated code but not for the GradientNumber case. At least now we are at the same point :)

Hm, maybe Archs patch can be tweaked for the GradientNumber example to work but it seems a bit fragile. I wonder how much work it would be to add the LLVM vector in julia and if that would make things easier.

eschnett · 2016-02-08T15:40:53Z

Patch D14260 is now part of Julia'a master branch.

Yes, I am also worried about the fragility here.

ArchRobison · 2016-02-08T15:43:59Z

I'll take a look at GradientNumber example.

ArchRobison · 2016-02-09T21:29:56Z

What do I need to do to reproduce the GradientNumber example? I got stuck at the first step:

julia> Pkg.checkout("ForwardDiff", "kc/simd")
ERROR: Base.Pkg.PkgError("ForwardDiff is not a git repo")

KristofferC · 2016-02-09T21:37:32Z

You need to Pkg.add("ForwardDiff") first I believe.

KristofferC · 2016-02-09T21:51:26Z

If you want to get rid of the external ForwardDiff dependency you can use this which should generate the same code (might need to update SIMD.jl):

using SIMD

immutable GradientNumber{N, T}
    val::T
    partials::Vec{N,T}
end

Base.zero{N,T}(::Type{GradientNumber{N, T}}) = GradientNumber(zero(T), zero(Vec{N,T}))

function Base.(:+){N,T}(g1::GradientNumber{N, T}, g2::GradientNumber{N, T})
    return GradientNumber(g1.val + g2.val, g1.partials + g2.partials)
end

function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end

code_llvm(simd_sum, (Vector{GradientNumber{4, Float64}},))
code_native(simd_sum, (Vector{GradientNumber{4, Float64}},))

ArchRobison · 2016-02-09T21:54:06Z

Thanks. Examples with fewer dependencies always appreciated. Now I'm stuck on:

julia> using SIMD
ERROR: ArgumentError: SIMD not found in path.

I also tried:

julia> Pkg.add("SIMD")
ERROR: Base.Pkg.PkgError("unknown package SIMD")

Evidently I'm missing some context. I'm working off Julia sources pulled on Monday.

KristofferC · 2016-02-09T21:58:53Z

Sorry, I should have been more thorough. You need this repo that we are talking in but it is not registered, that is Pkg.clone("https://github.com/eschnett/SIMD.jl")

KristofferC · 2016-02-09T22:10:20Z

For some more context, the problem is that the code I just posted above have the unnecessary packing and unpacking into xmm registers while a Vector of only Vecs have good code, c.f:

code_native(simd_sum, (Vector{Vec{4, Float64}},))

ArchRobison · 2016-02-09T23:28:58Z

Using Pkg.clone("https://github.com/eschnett/SIMD.jl") worked (i.e. no trailing slash in URL).

Your example is nice because it exposes several issues where the current pattern matching designed to help SLP vectorizer deal with Julia tuples is not enough to handle what's coming out of SIMD.jl. There's somewhat a design mismatch in that SLP vectorizer (and its assists) presume that its dealing with scalar code, whereas SIMD.jl is partially vectorizing code. I'm wondering whether if completely relying on SLP vectorizer (with patch D14185 applied would vectorize the code, or get struck too. There's both phi-function and type-nesting trickiness to deal with.

Maybe for packages already engaging in llvmcall machinations it would make sense to have a way to tell the compiler to use a LLVM vector. But I'm not ready to give up quite yet.

eschnett · 2016-02-18T15:36:08Z

@ArchRobison Did you have time to look into this issue?

I've looked at Julia to see where the decision regarding LLVM structures, arrays, and vectors is made. I can think of two ways to introduce LLVM vectors to Julia:

Introduce a new type (similar to tuples) that is always translated to a vector, based on the type's name
Somehow "flag" certain types, and if so, translate them to a vector instead of an array.

I don't know how difficult it would be to introduce a new tuple-like type to Julia. It seems overkill.

I don't know how robustly one can "flag" a type in Julia. For testing, looking at the type's name would suffice for me.

I think LLVM vectors are supposed to be more strictly aligned than structures or arrays. In the code I generated, I can circumvent this by specifying a weaker alignment. Is there other code that is also relevant? For example, would Julia automatically generate code to access or copy these vectors that then might break due to alignment restrictions?

Any advice you have would be help.

ArchRobison · 2016-02-18T20:11:04Z

I pondered it a bit. If LLVM had a cast operation from vector to aggregate types, and vice-versa, we wouldn't have a problem. Adding LLVM intrinsics for the casts might work, but that's a pain to maintain for each target machine.

Lacking a cast, my approach had been to recognize the idioms that perform such casts, e.g. extracting each element and inserting it into an aggregate. The trouble with the approach is that it depends on being able to rewrite the idiom. In the context of load or store, that's easy -- just do pointer type punning. But the GradientNumber example points out that there are two other important contexts:

phi nodes -- these seem to be doable by making my existing patch look through phi nodes.
insertvalue/extractvalue -- this is where I'm stuck. If an extracted value is an aggregate, and needs to be converted to a vector, there's no obvious way to do so in LLVM other than the component-by-component extract/insert sequence, or via a store/load sequence.

Looking back though the definition chains for where an outer aggregate was loaded from won't help either, because type-punning the outer aggregate to a different aggregate type won't always work, since changing a component from aggregate to vector might change it's alignment. E.g., GradientFloat{4,Float32} occupies 20 bytes if represented as Julia currently does, but would occupy 32 bytes if an LLVM vector is used for the tuple (i.e. 4 bytes for val; 12 bytes for pad; 16 bytes for partials). We could consider breaking said load up into component loads when emitting the LLVM code. But I worry that the trickiness seems counter to the goal of "do what I said" vectorization.

I would dread complicating the fundamental Julia type system for this use case, both from the viewpoint of teaching users and complicating type inference. The flagged type approach seems worth examining. The flagged type needs a name (because just using a tuple won't do since it could be confused with ordinary tuples). At the Julia level, the flagged type could be an immutable datatype with a single field that is a homogeneous tuple. That gives it a name to distinguish it while staying within the current Julia type system. The flag would indicate that the field should be mapped to an LLVM vector. Generation of LLVM code gets trickier, but I had most of that figured out for an earlier patch that I had abandoned. The flag would need to live in _jl_datatype_t.

I'm not sure what the best approach for the Julia syntax would be. Maybe have an intrinsic type constructor? Or a macro that operates on a stylized immutable declaration?

eschnett · 2016-02-19T18:48:24Z

I experimented with a setup such as you suggested in https://github.com/eschnett/julia/tree/eschnett/vector-types. I'm simply keying the flag on the type name (Vec2) at the moment.

I tried putting the flag into the surrounding immutable type, not into the contained tuple type. Is this a good idea? This means that I need to pass a flag createvector around when calling julia_type_to_llvm and julia_struct_to_llvm.

Also, I didn't manage to observe any change in behaviour. I would have expected Julia to access the tuple as <4 * double> instead of [4 * double], or that LLVM reports an error when Julia uses [4 * double], but code generation didn't change at all. Maybe Julia casts between pointer types, and I need to update code generation as well, as you hint above.

I didn't look at struct layout, offsets, or alignments either.

ArchRobison · 2016-02-19T20:58:59Z

I was initially thinking the flag should go on the surrounding immutable type, but as you point out, that necessitates passing the flag as a parameter in places. (I ended up in the same boat when trying to automatically lay out some tuples as LLVM vectors.) I was worried about putting the flag on the tuple type itself because then we would have two tuple types that differ only by the hidden flag and confuse reflection. But maybe that works out right? They are different types and the internals of Vec are essentially private anyway.

I recall there is lots of pointer type-punning in the generation of LLVM code, so it's possible the punning undid your changes. (LLVM is moving from typed pointers to typed loads/stores anyway; then the punning will be free.)

I'm wondering about trying an inverted flagging scheme, like this:

immutable VectorElement{T}  # The flagged type.  Could be flagged by name.
    x::T
end

typealias Vec{N,T} NTuple{N,VectorElement{T}}

The inversion would give us kosher Julia types that are clearly distinguished from ordinary tuples. I think it would avoid the flag-passing hassle. We would need to modify julia_type_to_llvm and julia_struct_to_llvm to do two ad-hoc mappings:

Map a VectorElement{T} onto a T.
Map a homogeneous tuple of VectorElement{T} onto an LLVM vector of T
We'd also have to make emit_getfield and emit_new_struct deal with the first ad-hoc mapping.

ArchRobison · 2016-04-28T16:55:39Z

FYI, the LLVM patches (D14185 and D14260) were finally committed to LLVM trunk, but are probably less interesting now that we're going the tuple of vecelement route.

Update README.md

eschnett mentioned this issue Feb 7, 2016

Add LLVM patch D14260 to improve SIMD code JuliaLang/julia#14976

Merged

KristofferC mentioned this issue Feb 9, 2016

Giant rewrite for Julia v0.5 JuliaDiff/ForwardDiff.jl#102

Merged

5 tasks

ArchRobison mentioned this issue Feb 25, 2016

Inverted marking of vector tuple type JuliaLang/julia#15244

Merged

KristofferC closed this as completed Apr 7, 2016

eschnett pushed a commit that referenced this issue Aug 1, 2019

Merge pull request #1 from nlw0/nlw0-patch-1

86c895f

Update README.md

KristofferC mentioned this issue Jan 23, 2020

llvmcall errors for some Vec sizes #61

Closed

eschnett mentioned this issue Jan 23, 2020

jl_special_vector_alignment: Remove work-around for old LLVM versions JuliaLang/julia#34490

Merged

KristofferC pushed a commit that referenced this issue Mar 31, 2020

add fast math options to intrinsics and hook into fastmath macro (#1)

bdfd585

Segfault when creating a Vec{7, Float64} #1

Segfault when creating a Vec{7, Float64} #1

Comments

KristofferC commented Feb 3, 2016

eschnett commented Feb 4, 2016

KristofferC commented Feb 5, 2016

KristofferC commented Feb 5, 2016

KristofferC commented Feb 5, 2016

eschnett commented Feb 5, 2016

KristofferC commented Feb 5, 2016

eschnett commented Feb 5, 2016

KristofferC commented Feb 6, 2016

KristofferC commented Feb 6, 2016

eschnett commented Feb 6, 2016

eschnett commented Feb 6, 2016

eschnett commented Feb 6, 2016

KristofferC commented Feb 7, 2016

eschnett commented Feb 7, 2016

KristofferC commented Feb 7, 2016

KristofferC commented Feb 7, 2016

eschnett commented Feb 7, 2016

KristofferC commented Feb 7, 2016

eschnett commented Feb 8, 2016

eschnett commented Feb 8, 2016

KristofferC commented Feb 8, 2016

KristofferC commented Feb 8, 2016

eschnett commented Feb 8, 2016

ArchRobison commented Feb 8, 2016

ArchRobison commented Feb 9, 2016

KristofferC commented Feb 9, 2016

KristofferC commented Feb 9, 2016

ArchRobison commented Feb 9, 2016

KristofferC commented Feb 9, 2016

KristofferC commented Feb 9, 2016

ArchRobison commented Feb 9, 2016

eschnett commented Feb 18, 2016

ArchRobison commented Feb 18, 2016

eschnett commented Feb 19, 2016

ArchRobison commented Feb 19, 2016

ArchRobison commented Apr 28, 2016