
CartesianIndex and @turbo loop #359

Open · klowrey opened this issue Nov 10, 2021 · 12 comments

klowrey commented Nov 10, 2021

LV has been awesome for effortlessly speeding up code! Thanks!
I've recently run into some errors I don't know enough about to fix; they are reproduced below with version v0.12.96. They're modifications of the image filtering example code. The errors currently seem too specific for me to address myself, so I was hoping for help, or to report a bug if there is one.

In short, my project involves applying kernels at sparse, specific points rather than at every index of a matrix, so convolving the kernel only at specific `CartesianIndex`es seems elegant. The `outerr` case seems to take offence at constructing the `CartesianIndex` itself, while the `inerr` case is beyond me but may be related; maybe @turbo doesn't like the `CartesianIndex` being dynamically created? The nominal image filtering example works without problems. In a previous version of LV, I had a block of code similar to the `inerr` function below working.

using LoopVectorization, OffsetArrays
using Images # for Images.Kernel.gaussian below

x = rand(10,10)
kern = Images.Kernel.gaussian((1, 1), (3, 3))

function outerr(x, kern)
    r = size(x,1)
    s = zero(eltype(x))
    @turbo for i=2:r-1
        idx = CartesianIndex(i,i)
        for K in CartesianIndices(kern)
            s += x[idx+K] + kern[K]
        end
    end
    s
end

function inerr(x, kern)
    r = size(x,1)
    s = zero(eltype(x))
    for i=2:r-1
        idx = CartesianIndex(i,i)
        @turbo for K in CartesianIndices(kern)
            s += x[idx+K] + kern[K]
        end
    end
    s
end

julia> outerr(x, kern)
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
 [1] fieldcount(t::Any)
   @ Base ./reflection.jl:764
 [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, #unused#::Type{TypeVar})
   @ LoopVectorization ~/.julia/packages/LoopVectorization/umAIq/src/condense_loopset.jl:12
 [3] _append_fields!(t::Expr, body::Expr, sym::Symbol, #unused#::Type{UnionAll}) (repeats 3 times)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/umAIq/src/condense_loopset.jl:19
 [4] #s166#66
   @ ~/.julia/packages/LoopVectorization/umAIq/src/condense_loopset.jl:34 [inlined]
 [5] var"#s166#66"(T::Any, ::Any, r::Any)
   @ LoopVectorization ./none:0
 [6] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
   @ Core ./boot.jl:580
 [7] outerr(x::Matrix{Float64}, kern::OffsetMatrix{Float64, Matrix{Float64}})
   @ Main /tmp/lv.jl:11
 [8] top-level scope
   @ REPL[4]:1

julia> inerr(x, kern)
ERROR: MethodError: no method matching _gesp(::LayoutPointers.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}, ::Static.StaticInt{1}, ::CartesianIndex{2}, ::Static.StaticInt{1})
Closest candidates are:
  _gesp(::LayoutPointers.FastRange, ::Static.StaticInt{1}, ::Any, ::Static.StaticInt{1}) at ~/.julia/packages/LoopVectorization/umAIq/src/vectorizationbase_compat/subsetview.jl:38
  _gesp(::LayoutPointers.AbstractStridedPointer{T, N, C, B, R, X, O} where {C, B, R, X<:Tuple{Vararg{Integer, N}}, O<:Tuple{Vararg{Integer, N}}}, ::Static.StaticInt{I}, ::Integer, ::Static.StaticInt{D}) where {I, N, T, D} at ~/.julia/packages/LoopVectorization/umAIq/src/vectorizationbase_compat/subsetview.jl:39
Stacktrace:
 [1] inerr(x::Matrix{Float64}, kern::OffsetMatrix{Float64, Matrix{Float64}})
   @ Main /tmp/lv.jl:25
 [2] top-level scope
   @ REPL[5]:

chriselrod (Member) commented

> maybe @turbo doesn't like the CartesianIndex being dynamically created?

Yes. @turbo actually doesn't like CartesianIndex at all.
It has special handling for CartesianIndices, where it turns CartesianIndices{N} into N separate loops, thereby avoiding CartesianIndex altogether.
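
For example, the supported pattern looks like this (a minimal sketch, not code from this issue):

using LoopVectorization

function sum_turbo(A)
    s = zero(eltype(A))
    @turbo for I in CartesianIndices(A) # @turbo splits this into N integer loops
        s += A[I]
    end
    s
end

Because the iterator itself is a CartesianIndices, the loop body never has to materialize a CartesianIndex value, which is why this works while `idx = CartesianIndex(i, i)` does not.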

If someone is interested in fixing this, I can provide instructions/guidance.

Otherwise, this will have to wait for the LoopVectorization rewrite, which I'm trying to prioritize at the moment.
Thus, I wouldn't expect it to be solved for the next half year or more; I'm expecting it to be a while.

klowrey (Author) commented Nov 10, 2021

Without using CartesianIndices, we can opt for plain loops instead. I experimented with summing at different levels of the loop nest and came across what might be a bug in the `ininsum` function below. The functions differ mainly in where the sums are accumulated; `ininsum` accumulates at two different levels, which results in a total (outer-loop) sum of 0.0 when used with @turbo. Small differences between the total sums are probably due to fastmath, but they can be surprisingly large sometimes.

using LoopVectorization, OffsetArrays, BenchmarkTools
using Images # for Images.Kernel.gaussian below

x = rand(100,100)
kern = Images.Kernel.gaussian((1, 1), (3, 3))

function noturbosum(x, kern)
    ks = zero(eltype(x))
    @inbounds for i=2:size(x,1)-1
        for j in axes(kern,1), m in axes(kern,2)
            ks += x[i+j, i+m] * kern[j,m]
        end
    end
    ks
end

function outersum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        for j in axes(kern,1), m in axes(kern,2)
            ks += x[i+j, i+m] * kern[j,m]
        end
    end
    ks
end

function insum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        s1 = zero(eltype(x))
        for j in axes(kern,1), m in axes(kern,2)
            s1 += x[i+j, i+m] * kern[j,m]
        end
        ks += s1
    end
    ks
end

function in2sum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        for j in axes(kern,1)
            s0 = zero(eltype(x))
            for m in axes(kern,2)
                s0 += x[i+j, i+m] * kern[j,m]
            end
            ks += s0
        end
    end
    ks
end

function ininsum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        s1 = zero(eltype(x))
        for j in axes(kern,1)
            s0 = zero(eltype(x))
            for m in axes(kern,2)
                s0 += x[i+j, i+m] * kern[j,m]
            end
            s1 += s0
        end
        ks += s1
    end
    ks
end

println("outer ", noturbosum(x, kern))
println("outer ", outersum(x, kern))
println("in    ", insum(x, kern))
println("in2   ", in2sum(x, kern))
println("inin  ", ininsum(x, kern))

display(@btime outersum($x, $kern))
display(@btime insum($x, $kern))
display(@btime in2sum($x, $kern))
display(@btime ininsum($x, $kern))

noturbo 49.06428455119616
outer   46.612923559459894
in      46.447873809237635
in2     46.447873809237635
inin    0.0
  3.379 μs (0 allocations: 0 bytes)
46.612923559459894
  302.369 ns (0 allocations: 0 bytes)
46.447873809237635
  168.696 ns (0 allocations: 0 bytes)
46.447873809237635
  23.303 ns (0 allocations: 0 bytes)
0.0

I probably can't dedicate time to fixing CartesianIndices, but I would love instructions/guidance on doing SIMD with dual numbers (I noticed the https://github.com/JuliaSIMD/SIMDDualNumbers.jl package and was trying to grok it). For the functions I'm writing, I've found that manually using dual numbers from ForwardDiff.jl rather than going through their gradient/derivative function interface is much more performant, but I'm not sure how to use them with SIMD/LV, etc., and would love guidance!

chriselrod added a commit that referenced this issue Nov 10, 2021

chriselrod (Member) commented

> Small differences between the total sums are probably due to fastmath, but they can be surprisingly large sometimes.

Differences that big are definitely bugs.
The bug in the first three was straightforward to fix, so I did.
`ininsum` is trickier, so I'll leave that one to the rewrite.
Adding @fastmath to noturbosum, I get:

julia> display(@btime noturbosum_nofm($x, $kern)) # no `@fastmath`, no `@turbo`
  965.667 ns (0 allocations: 0 bytes)
225.36596301915785

julia> display(@btime noturbosum($x, $kern)) # @fastmath, no @turbo
  991.838 ns (0 allocations: 0 bytes)
225.36596301915785

julia> display(@btime outersum($x, $kern))
  224.695 ns (0 allocations: 0 bytes)
225.365963019158

julia> display(@btime insum($x, $kern))
  378.662 ns (0 allocations: 0 bytes)
225.36596301915802

julia> display(@btime in2sum($x, $kern))
  273.545 ns (0 allocations: 0 bytes)
225.36596301915804

julia> display(@btime ininsum($x, $kern))
  125.409 ns (6 allocations: 304 bytes)
0.0

> For the functions I'm writing, I've found that manually using dual numbers from ForwardDiff.jl rather than going through their gradient/derivative function interface is much more performant, but I'm not sure how to use them with SIMD/LV, etc., and would love guidance!

You should be able to solve multiple ForwardDiff.gradients in parallel. You can see some examples here:
https://discourse.julialang.org/t/making-a-fast-evaluation-of-a-function-and-its-gradient/68884

Something I'd like to get around to eventually is supporting `VectorizationBase.Vec`s for dual-number partials, to make sure they SIMD.
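
For reference, "manually using dual numbers" on a scalar function looks like this (a minimal sketch, independent of LV; `f` is just an example function):

using ForwardDiff

f(x) = x^2 + sin(x)

d = ForwardDiff.Dual(2.0, 1.0) # value 2.0, derivative seed 1.0
y = f(d)
ForwardDiff.value(y)       # f(2.0)
ForwardDiff.partials(y, 1) # f'(2.0) == 2 * 2.0 + cos(2.0)

Getting such duals to SIMD through @turbo is the part that would need the `VectorizationBase.Vec` support mentioned above.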

svretina (Contributor) commented

Sorry to revive this, but is it possible to give a hint about what is causing the first error message?

ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace: [etc]

I have a function with a single for loop which I want to call in a nested way (abbreviated here; the full version appears in the MWE below):

@inline function integrate_avx(args) 
    @turbo for i in 1:length(x)
        # stuff
        s += w[i] * (f(xm) + f(xp)) # a reduction
    end
    return s
end

Then I want to write my 2D function based on the 1D one, namely:

function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        res = integrate_avx(g1, xmin[2], xmax[2], x, w, h)
        return res
    end
    g2(x1::T) where {T} = f1(x1)
    res = integrate_avx(g2, xmin[1], xmax[1], x, w, h)
    return res
end

This gives me

ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
  [1] fieldcount(t::Any)
    @ Base ./reflection.jl:895
  [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Core.Box})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:12
  [3] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{ScalarWave.TanhSinhQuadrature.var"#f1#104"{var"#77#78", SVector{2, Float64}, SVector{2, Float64}, Vector{Float64}, Vector{Float64}, Float64}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [4] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{ScalarWave.TanhSinhQuadrature.var"#g2#106"{ScalarWave.TanhSinhQuadrature.var"#f1#104"{var"#77#78", SVector{…}, SVector{…}, Vector{…}, Vector{…}, Float64}}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [5] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Tuple{LayoutPointers.GroupedStridedPointers{…}, Float64, Float64, ScalarWave.TanhSinhQuadrature.var"#g2#106"{…}, ScalarWave.TanhSinhQuadrature.var"#g2#106"{…}, DataType}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [6] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Tuple{Tuple{…}, Tuple{…}}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [7] #s223#66
    @ ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:34 [inlined]
  [8] var"#s223#66"(T::Any, ::Any, r::Any)
    @ LoopVectorization ./none:0
  [9] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
    @ Core ./boot.jl:602
 [10] integrate_avx
    @ ~/Codes/PhD/ScalarWave/src/TanhSinhQuadrature.jl:143 [inlined]
 [11] integrate_avx(f::var"#77#78", xmin::SVector{2, Float64}, xmax::SVector{2, Float64}, x::Vector{Float64}, w::Vector{Float64}, h::Float64)
    @ ScalarWave.TanhSinhQuadrature ~/Codes/PhD/ScalarWave/src/TanhSinhQuadrature.jl:173
 [12] top-level scope
    @ REPL[231]:1
 [13] top-level scope
    @ ~/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1428
Some type information was truncated. Use `show(err)` to see complete types.

chriselrod (Member) commented

ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
  [1] fieldcount(t::Any)
    @ Base ./reflection.jl:895
  [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Core.Box})

You have a Core.Box in your closure. Get rid of it.
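
For anyone hitting this: a captured variable that is reassigned gets boxed by Julia's closure lowering. A minimal illustration (my own example, not the code above):

function boxed()
    s = 0.0
    g() = s # the closure captures `s`
    s = 1.0 # reassigning a captured variable forces it into a Core.Box
    g()
end

# The usual fix: bind once and never reassign the captured variable.
function unboxed()
    s = 1.0
    g = let s = s # capture a fresh, never-reassigned binding
        () -> s
    end
    g()
end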

svretina (Contributor) commented

Thanks for your answer!
Indeed, writing the closure a bit differently eliminates the above error.
But the "outer" call is still not suitable for SIMD?

@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
    x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return integrate_avx(x1 -> f2(x1), xmin[1], xmax[1], x, w, h) # <- this doesn't pass the check_args
end
┌ Warning: #= /home/svretina/Codes/PhD/FastTanhSinhQuadrature.jl/src/FastTanhSinhQuadrature.jl:158 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ FastTanhSinhQuadrature ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:1148

chriselrod (Member) commented

As an aside,

function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        res = integrate_avx(g1, xmin[2], xmax[2], x, w, h)
        return res
    end
    g2(x1::T) where {T} = f1(x1)
    res = integrate_avx(g2, xmin[1], xmax[1], x, w, h)
    return res
end

looks like it is going to put `res` in a `Core.Box`.
I would simplify it to

function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        return integrate_avx(g1, xmin[2], xmax[2], x, w, h)
    end
    g2(x1::T) where {T} = f1(x1)
    return integrate_avx(g2, xmin[1], xmax[1], x, w, h)
end

to avoid that.

What are the types? Do you have a reproducer I can copy/paste?

You could try `dev`ing the package and modifying the check functions to print the types when they return false, so you know what the particular problematic argument was.

@inline function check_args(A::AbstractArray{T}) where {T}
  check_type(T) && check_device(ArrayInterface.device(A))
end
@inline check_args(A::BitVector) = true
@inline check_args(A::BitArray) = iszero(size(A, 1) & 7)
@inline check_args(::VectorizationBase.AbstractStridedPointer) = true
@inline function check_args(x)
  # @info "`LoopVectorization.check_args(::$(typeof(x))) == false`, therefore compiling a probably slow `@inbounds @fastmath` fallback loop." maxlog=1
  # DEBUG: @show @__LINE__, typeof(x)
  false
end
@inline check_args(A, B, C::Vararg{Any,K}) where {K} =
  check_args(A) && check_args(B, C...)
@inline check_args(::AbstractRange{T}) where {T} = check_type(T)
@inline check_args(::UpTri) = false
@inline check_args(::LoTri) = false
@inline check_args(::Diagonal) = false
@inline check_args(::Type{T}) where {T} = check_type(T)
@inline check_args(::Tuple{T,Vararg{T,K}}) where {T,K} = check_type(T)
"""
    check_type(::Type{T}) where {T}

Returns true if the element type is supported.
"""
@inline check_type(::Type{T}) where {T<:NativeTypes} = true
@inline function check_type(::Type{T}) where {T}
  # DEBUG: @show @__LINE__, T
  false
end
@inline check_type(::Type{T}) where {T<:AbstractSIMD} = true
@inline check_device(::ArrayInterface.CPUPointer) = true
@inline check_device(::ArrayInterface.CPUTuple) = true
@inline function check_device(x)
  # DEBUG: @show @__LINE__, typeof(x)
  false
end
function check_args_call(ls::LoopSet)
  q = Expr(:call, lv(:check_args))
  append!(q.args, ls.includedactualarrays)
  for r ∈ ls.outer_reductions
    push!(q.args, Expr(:call, :typeof, name(ls.operations[r])))
  end
  q
end
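
For example, that workflow could look like this (a sketch, assuming a standard Pkg setup):

using Pkg
Pkg.develop("LoopVectorization") # checks out an editable copy under ~/.julia/dev
# Then uncomment the `# DEBUG: @show ...` lines in src/condense_loopset.jl
# and restart Julia so the modified package is recompiled.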

svretina (Contributor) commented

The code is from FastTanhSinhQuadrature.jl, but here is an MWE:

using StaticArrays
using LoopVectorization

@inline function ordinate(t::T) where {T<:Real}
    return @fastmath tanh(T(π) / 2 * sinh(t))
end
@inline function weight(t::T) where {T<:Real}
    @fastmath tmp = cosh(T(π) / 2 * sinh(t))
    return @fastmath ((T(π) / 2) * cosh(t)) / (tmp * tmp)
end

@inline function inv_ordinate(t::T) where {T<:Real}
    return @fastmath asinh(log((one(T) + t) / (one(T) - t)) / T(π))
end

function tanhsinh(::Type{T}, n::Int) where {T<:AbstractFloat}
    tmax = inv_ordinate(prevfloat(one(T)))
    h = tmax / n
    t = h:h:tmax
    x = ordinate.(t)
    w = weight.(t)
    N = length(x)
    if n < 100
        return SVector{N,T}(x), SVector{N,T}(w), h
    else
        return x, w, h
    end
end

# no simd 
@inline function integrate(f::S, xmin::T, xmax::T, x::AbstractVector{T},
    w::AbstractVector{T}, h::T) where {T<:Real,S}
    @fastmath @inbounds begin
        Δx = (xmax - xmin) / 2
        x₀ = (xmax + xmin) / 2
        s = weight(zero(T)) * f(x₀)
        for i in 1:length(x)
            xp = x₀ + Δx * x[i]
            xm = x₀ - Δx * x[i]
            if xm > xmin
                s += w[i] * f(xm)
            end
            if xp < xmax
                s += w[i] * f(xp)
            end
        end
    end
    return Δx * h * s
end

# function using loopvectorization 
@inline function integrate_avx(f::S, xmin::T, xmax::T, x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @fastmath Δx = (xmax - xmin) / 2
    @fastmath x₀ = (xmax + xmin) / 2
    @fastmath s = weight(zero(T)) * f(x₀)
    @turbo for i in 1:length(x)
        Δxxi = Δx * x[i]
        xp = x₀ + Δxxi
        xm = x₀ - Δxxi
        s += w[i] * (f(xm) + f(xp))
    end
    return @fastmath Δx * h * s
end

## 2D
# here I only use the avx version for the inner call
# if you call integrate_avx in the outer call (in the return statement)
# the warning occurs
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
    x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return @inbounds integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return @inbounds integrate(x1 -> f2(x1), xmin[1], xmax[1], x, w, h)
end

x,w,h = tanhsinh(Float64, 40)
f(x,y) = x*y
lower = SVector{2,Float64}(0.0,0.0)
upper = SVector{2, Float64}(1.0,1.0)
integrate(f, lower, upper, x,w,h) # 0.2499999999999992
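
For reference, `ordinate` and `weight` above implement the standard tanh-sinh substitution (background, not something stated in the thread):

$$x(t) = \tanh\!\left(\frac{\pi}{2}\sinh t\right), \qquad w(t) = \frac{(\pi/2)\cosh t}{\cosh^{2}\!\left(\frac{\pi}{2}\sinh t\right)}$$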

chriselrod (Member) commented

It works for me. I don't get the warning, and I see @turbo is getting used.

svretina (Contributor) commented

Did you change the second call to this?

## 2D
# here I only use the avx version for the inner call
# if you call integrate_avx in the outer call (in the return statement)
# the warning occurs
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
    x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return @inbounds integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return @inbounds integrate_avx(x1 -> f2(x1), xmin[1], xmax[1], x, w, h)
end

chriselrod (Member) commented

No; after doing that, I do see the warning.
You can't nest @turbos. Only one layer can SIMD.
You could write out loops manually and place an @turbo on the outermost loop, and it'll decide what to do with all the loops.
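
For instance, a 2D version along those lines might look like this (a rough sketch: plain tensor-product accumulation, assuming `f` is built from operations @turbo can vectorize, and omitting the midpoint/boundary handling of the real quadrature):

function integrate2d_turbo(f, x, w, h)
    s = zero(eltype(x))
    @turbo for i in eachindex(x), j in eachindex(x) # one @turbo over the whole nest
        s += w[i] * w[j] * f(x[i], x[j])
    end
    h * h * s
end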

svretina (Contributor) commented

Thanks a lot for the help!
