Add a vectorized vany #481

Closed · Zentrik opened this issue Mar 27, 2023 · 1 comment

Zentrik commented Mar 27, 2023

using VectorizationBase

f(x, y) = VectorizationBase.vany(x > y)
x = VectorizationBase.Vec{8, Float32}(0)
y = VectorizationBase.Vec{8, Float32}(-1)
@code_native f(x, y)

It seems the relevant instructions are

vcmpltps   %ymm0, %ymm1, %ymm2
vmovmskps  %ymm2, %eax
testl      %eax, %eax

I think the following should be faster; however, I've had trouble benchmarking it, since these functions seem too small to benchmark accurately:

vcmpltps   %ymm0, %ymm1, %ymm2
vtestps    %ymm2, %ymm2
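
For reference, one way to make such a tiny kernel measurable is to amortize it over many vectors; a rough sketch (not from the original post; count_any and the input sizes are arbitrary choices):

using VectorizationBase, BenchmarkTools

f(x, y) = VectorizationBase.vany(x > y)  # same f as above

# Sum f over many vectors so the per-call cost is large enough to time reliably.
function count_any(xs, y)
    n = 0
    for t in xs
        n += f(VectorizationBase.Vec(t...), y)
    end
    n
end

xs = [ntuple(_ -> rand(Float32) - 0.5f0, 8) for _ in 1:10_000]
y = VectorizationBase.Vec{8, Float32}(0)
@btime count_any($xs, $y)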

We can get the vcmpltps + vtestps sequence with the following code, but I don't know how to get it working in the example you gave at https://discourse.julialang.org/t/how-to-speed-up-loopvectorized-code-3x-slower-than-c-code/96440/15:

using VectorizationBase, SIMDTypes

# Compare two Vec{N, Float32}s with fcmp ogt and sign-extend the <N x i1> result
# to <N x i8>, so each lane comes back as 0 or -1.
@generated function fcmp_ogt(x::VectorizationBase.Vec{N, Float32}, y::VectorizationBase.Vec{N, Float32}) where {N}
    s = """
    %res = fcmp fast ogt <$(N) x float> %0, %1
    %resb = sext <$(N) x i1> %res to <$(N) x i8>
    ret <$(N) x i8> %resb
    """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, SIMDTypes._Vec{N, Int8}, Tuple{SIMDTypes._Vec{N, Float32}, SIMDTypes._Vec{N, Float32}}, VectorizationBase.data(x), VectorizationBase.data(y))
    )
end

# Widen the 0 / -1 bytes to Int32 and reinterpret them as Float32, so every true
# lane has its sign bit set; vtestz returns 1 only when no sign bit is set.
function horizontal_or(x)
    a = VectorizationBase.vconvert(VectorizationBase.Vec{8, Int32}, x)
    b = VectorizationBase.vreinterpret(VectorizationBase.Vec{8, Float32}, a)
    c = VectorizationBase.data(b)
    return ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32, (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end

# vany-style reduction: true if any lane of a is greater than the corresponding lane of b.
function test(a, b)
    c = VectorizationBase.Vec(fcmp_ogt(a, b))
    horizontal_or(c)
end
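
As a quick sanity check (a sketch, assuming the definitions above are loaded), the vtestps path can be compared against the existing vany:

x = VectorizationBase.Vec{8, Float32}(0)
y = VectorizationBase.Vec{8, Float32}(-1)
test(x, y) == VectorizationBase.vany(x > y)  # both true: every lane of x is greater than y
@code_native test(x, y)  # expected to show vcmpltps + vtestps with no vmovmskps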
chriselrod (Member) commented

I guess it's pretty obvious that the second assembly is better than the first because it drops the vmovmskps, but per
https://bit.ly/3zcwLRK
https://bit.ly/3TLsJcp
the throughput is the same thanks to out-of-order execution. I'd still say the latter wins because there are fewer instructions and fewer micro-ops: it won't interfere as much with anything else going on.

Perhaps an implementation like this?

@inline function vany(m::Mask{8})
    # Shift each 0/1 lane into its sign bit, then view the lanes as Float32 so
    # vtestps can test the sign bits (vtestz returns 1 only if none are set).
    x = reinterpret(Float32, convert(Int32, m) << Int32(31))
    c = VectorizationBase.data(x)
    ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32, (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end

I think this is going to be suboptimal on AVX512 platforms, so the code should only generate something like this if you don't have AVX512.

Note that you should modify the above implementation to use sext instead of zext (as you already do in your llvmcall snippet), so that you can drop the bitshift I was forced to add.
Probably define a sext function.
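
A hypothetical sketch of that sext-based variant (not from the thread; mask_to_signbits and vany_vtest are invented names, and it assumes Mask{8} is backed by an 8-bit bitmask and that the target has AVX but not AVX512):

using VectorizationBase, SIMDTypes

# Sign-extend the 8 mask bits to <8 x i32> lanes (0 or -1) and view them as
# <8 x float>, so every true lane already has its sign bit set and no shift is needed.
@inline function mask_to_signbits(m::VectorizationBase.Mask{8})
    Base.llvmcall("""
        %bits = bitcast i8 %0 to <8 x i1>
        %ext = sext <8 x i1> %bits to <8 x i32>
        %f = bitcast <8 x i32> %ext to <8 x float>
        ret <8 x float> %f
        """, SIMDTypes._Vec{8, Float32}, Tuple{UInt8}, VectorizationBase.data(m) % UInt8)  # % UInt8: assumes an 8-bit bitmask
end

@inline function vany_vtest(m::VectorizationBase.Mask{8})
    c = mask_to_signbits(m)
    # vtestz returns 1 iff all sign bits of (c AND c) are zero, i.e. no lane is set;
    # as noted above, AVX512 targets would want a different lowering.
    ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32,
          (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end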

You could make a PR to VectorizationBase.

This issue was closed.