Add a vectorized vany #481

Closed · Zentrik opened this issue Mar 27, 2023 · 1 comment

Zentrik commented Mar 27, 2023

using VectorizationBase

f(x, y) = VectorizationBase.vany(x > y)
x = VectorizationBase.Vec{8, Float32}(0)
y = VectorizationBase.Vec{8, Float32}(-1)
@code_native f(x, y)

It seems the relevant instructions are

vcmpltps   %ymm0, %ymm1, %ymm2
vmovmskps  %ymm2, %eax
testl      %eax, %eax

I think the following should be faster; however, I've had trouble benchmarking it, since these functions seem too small to benchmark accurately:

vcmpltps   %ymm0, %ymm1, %ymm2
vtestps    %ymm2, %ymm2
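
For reference, one way to make such a tiny kernel measurable is to amortize it over many vectors; a rough sketch (not from the original post; count_any and the input sizes are arbitrary choices):

using VectorizationBase, BenchmarkTools

f(x, y) = VectorizationBase.vany(x > y)  # same f as above

# Sum f over many vectors so the per-call cost is large enough to time reliably.
function count_any(xs, y)
    n = 0
    for t in xs
        n += f(VectorizationBase.Vec(t...), y)
    end
    n
end

xs = [ntuple(_ -> rand(Float32) - 0.5f0, 8) for _ in 1:10_000]
y = VectorizationBase.Vec{8, Float32}(0)
@btime count_any($xs, $y)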

We can get the vcmpltps + vtestps sequence with the following code, but I don't know how to get it working in the example you gave at https://discourse.julialang.org/t/how-to-speed-up-loopvectorized-code-3x-slower-than-c-code/96440/15:

using VectorizationBase, SIMDTypes

# Compare two Vec{N, Float32}s with fcmp ogt and sign-extend the <N x i1> result
# to <N x i8>, so each lane comes back as 0 or -1.
@generated function fcmp_ogt(x::VectorizationBase.Vec{N, Float32}, y::VectorizationBase.Vec{N, Float32}) where {N}
    s = """
    %res = fcmp fast ogt <$(N) x float> %0, %1
    %resb = sext <$(N) x i1> %res to <$(N) x i8>
    ret <$(N) x i8> %resb
    """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, SIMDTypes._Vec{N, Int8}, Tuple{SIMDTypes._Vec{N, Float32}, SIMDTypes._Vec{N, Float32}}, VectorizationBase.data(x), VectorizationBase.data(y))
    )
end

# Widen the 0 / -1 bytes to Int32 and reinterpret them as Float32, so every true
# lane has its sign bit set; vtestz returns 1 only when no sign bit is set.
function horizontal_or(x)
    a = VectorizationBase.vconvert(VectorizationBase.Vec{8, Int32}, x)
    b = VectorizationBase.vreinterpret(VectorizationBase.Vec{8, Float32}, a)
    c = VectorizationBase.data(b)
    return ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32, (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end

# vany-style reduction: true if any lane of a is greater than the corresponding lane of b.
function test(a, b)
    c = VectorizationBase.Vec(fcmp_ogt(a, b))
    horizontal_or(c)
end
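
As a quick sanity check (a sketch, assuming the definitions above are loaded), the vtestps path can be compared against the existing vany:

x = VectorizationBase.Vec{8, Float32}(0)
y = VectorizationBase.Vec{8, Float32}(-1)
test(x, y) == VectorizationBase.vany(x > y)  # both true: every lane of x is greater than y
@code_native test(x, y)  # expected to show vcmpltps + vtestps with no vmovmskps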
chriselrod (Member) commented

I guess it's pretty obvious that the second assembly is better than the first because it drops the vmovmskps, but per
https://bit.ly/3zcwLRK
https://bit.ly/3TLsJcp
the throughput is the same thanks to out-of-order execution. I'd still say the latter wins because there are fewer instructions and fewer micro-ops: it won't interfere as much with anything else going on.

Perhaps an implementation like this?

@inline function vany(m::Mask{8})
    # Shift each 0/1 lane into its sign bit, then view the lanes as Float32 so
    # vtestps can test the sign bits (vtestz returns 1 only if none are set).
    x = reinterpret(Float32, convert(Int32, m) << Int32(31))
    c = VectorizationBase.data(x)
    ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32, (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end

I think this is going to be suboptimal on AVX512 platforms, so the code should only generate something like this if you don't have AVX512.

Note that you should modify the above implementation to use sext instead of zext (as you already do in your llvmcall snippet), so that you can drop the bitshift I was forced to add.
Probably define a sext function.
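
A hypothetical sketch of that sext-based variant (not from the thread; mask_to_signbits and vany_vtest are invented names, and it assumes Mask{8} is backed by an 8-bit bitmask and that the target has AVX but not AVX512):

using VectorizationBase, SIMDTypes

# Sign-extend the 8 mask bits to <8 x i32> lanes (0 or -1) and view them as
# <8 x float>, so every true lane already has its sign bit set and no shift is needed.
@inline function mask_to_signbits(m::VectorizationBase.Mask{8})
    Base.llvmcall("""
        %bits = bitcast i8 %0 to <8 x i1>
        %ext = sext <8 x i1> %bits to <8 x i32>
        %f = bitcast <8 x i32> %ext to <8 x float>
        ret <8 x float> %f
        """, SIMDTypes._Vec{8, Float32}, Tuple{UInt8}, VectorizationBase.data(m) % UInt8)  # % UInt8: assumes an 8-bit bitmask
end

@inline function vany_vtest(m::VectorizationBase.Mask{8})
    c = mask_to_signbits(m)
    # vtestz returns 1 iff all sign bits of (c AND c) are zero, i.e. no lane is set;
    # as noted above, AVX512 targets would want a different lowering.
    ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32,
          (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end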

You could make a PR to VectorizationBase.

This issue was closed.