```julia
using VectorizationBase, SIMDTypes

@eval @generated function fcmp_ogt(x::VectorizationBase.Vec{N, Float32}, y::VectorizationBase.Vec{N, Float32}) where {N}
    s = """
        %res = fcmp fast ogt <$(N) x float> %0, %1
        %resb = sext <$(N) x i1> %res to <$(N) x i8>
        ret <$(N) x i8> %resb
        """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, SIMDTypes._Vec{N, Int8}, Tuple{SIMDTypes._Vec{N, Float32}, SIMDTypes._Vec{N, Float32}}, VectorizationBase.data(x), VectorizationBase.data(y))
    )
end

function horizontal_or(x)
    a = VectorizationBase.vconvert(VectorizationBase.Vec{8, Int32}, x)
    b = VectorizationBase.vreinterpret(VectorizationBase.Vec{8, Float32}, a)
    c = VectorizationBase.data(b)
    return ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32, (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end

function test(a, b)
    c = VectorizationBase.Vec(fcmp_ogt(a, b))
    horizontal_or(c)
end
```
I guess it's pretty obvious that the second assembly is better than the first because it omits a vmovmskps, but see https://bit.ly/3zcwLRK and https://bit.ly/3TLsJcp
The throughput is the same thanks to out of order execution, but I'd still say the latter wins because there are fewer instructions and fewer micro-ops: it won't interfere as much with anything else going on.
Perhaps an implementation like this?
```julia
@inline function vany(m::Mask{8})
    x = reinterpret(Float32, convert(Int32, m) << Int32(31))
    c = VectorizationBase.data(x)
    ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32, (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
end
```
I think this is going to be suboptimal on AVX512 platforms, so the code should only generate something like this if you don't have AVX512.
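To illustrate the idea, a minimal sketch of such feature gating follows. The predicate `has_avx512()` is hypothetical (a stand-in for whatever compile-time CPU-feature query is actually available), and the AVX512 branch assumes a `Mask`-to-`UInt8` conversion:

```julia
# Hypothetical feature gate: `has_avx512()` stands in for a real
# compile-time CPU-feature check; hardcoded here for illustration only.
has_avx512() = false

if has_avx512()
    # With AVX512 mask registers, `vany` reduces to an integer test on the
    # mask bits (assumes `convert(UInt8, m)` extracts them).
    @inline vany(m::Mask{8}) = convert(UInt8, m) != 0x00
else
    # Pre-AVX512 fallback: the vtestz-based implementation sketched above.
    @inline function vany(m::Mask{8})
        x = reinterpret(Float32, convert(Int32, m) << Int32(31))
        c = VectorizationBase.data(x)
        ccall("llvm.x86.avx.vtestz.ps.256", llvmcall, Int32,
              (SIMDTypes._Vec{8, Float32}, SIMDTypes._Vec{8, Float32}), c, c) == 0
    end
end
```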
Note that you should modify the above implementation to use sext (as your fcmp_ogt already does) instead of zext, so that the bitshift I was forced to add can be dropped. Probably define a sext function.
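Such a sext helper might look like the following sketch, mirroring the llvmcall style of fcmp_ogt above; the name `sext` and the Int8-to-Int32 lane widths are assumptions for illustration:

```julia
# Sketch: a generated sext helper in the style of `fcmp_ogt` above.
# Sign-extends Int8 lanes to Int32 lanes, so a true lane (0xFF) becomes
# 0xFFFFFFFF with its sign bit set, making the shift in `vany` unnecessary.
@generated function sext(x::VectorizationBase.Vec{N, Int8}, ::Type{Int32}) where {N}
    s = """
        %res = sext <$(N) x i8> %0 to <$(N) x i32>
        ret <$(N) x i32> %res
        """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, SIMDTypes._Vec{N, Int32}, Tuple{SIMDTypes._Vec{N, Int8}}, VectorizationBase.data(x))
    )
end
```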
It seems the relevant instructions are
I think this should be faster; however, I've had trouble benchmarking it, as these functions seem too small to benchmark accurately.
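One way to get a more stable signal (a sketch, assuming BenchmarkTools is available and `test` from above is defined) is to amortize the call over a batch of precomputed inputs, so the loop body dominates the per-call timing overhead:

```julia
using BenchmarkTools

# Benchmark `test` over a batch of precomputed vectors; the accumulated
# Bool result prevents the compiler from deleting the calls.
function batch(as, bs)
    acc = 0
    @inbounds for i in eachindex(as)
        acc += test(as[i], bs[i])
    end
    acc
end

as = [VectorizationBase.Vec(ntuple(_ -> randn(Float32), 8)...) for _ in 1:1024]
bs = [VectorizationBase.Vec(ntuple(_ -> randn(Float32), 8)...) for _ in 1:1024]
@btime batch($as, $bs)  # divide the reported time by 1024 for per-call cost
```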
We can get these instructions with the following code, but I don't know how to get them working in the example you gave at https://discourse.julialang.org/t/how-to-speed-up-loopvectorized-code-3x-slower-than-c-code/96440/15