-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
faster min/max for IEEEFloats #45532
Conversation
Can you give me a test of this on every pair of |
My locate bench shows: (Compared with #41709) |
Timing this and #41709 on M1 mac:
Same ratios for Float64 and Float32. |
@mcabbott would you mind bench this version on M1? function __min(x, y)
diff = x - y
min = ifelse(signbit(diff), x, y)
anynan = isnan(x)|isnan(y)
ifelse(anynan, diff, min)
end
function __max(x, y)
diff = x - y
max = ifelse(signbit(diff), y, x)
anynan = isnan(x)|isnan(y)
ifelse(anynan, diff, max)
end
function __minmax(x, y)
diff = x - y
sdiff = signbit(diff)
min = ifelse(sdiff, x, y)
max = ifelse(sdiff, y, x)
anynan = isnan(x)|isnan(y)
min = ifelse(anynan, diff, min)
max = ifelse(anynan, diff, max)
min, max
end BTW, does M1 have native LLVM maximum support? I guess that should be the fastest choice. |
On M1
|
At least |
Testing on every function makequiet(x)
# set the quiet bit of a NaN
u = reinterpret(Unsigned, x)
qbit = one(u) << (Base.significand_bits(typeof(x))-1) # quiet bit in NaN
return reinterpret(typeof(x), u | qbit)
end
i16s = typemin(Int16):typemax(Int16)
for ix in i16s, iy in i16s
x = reinterpret(Float16, ix)
y = reinterpret(Float16, iy)
m = min(x, y)
if x < y
m == x || println(x,",",y," -> ",m)
elseif x > y
m == y || println(x,",",y," -> ",m)
elseif x == y
m == x == y && signbit(m) == signbit(x) | signbit(y) || println(x,",",y," -> ",m)
elseif isnan(x) | isnan(y)
makequiet(m) === makequiet(x) || makequiet(m) === makequiet(y) || println(x,",",y," -> ",m," : ",ix,",",iy," -> ",reinterpret(Int16,m))
end
end which took 25 minutes on my machine. I did a similar test for My machine lacks native I suppose the faster approach would be to promote to |
Non-native Looking at I'm torn. Scalar operation is probably more common in the places where performance is critical, but ultimately the difference won't add up to anything massive. Vectorized operation gives the potential to magnify a small performance difference, but again the I tried to tweak |
yeah. The only reason I brought up |
Certainly. I was referring to the other line of discussion looking at |
Well, I try function __min(x, y)
diff = x - y
min = ifelse(signbit(diff), x, y)
ifelse(isnan(diff), diff, min)
end The branch is still there but it saves one |
Would be nice if these made
NNlib avoids
|
I am aware these algorithms perform very poorly for reductions. I don't have a good explanation as to why except that part of it is that they don't SIMD in reductions. When looking at performance, be warned that I have a different approach (related to the |
That's great! I tried more thorough unroll to make |
This version will yield the incorrect results |
LLVM already has a issue to track X86's Therefore, I think it is more important to avoid possible degradation (as reported by @mcabbott). Although I can't reproduce that on my old x86... |
Having spent some time thinking about it, I think that the It'd be nice if we can just use @N5N3, you are managing #41709 which accomplishes everything this PR would. Would you like me to close this and you can move forward with that one or should I adopt your versions and continue this PR? |
I think keep working on #41709 seems simplier as it covered more cases. (In fact I think it's mergeable) |
Lets focus on the other PR then. |
Faster
min
,max
, andminmax
forIEEEFloat
(Float16
,Float32
,Float64
) types.Setup:
Before:
After:
map!
indicates the straight-line performance whilebroadcast!
allows for SIMD. This PR increases throughput formin
/max
by about 35%,minmax
by 25%, and enablesminmax
to SIMD (additional 225% throughput).Note that the
AbstractFloat
-specializedminmax
was deleted and theReal
version redefined to simplyminmax(x,y) = min(x,y),max(x,y)
. The new version is faster forIEEEFloat
(with themin
/max
changes), does not change performance forBitInteger
, and does not apply toBigFloat
.The
AbstractFloat
-specialized versions ofmin
/max
are now dead toBase
types (BigFloat
has its own specialization andIEEEFloat
is superseded by the new specializations in this PR). I was tempted to remove these methods, but that would be breaking for any user-definedAbstractFloat
types that rely on them. I'm not sure that any do, but I didn't feel the need to break things in this PR.