
faster min/max for IEEEFloats #45532

Closed
wants to merge 2 commits

Conversation

mikmoore
Contributor

Faster min, max, and minmax for IEEEFloat (Float16, Float32, Float64) types.

Setup:

using BenchmarkTools

a = randn(1000);
b = randn(size(a));
z = similar(a);
zz = Array{NTuple{2,Float64}}(undef,size(a));

@btime map!(min, $z, $a, $b);
@btime map!(max, $z, $a, $b);
@btime map!(minmax, $zz, $a, $b);

@btime broadcast!(min, $z, $a, $b);
@btime broadcast!(max, $z, $a, $b);
@btime broadcast!(minmax, $zz, $a, $b);

Before:

  1.790 μs (0 allocations: 0 bytes)
  1.780 μs (0 allocations: 0 bytes)
  2.244 μs (0 allocations: 0 bytes)
  308.502 ns (0 allocations: 0 bytes)
  307.631 ns (0 allocations: 0 bytes)
  1.970 μs (0 allocations: 0 bytes)

After:

  1.240 μs (0 allocations: 0 bytes)
  1.240 μs (0 allocations: 0 bytes)
  1.790 μs (0 allocations: 0 bytes)
  227.292 ns (0 allocations: 0 bytes)
  227.292 ns (0 allocations: 0 bytes)
  552.128 ns (0 allocations: 0 bytes)

map! shows straight-line (scalar) performance, while broadcast! allows SIMD. This PR increases throughput by about 35% for min/max and 25% for minmax, and additionally enables minmax to vectorize (a further 225% throughput gain).

Note that the AbstractFloat-specialized minmax was deleted and the Real version redefined to simply minmax(x,y) = min(x,y),max(x,y). The new version is faster for IEEEFloat (with the min/max changes), does not change performance for BitInteger, and does not apply to BigFloat.
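
Written out, the fallback described above is just (the PR diff has the exact signature; Real shown here as described):

minmax(x::Real, y::Real) = min(x, y), max(x, y)

For IEEEFloat, the new specialized min/max inline into this fallback, which is what lets minmax vectorize.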

The AbstractFloat-specialized versions of min/max are now dead to Base types (BigFloat has its own specialization and IEEEFloat is superseded by the new specializations in this PR). I was tempted to remove these methods, but that would be breaking for any user-defined AbstractFloat types that rely on them. I'm not sure that any do, but I didn't feel the need to break things in this PR.

@oscardssmith added the performance (Must go faster) and maths (Mathematical functions) labels on Jun 1, 2022
@oscardssmith
Member

Can you give me a test of this on every pair of Float16 values, just to be sure this handles all the edge cases? Other than that, this looks great.

@N5N3
Member

N5N3 commented Jun 1, 2022

My local benchmark shows (compared with #41709):
Float32/Float64: this PR has faster map!, but slightly slower broadcast!.
Float16: this PR seems slower in both cases. (My PC has no native Float16 support.)

@mcabbott
Contributor

mcabbott commented Jun 1, 2022

Timing this and #41709 on M1 mac:

  • This has no effect on map! speed, while 41709 takes 60% as long
  • Both improve broadcast!, this one better: 73% as long, vs 41709 at 79% as long.

Same ratios for Float64 and Float32.

@N5N3
Member

N5N3 commented Jun 1, 2022

@mcabbott would you mind benchmarking this version on M1?

function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)  # pick x when x - y is negative (or -0.0)
    anynan = isnan(x)|isnan(y)
    ifelse(anynan, diff, min)          # diff is NaN whenever either input is NaN
end

function __max(x, y)
    diff = x - y
    max = ifelse(signbit(diff), y, x)  # pick y when x - y is negative
    anynan = isnan(x)|isnan(y)
    ifelse(anynan, diff, max)
end

function __minmax(x, y)
    diff = x - y
    sdiff = signbit(diff)
    min = ifelse(sdiff, x, y)
    max = ifelse(sdiff, y, x)
    anynan = isnan(x)|isnan(y)
    min = ifelse(anynan, diff, min)    # propagate NaN through both results
    max = ifelse(anynan, diff, max)
    min, max
end

BTW, does M1 have native LLVM maximum support? I guess that should be the fastest choice.

@mcabbott
Contributor

mcabbott commented Jun 1, 2022

On M1 __min & __max take

  • the same 60% time for map! as 41709
  • 62% for Float64 and 52% for Float32 in broadcast!.

@N5N3
Member

N5N3 commented Jun 1, 2022

At least broadcast! is faster. (Another CPU-dependent optimization.)

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Testing on every Float16:

function makequiet(x)
	# set the quiet bit of a NaN
	u = reinterpret(Unsigned, x)
	qbit = one(u) << (Base.significand_bits(typeof(x))-1) # quiet bit in NaN
	return reinterpret(typeof(x), u | qbit)
end

i16s = typemin(Int16):typemax(Int16)
for ix in i16s, iy in i16s
	x = reinterpret(Float16, ix)
	y = reinterpret(Float16, iy)
	m = min(x, y)
	if x < y
		m == x || println(x,",",y," -> ",m)
	elseif x > y
		m == y || println(x,",",y," -> ",m)
	elseif x == y
		m == x == y && signbit(m) == signbit(x) | signbit(y) || println(x,",",y," -> ",m)
	elseif isnan(x) | isnan(y)
		makequiet(m) === makequiet(x) || makequiet(m) === makequiet(y) || println(x,",",y," -> ",m," : ",ix,",",iy," -> ",reinterpret(Int16,m))
	end
end

which took 25 minutes on my machine. I did a similar test for max. Neither printed any failures.

My machine lacks native Float16 support so Float16 is tragically slow. Without native support, it's likely that many Float16 operations will be slow and not just the ones here. So I don't know how critical that performance is.

I suppose the faster approach would be to promote to Float32 for non-native Float16. However, it would seem better to make the fundamental operations (<, >, +, isnan) faster for Float16 (if only by promotion) and then the performance here would be less horrid.
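
A minimal sketch of the promotion idea (the helper name is illustrative, not part of this PR). Since Float16 -> Float32 conversion is exact and the result of min is one of the inputs (or NaN), demoting back is lossless:

_min_via_f32(x::Float16, y::Float16) = Float16(min(Float32(x), Float32(y)))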

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Non-native Float16 performance should not be a priority, in my opinion. Someone without native Float16 is unlikely to use them and will suffer performance everywhere if they do. I imagine there's no qualitative difference when using native Float16 as compared to native Float32/Float64, since all candidates here are based on primitive float and logical instructions.

Looking at Float64, I see better performance from this PR's min with map! and better from __min with broadcast!. The discrepancy is easy to see in the code_native output. On my Intel machine, this PR's min is branchless in both cases, whereas __min uses one branch in the map (and scalar) case but manages to replace it with a vblendvpd when broadcast (SIMD).

I'm torn. Scalar operation is probably more common in the places where performance is critical, but ultimately the difference won't add up to anything massive. Vectorized operation gives the potential to magnify a small performance difference, but again the min/max call is likely only a small part of anything but a micro-benchmark. In either case, the difference is only 10-20%.

I tried to tweak __min to convince it to drop the branch, but couldn't. If someone can manage to do that, that would likely yield the best implementation.
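
For anyone who wants to repeat the inspection, the comparison above can be reproduced with the standard reflection macro (using the __min definition from earlier in the thread):

using InteractiveUtils  # provides @code_native outside the REPL
@code_native debuginfo=:none __min(1.0, 2.0)  # look for branches vs. vblendvpd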

@oscardssmith
Member

Yeah. The only reason I brought up Float16 is that it can be tested exhaustively, which is good because, for an operation like min/max, all the floating-point types have the same edge cases.

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Certainly. I was referring to the other line of discussion looking at Float16 performance.

@N5N3
Member

N5N3 commented Jun 1, 2022

Well, I tried

function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)
    ifelse(isnan(diff), diff, min)
end

The branch is still there, but it saves one vmovapd %xmm1, %xmm3 (and about 10% time on my 9600). broadcast!'s speed stays the same, so I didn't check the IR.
@mikmoore would you mind rechecking it?

@mcabbott
Contributor

mcabbott commented Jun 1, 2022

Would be nice if these made minimum fast; that's one place they would be called many times in a simple pattern. What am I doing wrong?

julia> let x = randn(1000)
         @btime minimum($x)
         @btime reduce(min, $x; init=Inf)   # (Using instead (x,y) -> min(x,y) gives identical time. Without init, same as minimum.)
         @btime reduce(_min, $x; init=Inf)  # PR 45532
         @btime reduce(__min, $x; init=Inf) # from N5N3, first version
       end;
  1.404 μs (0 allocations: 0 bytes)
  4.006 μs (0 allocations: 0 bytes)
  5.243 μs (0 allocations: 0 bytes)
  4.899 μs (0 allocations: 0 bytes)

NNlib avoids max in some functions which are commonly broadcasted. Perhaps surprising results:

julia> begin
         relu6_a(x) = min(max(0, x), 6)  # naive
         relu6_b(x) = clamp(x, 0, 6)     # what NNlib does, faster
         relu6_c(x) = __min(__max(0.0, x), 6.0) 
       end;

julia> let x = randn(1000)
           y = similar(x)
           @btime $y .= relu6_a.($x)
           @btime $y .= relu6_b.($x)
           @btime $y .= relu6_c.($x)
       end;
  325.543 ns (0 allocations: 0 bytes)
  144.737 ns (0 allocations: 0 bytes)
  554.144 ns (0 allocations: 0 bytes)

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

I am aware these algorithms perform very poorly for reductions. I don't have a good explanation as to why, except that part of it is that they don't SIMD in reductions.

When looking at performance, be warned that minimum calls reduce(min, ...) (similarly for the max variants), and this goes through a specialization that does not use min at all (look for the min/max specialization of mapreduce_impl in base/reduce.jl). To evade it, use reduce((x,y)->min(x,y), ...). Also note that this specialization inadvertently results in a massive 3x slowdown for integer minimum compared to the non-specializing reduce((x,y)->min(x,y), ...).
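
For example:

using BenchmarkTools
x = randn(1000);
@btime reduce(min, $x);                 # routed through the specialized mapreduce_impl
@btime reduce((a, b) -> min(a, b), $x); # anonymous function evades the specialization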

I have a different approach (related to the isless implementation) for float minimum/maximum that is more than 4x faster than what exists now and should avoid the integer slowdown. I will finalize it and open a PR when I get a chance.

@N5N3
Member

N5N3 commented Jun 1, 2022

I have a different approach (related to the isless implementation) for float minimum/maximum that is more than 4x faster than what exists now and should avoid the integer slowdown. I will finalize it and open a PR when I get a chance.

That's great! I tried a more thorough unroll to make minimum faster in #43725, but it increases TTFP. Hope we can close #31442 this time.

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Well, I tried

function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)
    ifelse(isnan(diff), diff, min)
end

The branch is still there, but it saves one vmovapd %xmm1, %xmm3 (and about 10% time on my 9600). broadcast!'s speed stays the same, so I didn't check the IR. @mikmoore would you mind rechecking it?

This version yields the incorrect results __min(Inf,Inf) -> NaN and __min(-Inf,-Inf) -> NaN.
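
The problem is that x - y is NaN not only when an input is NaN but also for equal infinities, so checking isnan(diff) misclassifies those:

julia> Inf - Inf
NaN

julia> __min(Inf, Inf)  # with the isnan(diff) version; should be Inf
NaN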

@N5N3
Member

N5N3 commented Jun 2, 2022

LLVM already has an issue tracking X86's minimum/maximum: llvm/llvm-project#53353
Since the optimization is somewhat platform dependent, I believe the best choice is to use LLVM's {min,max}imum when it's available.
This should also enable auto-vectorization and make foldl(min, a) as fast as reduce(min, a).

Therefore, I think it is more important to avoid possible regressions (as reported by @mcabbott), although I can't reproduce that on my old x86...
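
For reference, a sketch of calling the intrinsic directly from Julia (assuming llvm.minimum.f64 is legal for the target; this is not part of this PR):

# LLVM intrinsics can be reached via ccall with the llvmcall calling convention
llvm_minimum(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)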

@mikmoore
Contributor Author

mikmoore commented Jun 15, 2022

Having spent some time thinking about it, I think that the __min and __max proposed by @N5N3 are the best candidates. Those versions performed better than the original versions proposed in this PR in some situations and almost as well in others. But I think that if LLVM eventually manages to produce the code I wish it would, these will be indisputably the best.

It'd be nice if we could just use minmax(x,y) = min(x,y),max(x,y) and rely on inlining to deliver full efficiency, but we'll need to see whether that happens or whether a manual definition is required.

@N5N3, you are managing #41709 which accomplishes everything this PR would. Would you like me to close this and you can move forward with that one or should I adopt your versions and continue this PR?

@N5N3
Member

N5N3 commented Jun 22, 2022

I think continuing with #41709 seems simpler, as it covers more cases. (In fact, I think it's mergeable.)
The regression on M1 looks strange to me, but we can fix it with multiversioning, like fma.
Ping @oscardssmith for advice.

@oscardssmith
Member

Let's focus on the other PR, then.
