
faster min/max for IEEEFloats #45532

Closed
wants to merge 2 commits

Conversation

mikmoore
Contributor

Faster min, max, and minmax for IEEEFloat (Float16, Float32, Float64) types.

Setup:

using BenchmarkTools

a = randn(1000);
b = randn(size(a));
z = similar(a);
zz = Array{NTuple{2,Float64}}(undef,size(a));

@btime map!(min, $z, $a, $b);
@btime map!(max, $z, $a, $b);
@btime map!(minmax, $zz, $a, $b);

@btime broadcast!(min, $z, $a, $b);
@btime broadcast!(max, $z, $a, $b);
@btime broadcast!(minmax, $zz, $a, $b);

Before:

  1.790 μs (0 allocations: 0 bytes)
  1.780 μs (0 allocations: 0 bytes)
  2.244 μs (0 allocations: 0 bytes)
  308.502 ns (0 allocations: 0 bytes)
  307.631 ns (0 allocations: 0 bytes)
  1.970 μs (0 allocations: 0 bytes)

After:

  1.240 μs (0 allocations: 0 bytes)
  1.240 μs (0 allocations: 0 bytes)
  1.790 μs (0 allocations: 0 bytes)
  227.292 ns (0 allocations: 0 bytes)
  227.292 ns (0 allocations: 0 bytes)
  552.128 ns (0 allocations: 0 bytes)

map! shows straight-line (scalar) performance, while broadcast! allows SIMD. This PR increases throughput by about 35% for min/max and 25% for minmax, and additionally enables minmax to vectorize (a further 225% throughput gain).

Note that the AbstractFloat-specialized minmax was deleted and the Real version redefined to simply minmax(x,y) = min(x,y),max(x,y). The new version is faster for IEEEFloat (with the min/max changes), does not change performance for BitInteger, and does not apply to BigFloat.
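
Written out, the fallback described above is just (the PR diff has the exact signature; Real shown here as described):

minmax(x::Real, y::Real) = min(x, y), max(x, y)

For IEEEFloat, the new specialized min/max inline into this fallback, which is what lets minmax vectorize.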

The AbstractFloat-specialized versions of min/max are now dead to Base types (BigFloat has its own specialization and IEEEFloat is superseded by the new specializations in this PR). I was tempted to remove these methods, but that would be breaking for any user-defined AbstractFloat types that rely on them. I'm not sure that any do, but I didn't feel the need to break things in this PR.

@oscardssmith added the performance (Must go faster) and maths (Mathematical functions) labels on Jun 1, 2022
@oscardssmith
Member

Can you give me a test of this on every pair of Float16 values, just to be sure this handles all the edge cases? Other than that, this looks great.

@N5N3
Member

N5N3 commented Jun 1, 2022

My local benchmark shows (compared with #41709):
Float32/Float64: this PR has faster map!, but slightly slower broadcast!.
Float16: this PR seems slower in both cases. (My PC has no native Float16 support.)

@mcabbott
Contributor

mcabbott commented Jun 1, 2022

Timing this and #41709 on M1 mac:

  • This has no effect on map! speed, while 41709 takes 60% as long
  • Both improve broadcast!, this one better: 73% as long, vs 41709 at 79% as long.

Same ratios for Float64 and Float32.

@N5N3
Member

N5N3 commented Jun 1, 2022

@mcabbott would you mind benchmarking this version on M1?

function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)  # pick x when x - y is negative (or -0.0)
    anynan = isnan(x)|isnan(y)
    ifelse(anynan, diff, min)          # diff is NaN whenever either input is NaN
end

function __max(x, y)
    diff = x - y
    max = ifelse(signbit(diff), y, x)  # pick y when x - y is negative
    anynan = isnan(x)|isnan(y)
    ifelse(anynan, diff, max)
end

function __minmax(x, y)
    diff = x - y
    sdiff = signbit(diff)
    min = ifelse(sdiff, x, y)
    max = ifelse(sdiff, y, x)
    anynan = isnan(x)|isnan(y)
    min = ifelse(anynan, diff, min)    # propagate NaN through both results
    max = ifelse(anynan, diff, max)
    min, max
end

BTW, does M1 have native LLVM maximum support? I guess that should be the fastest choice.

@mcabbott
Contributor

mcabbott commented Jun 1, 2022

On M1 __min & __max take

  • the same 60% time for map! as 41709
  • 62% for Float64 and 52% for Float32 in broadcast!.

@N5N3
Member

N5N3 commented Jun 1, 2022

At least broadcast! is faster. (Another CPU-dependent optimization.)

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Testing on every Float16:

function makequiet(x)
	# set the quiet bit of a NaN
	u = reinterpret(Unsigned, x)
	qbit = one(u) << (Base.significand_bits(typeof(x))-1) # quiet bit in NaN
	return reinterpret(typeof(x), u | qbit)
end

i16s = typemin(Int16):typemax(Int16)
for ix in i16s, iy in i16s
	x = reinterpret(Float16, ix)
	y = reinterpret(Float16, iy)
	m = min(x, y)
	if x < y
		m == x || println(x,",",y," -> ",m)
	elseif x > y
		m == y || println(x,",",y," -> ",m)
	elseif x == y
		m == x == y && signbit(m) == signbit(x) | signbit(y) || println(x,",",y," -> ",m)
	elseif isnan(x) | isnan(y)
		makequiet(m) === makequiet(x) || makequiet(m) === makequiet(y) || println(x,",",y," -> ",m," : ",ix,",",iy," -> ",reinterpret(Int16,m))
	end
end

which took 25 minutes on my machine. I did a similar test for max. Neither printed any failures.

My machine lacks native Float16 support so Float16 is tragically slow. Without native support, it's likely that many Float16 operations will be slow and not just the ones here. So I don't know how critical that performance is.

I suppose the faster approach would be to promote to Float32 for non-native Float16. However, it would seem better to make the fundamental operations (<, >, +, isnan) faster for Float16 (if only by promotion) and then the performance here would be less horrid.
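
A minimal sketch of the promotion idea (the helper name is illustrative, not part of this PR). Since Float16 -> Float32 conversion is exact and the result of min is one of the inputs (or NaN), demoting back is lossless:

_min_via_f32(x::Float16, y::Float16) = Float16(min(Float32(x), Float32(y)))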

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Non-native Float16 performance should not be a priority, in my opinion. Someone without native Float16 is unlikely to use them and will suffer performance everywhere if they do. I imagine there's no qualitative difference when using native Float16 as compared to native Float32/Float64, since all candidates here are based on primitive float and logical instructions.

Looking at Float64, I see better performance from this PR's min with map! and better from __min with broadcast!. The discrepancy is easy to see in the code_native output. On my Intel machine, this PR's min is branchless in both cases, whereas __min uses one branch in the map (and scalar) case but manages to replace it with a vblendvpd when broadcast (SIMD).

I'm torn. Scalar operation is probably more common in the places where performance is critical, but ultimately the difference won't add up to anything massive. Vectorized operation gives the potential to magnify a small performance difference, but again the min/max call is likely only a small part of anything but a micro-benchmark. In either case, the difference is only 10-20%.

I tried to tweak __min to convince it to drop the branch, but couldn't. If someone can manage to do that, that would likely yield the best implementation.
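
For anyone who wants to repeat the inspection, the comparison above can be reproduced with the standard reflection macro (using the __min definition from earlier in the thread):

using InteractiveUtils  # provides @code_native outside the REPL
@code_native debuginfo=:none __min(1.0, 2.0)  # look for branches vs. vblendvpd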

@oscardssmith
Member

Yeah. The only reason I brought up Float16 is that it can be tested exhaustively, which is good because, for an operation like min/max, all the floating-point types have the same edge cases.

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Certainly. I was referring to the other line of discussion looking at Float16 performance.

@N5N3
Member

N5N3 commented Jun 1, 2022

Well, I tried

function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)
    ifelse(isnan(diff), diff, min)
end

The branch is still there, but it saves one vmovapd %xmm1, %xmm3 (and about 10% time on my 9600). broadcast!'s speed stays the same, so I didn't check the IR.
@mikmoore would you mind rechecking it?

@mcabbott
Contributor

mcabbott commented Jun 1, 2022

Would be nice if these made minimum fast; that's one place they would be called many times in a simple pattern. What am I doing wrong?

julia> let x = randn(1000)
         @btime minimum($x)
         @btime reduce(min, $x; init=Inf)   # (Using instead (x,y) -> min(x,y) gives identical time. Without init, same as minimum.)
         @btime reduce(_min, $x; init=Inf)  # PR 45532
         @btime reduce(__min, $x; init=Inf) # from N5N3, first version
       end;
  1.404 μs (0 allocations: 0 bytes)
  4.006 μs (0 allocations: 0 bytes)
  5.243 μs (0 allocations: 0 bytes)
  4.899 μs (0 allocations: 0 bytes)

NNlib avoids max in some functions which are commonly broadcasted. Perhaps surprising results:

julia> begin
         relu6_a(x) = min(max(0, x), 6)  # naive
         relu6_b(x) = clamp(x, 0, 6)     # what NNlib does, faster
         relu6_c(x) = __min(__max(0.0, x), 6.0) 
       end;

julia> let x = randn(1000)
           y = similar(x)
           @btime $y .= relu6_a.($x)
           @btime $y .= relu6_b.($x)
           @btime $y .= relu6_c.($x)
       end;
  325.543 ns (0 allocations: 0 bytes)
  144.737 ns (0 allocations: 0 bytes)
  554.144 ns (0 allocations: 0 bytes)

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

I am aware these algorithms perform very poorly for reductions. I don't have a good explanation as to why, except that part of it is that they don't SIMD in reductions.

When looking at performance, be warned that minimum calls reduce(min, ...) (similarly for the max variants), and this goes through a specialization that does not use min at all (look for the min/max specialization of mapreduce_impl in base/reduce.jl). To evade it, use reduce((x,y)->min(x,y), ...). Also note that this specialization inadvertently results in a massive 3x slowdown for integer minimum compared to the non-specializing reduce((x,y)->min(x,y), ...).
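
For example:

using BenchmarkTools
x = randn(1000);
@btime reduce(min, $x);                 # routed through the specialized mapreduce_impl
@btime reduce((a, b) -> min(a, b), $x); # anonymous function evades the specialization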

I have a different approach (related to the isless implementation) for float minimum/maximum that is more than 4x faster than what exists now and should avoid the integer slowdown. I will finalize it and open a PR when I get a chance.

@N5N3
Member

N5N3 commented Jun 1, 2022

I have a different approach (related to the isless implementation) for float minimum/maximum that is more than 4x faster than what exists now and should avoid the integer slowdown. I will finalize it and open a PR when I get a chance.

That's great! I tried a more thorough unroll to make minimum faster in #43725, but it increases TTFP. Hope we can close #31442 this time.

@mikmoore
Contributor Author

mikmoore commented Jun 1, 2022

Well, I tried

function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)
    ifelse(isnan(diff), diff, min)
end

The branch is still there, but it saves one vmovapd %xmm1, %xmm3 (and about 10% time on my 9600). broadcast!'s speed stays the same, so I didn't check the IR. @mikmoore would you mind rechecking it?

This version yields the incorrect results __min(Inf,Inf) -> NaN and __min(-Inf,-Inf) -> NaN.
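
The problem is that x - y is NaN not only when an input is NaN but also for equal infinities, so checking isnan(diff) misclassifies those:

julia> Inf - Inf
NaN

julia> __min(Inf, Inf)  # with the isnan(diff) version; should be Inf
NaN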

@N5N3
Member

N5N3 commented Jun 2, 2022

LLVM already has an issue tracking X86's minimum/maximum: llvm/llvm-project#53353
Since the optimization is somewhat platform dependent, I believe the best choice is to use LLVM's {min,max}imum when it's available.
This should also enable auto-vectorization and make foldl(min, a) as fast as reduce(min, a).

Therefore, I think it is more important to avoid possible regressions (as reported by @mcabbott), although I can't reproduce that on my old x86...
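
For reference, a sketch of calling the intrinsic directly from Julia (assuming llvm.minimum.f64 is legal for the target; this is not part of this PR):

# LLVM intrinsics can be reached via ccall with the llvmcall calling convention
llvm_minimum(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)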

@mikmoore
Contributor Author

mikmoore commented Jun 15, 2022

Having spent some time thinking about it, I think that the __min and __max proposed by @N5N3 are the best candidates. Those versions performed better than the original versions proposed in this PR in some situations and almost as well in others. But I think that if LLVM eventually manages to produce the code I wish it would, these will be indisputably the best.

It'd be nice if we could just use minmax(x,y) = min(x,y),max(x,y) and rely on inlining to deliver full efficiency, but we'll need to see whether that happens or whether a manual definition is required.

@N5N3, you are managing #41709 which accomplishes everything this PR would. Would you like me to close this and you can move forward with that one or should I adopt your versions and continue this PR?

@N5N3
Member

N5N3 commented Jun 22, 2022

I think continuing with #41709 seems simpler, as it covers more cases. (In fact, I think it's mergeable.)
The regression on M1 looks strange to me, but we can fix it with multiversioning, like fma.
Ping @oscardssmith for advice.

@oscardssmith
Member

Let's focus on the other PR, then.
