Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vectorize min/max_element using SSE4.1 for floats #3928

Merged
merged 43 commits into from
Feb 6, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Aug 6, 2023

Resolves #2439

Enable for all modes. except /fp:except. Not sure yet about that one.

The /fp:strict and /fp:precise modes are fine, since the ultimate comparison is done via _mm_cmpeq_ps / _mm_cmpgt_ps, which are IEEE 754 compliant, and compare positive/negative zeros correctly as equal.

We still have unpredicted behavior on NaNs. @StephanTLavavej thinks we should not care about that. I agree:

  • [alg.sorting.general]/3 says "comp shall induce a strict weak ordering on the values", and with having both NaNs and non-NaNs it does not hold true
  • I don't imagine a practical application using it
  • Still, we can fix it, via _mm_cmpeq_ps / _mm_cmpeq_pd against self, and still win over the scalar implementation

I didn't use Uniform Distribution to directly form values, instead uses values from a pool that also has +/- infinity, and +/- zero -- thanks @statementreply for pointing this out when the test was updated in a preceding change.

No AVX yet. I think will do every SSE4,2 min / max, and just then look into AVX.

🏁 Perf benchmark

Without vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<float, 8021, Op::Min>          20924 ns        21310 ns        34462
bm<float, 8021, Op::Max>          20930 ns        20508 ns        32000
bm<float, 8021, Op::Both>         19330 ns        19252 ns        37333
bm<double, 8021, Op::Min>         24329 ns        23926 ns        32000
bm<double, 8021, Op::Max>         21143 ns        20926 ns        29867
bm<double, 8021, Op::Both>        18338 ns        18415 ns        37333

With vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<float, 8021, Op::Min>           3153 ns         3149 ns       213333
bm<float, 8021, Op::Max>           2544 ns         2567 ns       280000
bm<float, 8021, Op::Both>          3362 ns         3348 ns       224000
bm<double, 8021, Op::Min>          5439 ns         5469 ns       100000
bm<double, 8021, Op::Max>          5472 ns         5441 ns       112000
bm<double, 8021, Op::Both>         5725 ns         5781 ns       100000
Full results
Run on (12 X 2212.1 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)

Without vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>        21319 ns        20508 ns        32000
bm<uint8_t, 8021, Op::Max>        21082 ns        20403 ns        34462
bm<uint8_t, 8021, Op::Both>       28394 ns        28111 ns        34462
bm<uint16_t, 8021, Op::Min>       20949 ns        20647 ns        28000
bm<uint16_t, 8021, Op::Max>       18592 ns        18799 ns        40727
bm<uint16_t, 8021, Op::Both>      22267 ns        21972 ns        29867
bm<uint32_t, 8021, Op::Min>       15863 ns        16044 ns        44800
bm<uint32_t, 8021, Op::Max>       17405 ns        17439 ns        44800
bm<uint32_t, 8021, Op::Both>      18188 ns        17997 ns        37333
bm<uint64_t, 8021, Op::Min>       15894 ns        16009 ns        49778
bm<uint64_t, 8021, Op::Max>       15124 ns        15381 ns        49778
bm<uint64_t, 8021, Op::Both>      15280 ns        15381 ns        49778
bm<int8_t, 8021, Op::Min>         14517 ns        14648 ns        44800
bm<int8_t, 8021, Op::Max>         14876 ns        15067 ns        49778
bm<int8_t, 8021, Op::Both>        17899 ns        18032 ns        40727
bm<int16_t, 8021, Op::Min>        15042 ns        14997 ns        44800
bm<int16_t, 8021, Op::Max>        14740 ns        14648 ns        44800
bm<int16_t, 8021, Op::Both>       18460 ns        18799 ns        40727
bm<int32_t, 8021, Op::Min>        14978 ns        14753 ns        49778
bm<int32_t, 8021, Op::Max>        15132 ns        14997 ns        44800
bm<int32_t, 8021, Op::Both>       17335 ns        17439 ns        44800
bm<int64_t, 8021, Op::Min>        14960 ns        15067 ns        49778
bm<int64_t, 8021, Op::Max>        14861 ns        14997 ns        44800
bm<int64_t, 8021, Op::Both>       16467 ns        16497 ns        40727
bm<float, 8021, Op::Min>          20924 ns        21310 ns        34462
bm<float, 8021, Op::Max>          20930 ns        20508 ns        32000
bm<float, 8021, Op::Both>         19330 ns        19252 ns        37333
bm<double, 8021, Op::Min>         24329 ns        23926 ns        32000
bm<double, 8021, Op::Max>         21143 ns        20926 ns        29867
bm<double, 8021, Op::Both>        18338 ns        18415 ns        37333

With vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>         1289 ns         1256 ns       497778
bm<uint8_t, 8021, Op::Max>         1269 ns         1193 ns       497778
bm<uint8_t, 8021, Op::Both>        1508 ns         1475 ns       497778
bm<uint16_t, 8021, Op::Min>        2274 ns         2218 ns       373333
bm<uint16_t, 8021, Op::Max>        2116 ns         2040 ns       344615
bm<uint16_t, 8021, Op::Both>       1948 ns         1967 ns       373333
bm<uint32_t, 8021, Op::Min>        3025 ns         2982 ns       235789
bm<uint32_t, 8021, Op::Max>        3000 ns         2999 ns       224000
bm<uint32_t, 8021, Op::Both>       3423 ns         3376 ns       203636
bm<uint64_t, 8021, Op::Min>        7475 ns         7533 ns       112000
bm<uint64_t, 8021, Op::Max>        7261 ns         7394 ns       112000
bm<uint64_t, 8021, Op::Both>       9282 ns         9417 ns        74667
bm<int8_t, 8021, Op::Min>           893 ns          900 ns       746667
bm<int8_t, 8021, Op::Max>           835 ns          837 ns       896000
bm<int8_t, 8021, Op::Both>          962 ns          952 ns       640000
bm<int16_t, 8021, Op::Min>         1447 ns         1430 ns       448000
bm<int16_t, 8021, Op::Max>         1562 ns         1569 ns       448000
bm<int16_t, 8021, Op::Both>        2364 ns         2232 ns       448000
bm<int32_t, 8021, Op::Min>         3517 ns         3537 ns       172308
bm<int32_t, 8021, Op::Max>         2658 ns         2699 ns       248889
bm<int32_t, 8021, Op::Both>        2369 ns         2386 ns       248889
bm<int64_t, 8021, Op::Min>         5275 ns         5312 ns       100000
bm<int64_t, 8021, Op::Max>         5534 ns         5625 ns       100000
bm<int64_t, 8021, Op::Both>        6732 ns         6696 ns       112000
bm<float, 8021, Op::Min>           3153 ns         3149 ns       213333
bm<float, 8021, Op::Max>           2544 ns         2567 ns       280000
bm<float, 8021, Op::Both>          3362 ns         3348 ns       224000
bm<double, 8021, Op::Min>          5439 ns         5469 ns       100000
bm<double, 8021, Op::Max>          5472 ns         5441 ns       112000
bm<double, 8021, Op::Both>         5725 ns         5781 ns       100000

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner August 6, 2023 14:10
@AlexGuteniev

This comment was marked as resolved.

@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Aug 6, 2023
@StephanTLavavej

This comment was marked as resolved.

and remove extra coverage
@AlexGuteniev

This comment was marked as resolved.

@frederick-vs-ja

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

Keep the fix within minmax_element.cpp, though.
@StephanTLavavej
Copy link
Member

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@@ -1089,6 +1351,16 @@ const void* __stdcall __std_min_element_8(
return _Minmax_element<_Mode_min, _Minmax_traits_8>(_First, _Last, _Signed);
}

const void* __stdcall __std_min_element_f(
const void* const _First, const void* const _Last, const bool _Unused) noexcept {
return _Minmax_element<_Mode_min, _Minmax_traits_f>(_First, _Last, _Unused);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why we have these _Unused arguments which are, in fact, used. This seems confusing enough to warrant an explanation in the source. I see that _Minmax_element needs a third argument, but I don't get why we need to pass a false in from the header-only code, nor why we need to flow that argument into _Minmax_element instead of passing a literal false here to enable constant propagation.

Is this intended for future use? This is always-statically-linked code so there's no need to allow for old binaries to work with new headers AFAICS.

EDIT: Ah, I see. A comment here directing the reader to the explanation on _Minmax_element would be nice, but totally not worth resetting testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the same journey of discovery, I can add a small comment in a followup PR.

@StephanTLavavej StephanTLavavej merged commit 192a840 into microsoft:main Feb 6, 2024
35 checks passed
@StephanTLavavej
Copy link
Member

Thanks for maximizing the performance and minimizing the time taken for these algorithms with everyone's favorite types! 🚀 😻 🎁

@jschroedl
Copy link

Re: handling NaNs. This change has broken code which does pass NAN in the data. Applications can use this to indicate "missing" data. In the past, they were skipped. Now, it is a crash. We can craft a custom Compare function for std::min_element or for std::max_element but not for std::minmax_element.

@AlexGuteniev
Copy link
Contributor Author

This change has broken code which does pass NAN in the data

Can you make a more detailed report, with your data, expected results, actual results.
For now I'm not sure whether it is UB, or a valid case.

@AlexGuteniev
Copy link
Contributor Author

@jschroedl , I've managed to recreate this on randomized test.

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Note tha mixing NaN with few other numbers in input is an undefined behavior, as such input violates strict weak ordering.
All NaN input or NaN and only one other value input appears to be valid though.

@frederick-vs-ja
Copy link
Contributor

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

@AlexGuteniev
Copy link
Contributor Author

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

Isn't [alg.sorting]/3 covering this on a higher level?

@AlexGuteniev
Copy link
Contributor Author

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Or you can define _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 to skip only this vectorization, but keep others.

@frederick-vs-ja
Copy link
Contributor

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

Isn't [alg.sorting]/3 covering this on a higher level?

Thanks for correction, and it's sure that UB is also present std::meow_element.
(Off-topic: it's may be a defect that currently [alg.sorting.general]/3 doesn't seem applied to ranges versions of algorithms in [alg.binary.search]. Thanks for notification!)

@jschroedl
Copy link

jschroedl commented Jun 18, 2024

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Or you can define _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 to skip only this vectorization, but keep others.

The scenario of having NaN in collections of double is not so far-fetched as it can be good indications during telemetry or data collection as a sensor going offline, for example. I'm sure you can imagine other scenarios. We have used NaN to indicate 'missing' data for many many years and portably (mac, win) have been able to use these algorithms and have NaN skipped so our use of this is historical. Clearly, we'll need to revisit this usage.

As for the crash, a simple test case I have is the following will crash now whereas prior it would return 0, 99 for min, max.

   std::vector<double> data { 0, 1, NAN, 98, 99 };
   double dMin = *std::min_element(begin(data),end(data));
   std::cout << "\nMin is " << dMin;

   double dMax = *std::max_element(begin(data), end(data));
   std::cout << "\nMax is " << dMax;

Thank you for the workaround flag, I will investigate using that.

@AlexGuteniev
Copy link
Contributor Author

Thanks for providing the test case.
Looks like the circumstances of the bug is different from what I thought initially - it crashes inside the algorithm instead of returning a bogus index.

Note that your repro has UB, as @frederick-vs-ja and me have discussed just above.

The non-UB equivalent repro looks like this to me:

#include <iostream>
#include <vector>
#include <limits>

int main()
{

	std::vector<double> data{ 1, 1, NAN, 1, 1 };
	double dMin = *std::min_element(begin(data), end(data));
	std::cout << "\nMin is " << dMin;

	double dMax = *std::max_element(begin(data), end(data));
	std::cout << "\nMax is " << dMax;
}

I intend to fix this to output the following:

Min is 1
Max is 1

But I do not intend to fix your example, it may or may not be fixed.

As a proper fix on your side, I suggest having hand-written loops instead of using STL algorithms to avoid undefined behavior.
As a quick fix, defining _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 should do.

@AlexGuteniev
Copy link
Contributor Author

@jschroedl , the issue you reported resloved as wontfix -- see #4731

However it made me to look into +0.0 / -0.0 situation, for which I found and fixed bugs in #4734 , and it happens to fix your repro as well, though it might not fix all cases, as I didn't do thorough testing.

@matthiasstraka
Copy link

@AlexGuteniev, what makes things worse with crashes on NaNs is that it changes existing behavior (<= MSVC 17.9) and causes a crash that can't even be caught via a catch (...) (see https://gcc.godbolt.org/z/qzveTd8oK). We had an issue with std::minmax_element that suddenly crashed an otherwise working program after an upgrade to 17.10 and if this change persists we need to re-evaluate any *_element code that handles user data (which may well contain NaNs and is valid data in our case).
We were lucky to find this issue during testing, but it may have easily be overlooked and first appeared at a customer - maybe in a critical use case. IMHO, an optimization that breaks existing behavior and potentially crashes should only be allowed via explicit compile-time flags/defines. Maybe it should require to use #define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1 explicitly.

@AlexGuteniev
Copy link
Contributor Author

@matthiasstraka ,

I admit I didn't expect crashes on NANs when doing this change, only wrong results.
Still I'm not sure if the optimization should be disabled or removed, and I don't have the authority to decide this.


Generally, optimizations indeed shouldn't break existing code, especially into a runtime crash.
However, if the existing code was incorrect and just happened to work, then breaking it may be acceptable.

This reminds me of a precedent when on one of platforms memcpy was defined as memmove, and then there was an optimization to take advantage of non-oevrlapped ranges, and it broke uses of memcpy with overlapped ranges. The optimization wasn't incorrect, as memcpy is specifically distinct from memmove to enable such optimizations.

The case with _element is only different from the memcpy precedent is only in that I don't know if the undefined behavior for non-strict-weak-ordered values exists to enable optimization, or for other reasons. But then again, there are precedents of using undefined behavior existing for other reasons, like signed integer overflow optimizations.

So I generally don't think that the breakage of some not well defined code necessarily justifies the revert of the optimization.


Still I think in some cases breaking of not well defined code may be fixed:

  1. If the break is massive, everyone was doing the right thing, then maybe should avoid such impact, and keep the existing code working.
  2. If it is possible to achieve the desired or close to the desired effect without breaking the old code, maybe should do that

For 1 I don't have figures. So far we have at least few complains. Is this massive? Probably not yet.
For 2 -- I think it may be possible.

What currently prevents me from submitting a fix which keeps the optimization, but makes NAN values don't crash it is:

  • The maintainers' position on the issue <algorithm>: minmax family are returning wrong match index and may crash #4731 (comment)

    We talked about this at the weekly maintainer meeting and decided that the cases of all-nans or all-nans-except-one-plain-value aren't worth worrying about, and the Standard is vague on whether they're UB.

    I find this reasonable from some point, because it doesn't encourage not well defined code and avoid spending resources to non-goals

  • The pull request for +0.0 / -0.0 fix Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only #4734 is in review. and before it is accepted or rejected, it is counter-productive to work on similar changes is the same area, having to duplicate some pieces, and creating conflicts to resolve. Note that currently the +0.0 / - 0.0 fix as it is currently written may fix the NAN crash too. It fixes at least some of occurrences, I just don't have exhaustive coverage to say for sure that it fixes all of them.

@matthiasstraka
Copy link

Since we are working with the C++17 standard at the moment, I looked up the C++17 standard-definition of max_element (§ 28.7.8):

template constexpr ForwardIterator max_element(ForwardIterator first, ForwardIterator last);
Returns: The first iterator i in the range [first, last) such that for every iterator j in the range [first, last) the following corresponding conditions hold: !(*i < *j) or comp(*i, *j) == false. Returns last if first == last.

This of course has the consequence, that if the first element of the range is NaN, a valid max cannot be found. But it is nevertheless well-defined behavior and should not lead to unexpected crashes. I'm not sure if C++20 allows to change that by requiring the indirect_strict_weak_order concept. But as I see it, C++17 and earlier does not require us to expect undefined behavior.

@AlexGuteniev
Copy link
Contributor Author

It is undefined behavior according to C++17 you linked as well.
See "28.7 Sorting and related operations", paragraphs 1 though 4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX for floats
6 participants