vectorize `min/max_element` using SSE4.1 for floats #3928

AlexGuteniev · 2023-08-06T14:10:28Z

Resolves #2439

Enable for all modes. except /fp:except. Not sure yet about that one.

The /fp:strict and /fp:precise modes are fine, since the ultimate comparison is done via _mm_cmpeq_ps / _mm_cmpgt_ps, which are IEEE 754 compliant, and compare positive/negative zeros correctly as equal.

We still have unpredicted behavior on NaNs. @StephanTLavavej thinks we should not care about that. I agree:

[alg.sorting.general]/3 says "comp shall induce a strict weak ordering on the values", and with having both NaNs and non-NaNs it does not hold true
I don't imagine a practical application using it
Still, we can fix it, via _mm_cmpeq_ps / _mm_cmpeq_pd against self, and still win over the scalar implementation

I didn't use Uniform Distribution to directly form values, instead uses values from a pool that also has +/- infinity, and +/- zero -- thanks @statementreply for pointing this out when the test was updated in a preceding change.

No AVX yet. I think will do every SSE4,2 min / max, and just then look into AVX.

🏁 Perf benchmark

Without vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<float, 8021, Op::Min>          20924 ns        21310 ns        34462
bm<float, 8021, Op::Max>          20930 ns        20508 ns        32000
bm<float, 8021, Op::Both>         19330 ns        19252 ns        37333
bm<double, 8021, Op::Min>         24329 ns        23926 ns        32000
bm<double, 8021, Op::Max>         21143 ns        20926 ns        29867
bm<double, 8021, Op::Both>        18338 ns        18415 ns        37333

With vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<float, 8021, Op::Min>           3153 ns         3149 ns       213333
bm<float, 8021, Op::Max>           2544 ns         2567 ns       280000
bm<float, 8021, Op::Both>          3362 ns         3348 ns       224000
bm<double, 8021, Op::Min>          5439 ns         5469 ns       100000
bm<double, 8021, Op::Max>          5472 ns         5441 ns       112000
bm<double, 8021, Op::Both>         5725 ns         5781 ns       100000

Full results

Run on (12 X 2212.1 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)

Without vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>        21319 ns        20508 ns        32000
bm<uint8_t, 8021, Op::Max>        21082 ns        20403 ns        34462
bm<uint8_t, 8021, Op::Both>       28394 ns        28111 ns        34462
bm<uint16_t, 8021, Op::Min>       20949 ns        20647 ns        28000
bm<uint16_t, 8021, Op::Max>       18592 ns        18799 ns        40727
bm<uint16_t, 8021, Op::Both>      22267 ns        21972 ns        29867
bm<uint32_t, 8021, Op::Min>       15863 ns        16044 ns        44800
bm<uint32_t, 8021, Op::Max>       17405 ns        17439 ns        44800
bm<uint32_t, 8021, Op::Both>      18188 ns        17997 ns        37333
bm<uint64_t, 8021, Op::Min>       15894 ns        16009 ns        49778
bm<uint64_t, 8021, Op::Max>       15124 ns        15381 ns        49778
bm<uint64_t, 8021, Op::Both>      15280 ns        15381 ns        49778
bm<int8_t, 8021, Op::Min>         14517 ns        14648 ns        44800
bm<int8_t, 8021, Op::Max>         14876 ns        15067 ns        49778
bm<int8_t, 8021, Op::Both>        17899 ns        18032 ns        40727
bm<int16_t, 8021, Op::Min>        15042 ns        14997 ns        44800
bm<int16_t, 8021, Op::Max>        14740 ns        14648 ns        44800
bm<int16_t, 8021, Op::Both>       18460 ns        18799 ns        40727
bm<int32_t, 8021, Op::Min>        14978 ns        14753 ns        49778
bm<int32_t, 8021, Op::Max>        15132 ns        14997 ns        44800
bm<int32_t, 8021, Op::Both>       17335 ns        17439 ns        44800
bm<int64_t, 8021, Op::Min>        14960 ns        15067 ns        49778
bm<int64_t, 8021, Op::Max>        14861 ns        14997 ns        44800
bm<int64_t, 8021, Op::Both>       16467 ns        16497 ns        40727
bm<float, 8021, Op::Min>          20924 ns        21310 ns        34462
bm<float, 8021, Op::Max>          20930 ns        20508 ns        32000
bm<float, 8021, Op::Both>         19330 ns        19252 ns        37333
bm<double, 8021, Op::Min>         24329 ns        23926 ns        32000
bm<double, 8021, Op::Max>         21143 ns        20926 ns        29867
bm<double, 8021, Op::Both>        18338 ns        18415 ns        37333

With vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>         1289 ns         1256 ns       497778
bm<uint8_t, 8021, Op::Max>         1269 ns         1193 ns       497778
bm<uint8_t, 8021, Op::Both>        1508 ns         1475 ns       497778
bm<uint16_t, 8021, Op::Min>        2274 ns         2218 ns       373333
bm<uint16_t, 8021, Op::Max>        2116 ns         2040 ns       344615
bm<uint16_t, 8021, Op::Both>       1948 ns         1967 ns       373333
bm<uint32_t, 8021, Op::Min>        3025 ns         2982 ns       235789
bm<uint32_t, 8021, Op::Max>        3000 ns         2999 ns       224000
bm<uint32_t, 8021, Op::Both>       3423 ns         3376 ns       203636
bm<uint64_t, 8021, Op::Min>        7475 ns         7533 ns       112000
bm<uint64_t, 8021, Op::Max>        7261 ns         7394 ns       112000
bm<uint64_t, 8021, Op::Both>       9282 ns         9417 ns        74667
bm<int8_t, 8021, Op::Min>           893 ns          900 ns       746667
bm<int8_t, 8021, Op::Max>           835 ns          837 ns       896000
bm<int8_t, 8021, Op::Both>          962 ns          952 ns       640000
bm<int16_t, 8021, Op::Min>         1447 ns         1430 ns       448000
bm<int16_t, 8021, Op::Max>         1562 ns         1569 ns       448000
bm<int16_t, 8021, Op::Both>        2364 ns         2232 ns       448000
bm<int32_t, 8021, Op::Min>         3517 ns         3537 ns       172308
bm<int32_t, 8021, Op::Max>         2658 ns         2699 ns       248889
bm<int32_t, 8021, Op::Both>        2369 ns         2386 ns       248889
bm<int64_t, 8021, Op::Min>         5275 ns         5312 ns       100000
bm<int64_t, 8021, Op::Max>         5534 ns         5625 ns       100000
bm<int64_t, 8021, Op::Both>        6732 ns         6696 ns       112000
bm<float, 8021, Op::Min>           3153 ns         3149 ns       213333
bm<float, 8021, Op::Max>           2544 ns         2567 ns       280000
bm<float, 8021, Op::Both>          3362 ns         3348 ns       224000
bm<double, 8021, Op::Min>          5439 ns         5469 ns       100000
bm<double, 8021, Op::Max>          5472 ns         5441 ns       112000
bm<double, 8021, Op::Both>         5725 ns         5781 ns       100000

Resolves microsoft#2439

and remove extra coverage

See microsoft#3931

Keep the fix within minmax_element.cpp, though.

StephanTLavavej · 2024-02-05T18:18:41Z

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

CaseyCarter · 2024-02-05T22:31:55Z

stl/src/vector_algorithms.cpp

@@ -1089,6 +1351,16 @@ const void* __stdcall __std_min_element_8(
    return _Minmax_element<_Mode_min, _Minmax_traits_8>(_First, _Last, _Signed);
 }

+const void* __stdcall __std_min_element_f(
+    const void* const _First, const void* const _Last, const bool _Unused) noexcept {
+    return _Minmax_element<_Mode_min, _Minmax_traits_f>(_First, _Last, _Unused);


I have no idea why we have these _Unused arguments which are, in fact, used. This seems confusing enough to warrant an explanation in the source. I see that _Minmax_element needs a third argument, but I don't get why we need to pass a false in from the header-only code, nor why we need to flow that argument into _Minmax_element instead of passing a literal false here to enable constant propagation.

Is this intended for future use? This is always-statically-linked code so there's no need to allow for old binaries to work with new headers AFAICS.

EDIT: Ah, I see. A comment here directing the reader to the explanation on _Minmax_element would be nice, but totally not worth resetting testing.

I went through the same journey of discovery, I can add a small comment in a followup PR.

StephanTLavavej · 2024-02-06T10:58:06Z

Thanks for maximizing the performance and minimizing the time taken for these algorithms with everyone's favorite types! 🚀 😻 🎁

jschroedl · 2024-06-17T19:10:57Z

Re: handling NaNs. This change has broken code which does pass NAN in the data. Applications can use this to indicate "missing" data. In the past, they were skipped. Now, it is a crash. We can craft a custom Compare function for std::min_element or for std::max_element but not for std::minmax_element.

AlexGuteniev · 2024-06-17T21:15:50Z

This change has broken code which does pass NAN in the data

Can you make a more detailed report, with your data, expected results, actual results.
For now I'm not sure whether it is UB, or a valid case.

AlexGuteniev · 2024-06-18T06:49:28Z

@jschroedl , I've managed to recreate this on randomized test.

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Note tha mixing NaN with few other numbers in input is an undefined behavior, as such input violates strict weak ordering.
All NaN input or NaN and only one other value input appears to be valid though.

frederick-vs-ja · 2024-06-18T07:11:15Z

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

AlexGuteniev · 2024-06-18T07:23:11Z

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

Isn't [alg.sorting]/3 covering this on a higher level?

AlexGuteniev · 2024-06-18T08:12:17Z

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Or you can define _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 to skip only this vectorization, but keep others.

frederick-vs-ja · 2024-06-18T08:33:08Z

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

Isn't [alg.sorting]/3 covering this on a higher level?

Thanks for correction, and it's sure that UB is also present std::meow_element.
(Off-topic: it's may be a defect that currently [alg.sorting.general]/3 doesn't seem applied to ranges versions of algorithms in [alg.binary.search]. Thanks for notification!)

jschroedl · 2024-06-18T13:23:13Z

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Or you can define _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 to skip only this vectorization, but keep others.

The scenario of having NaN in collections of double is not so far-fetched as it can be good indications during telemetry or data collection as a sensor going offline, for example. I'm sure you can imagine other scenarios. We have used NaN to indicate 'missing' data for many many years and portably (mac, win) have been able to use these algorithms and have NaN skipped so our use of this is historical. Clearly, we'll need to revisit this usage.

As for the crash, a simple test case I have is the following will crash now whereas prior it would return 0, 99 for min, max.

   std::vector<double> data { 0, 1, NAN, 98, 99 };
   double dMin = *std::min_element(begin(data),end(data));
   std::cout << "\nMin is " << dMin;

   double dMax = *std::max_element(begin(data), end(data));
   std::cout << "\nMax is " << dMax;

Thank you for the workaround flag, I will investigate using that.

AlexGuteniev · 2024-06-18T14:09:29Z

Thanks for providing the test case.
Looks like the circumstances of the bug is different from what I thought initially - it crashes inside the algorithm instead of returning a bogus index.

Note that your repro has UB, as @frederick-vs-ja and me have discussed just above.

The non-UB equivalent repro looks like this to me:

#include <iostream>
#include <vector>
#include <limits>

int main()
{

	std::vector<double> data{ 1, 1, NAN, 1, 1 };
	double dMin = *std::min_element(begin(data), end(data));
	std::cout << "\nMin is " << dMin;

	double dMax = *std::max_element(begin(data), end(data));
	std::cout << "\nMax is " << dMax;
}

I intend to fix this to output the following:

Min is 1
Max is 1

But I do not intend to fix your example, it may or may not be fixed.

As a proper fix on your side, I suggest having hand-written loops instead of using STL algorithms to avoid undefined behavior.
As a quick fix, defining _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 should do.

AlexGuteniev · 2024-06-20T18:52:27Z

@jschroedl , the issue you reported resloved as wontfix -- see #4731

However it made me to look into +0.0 / -0.0 situation, for which I found and fixed bugs in #4734 , and it happens to fix your repro as well, though it might not fix all cases, as I didn't do thorough testing.

matthiasstraka · 2024-08-01T20:43:33Z

@AlexGuteniev, what makes things worse with crashes on NaNs is that it changes existing behavior (<= MSVC 17.9) and causes a crash that can't even be caught via a catch (...) (see https://gcc.godbolt.org/z/qzveTd8oK). We had an issue with std::minmax_element that suddenly crashed an otherwise working program after an upgrade to 17.10 and if this change persists we need to re-evaluate any *_element code that handles user data (which may well contain NaNs and is valid data in our case).
We were lucky to find this issue during testing, but it may have easily be overlooked and first appeared at a customer - maybe in a critical use case. IMHO, an optimization that breaks existing behavior and potentially crashes should only be allowed via explicit compile-time flags/defines. Maybe it should require to use #define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1 explicitly.

AlexGuteniev · 2024-08-02T05:27:25Z

@matthiasstraka ,

I admit I didn't expect crashes on NANs when doing this change, only wrong results.
Still I'm not sure if the optimization should be disabled or removed, and I don't have the authority to decide this.

Generally, optimizations indeed shouldn't break existing code, especially into a runtime crash.
However, if the existing code was incorrect and just happened to work, then breaking it may be acceptable.

This reminds me of a precedent when on one of platforms memcpy was defined as memmove, and then there was an optimization to take advantage of non-oevrlapped ranges, and it broke uses of memcpy with overlapped ranges. The optimization wasn't incorrect, as memcpy is specifically distinct from memmove to enable such optimizations.

The case with _element is only different from the memcpy precedent is only in that I don't know if the undefined behavior for non-strict-weak-ordered values exists to enable optimization, or for other reasons. But then again, there are precedents of using undefined behavior existing for other reasons, like signed integer overflow optimizations.

So I generally don't think that the breakage of some not well defined code necessarily justifies the revert of the optimization.

Still I think in some cases breaking of not well defined code may be fixed:

If the break is massive, everyone was doing the right thing, then maybe should avoid such impact, and keep the existing code working.
If it is possible to achieve the desired or close to the desired effect without breaking the old code, maybe should do that

For 1 I don't have figures. So far we have at least few complains. Is this massive? Probably not yet.
For 2 -- I think it may be possible.

What currently prevents me from submitting a fix which keeps the optimization, but makes NAN values don't crash it is:

The maintainers' position on the issue <algorithm>: minmax family are returning wrong match index and may crash #4731 (comment)

We talked about this at the weekly maintainer meeting and decided that the cases of all-nans or all-nans-except-one-plain-value aren't worth worrying about, and the Standard is vague on whether they're UB.

I find this reasonable from some point, because it doesn't encourage not well defined code and avoid spending resources to non-goals
The pull request for +0.0 / -0.0 fix Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only #4734 is in review. and before it is accepted or rejected, it is counter-productive to work on similar changes is the same area, having to duplicate some pieces, and creating conflicts to resolve. Note that currently the +0.0 / - 0.0 fix as it is currently written may fix the NAN crash too. It fixes at least some of occurrences, I just don't have exhaustive coverage to say for sure that it fixes all of them.

matthiasstraka · 2024-08-02T06:58:24Z

Since we are working with the C++17 standard at the moment, I looked up the C++17 standard-definition of max_element (§ 28.7.8):

template constexpr ForwardIterator max_element(ForwardIterator first, ForwardIterator last);
Returns: The first iterator i in the range [first, last) such that for every iterator j in the range [first, last) the following corresponding conditions hold: !(*i < *j) or comp(*i, *j) == false. Returns last if first == last.

This of course has the consequence, that if the first element of the range is NaN, a valid max cannot be found. But it is nevertheless well-defined behavior and should not lead to unexpected crashes. I'm not sure if C++20 allows to change that by requiring the indirect_strict_weak_order concept. But as I see it, C++17 and earlier does not require us to expect undefined behavior.

AlexGuteniev · 2024-08-02T08:07:17Z

It is undefined behavior according to C++17 you linked as well.
See "28.7 Sorting and related operations", paragraphs 1 though 4.

vectorize min/max_element using SSE4.1/AVX for floats

cd7420a

Resolves microsoft#2439

AlexGuteniev requested a review from a team as a code owner August 6, 2023 14:10

AlexGuteniev added 4 commits August 6, 2023 17:36

const

80b561c

reverse init

86b3100

format

55c5add

more interesting values

c88fe9a

This comment was marked as resolved.

Sign in to view

AlexGuteniev added 3 commits August 6, 2023 18:41

fix x68 build

8f70406

format

52c8191

coverage

af9c9b5

This comment was marked as resolved.

Sign in to view

AlexGuteniev added 2 commits August 6, 2023 19:46

ouch

ff97c5b

format

90b7999

StephanTLavavej added the performance Must go faster label Aug 6, 2023

This comment was marked as resolved.

Sign in to view

include /fp:strict / /fp:precise

568a793

and remove extra coverage

AlexGuteniev added 3 commits August 6, 2023 23:38

-extra casts

7f2a635

copypaste error

f6b24a9

more interesting input

91ed8a3

AlexGuteniev mentioned this pull request Aug 7, 2023

Are long doubles with different implementations supported? #3931

Closed

AlexGuteniev added 3 commits August 7, 2023 12:46

Unsupport 80-bit long double

46baf22

See microsoft#3931

+benchmark

83d208f

include order

1be219f

This comment was marked as resolved.

Sign in to view

AlexGuteniev added 3 commits August 7, 2023 16:55

-copy

99b1746

simplify benchmark

58fc6b9

fix build

eb388ad

This comment was marked as resolved.

Sign in to view

StephanTLavavej approved these changes Jan 31, 2024

View reviewed changes

This comment was marked as resolved.

Sign in to view

Revert enabling warnings for benchmarks.

358dd22

Keep the fix within minmax_element.cpp, though.

StephanTLavavej approved these changes Jan 31, 2024

View reviewed changes

StephanTLavavej mentioned this pull request Jan 31, 2024

Enable warnings for the benchmarks, and fix them #4356

Merged

StephanTLavavej self-assigned this Feb 5, 2024

CaseyCarter approved these changes Feb 6, 2024

View reviewed changes

StephanTLavavej merged commit 192a840 into microsoft:main Feb 6, 2024
35 checks passed

AlexGuteniev deleted the guess_whos_back branch February 6, 2024 11:19

StephanTLavavej mentioned this pull request Feb 15, 2024

Toolset update: VS 2022 17.10 Preview 1 #4392

Merged

AlexGuteniev mentioned this pull request Jun 18, 2024

<algorithm>: minmax family are returning wrong match index and may crash #4731

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vectorize `min/max_element` using SSE4.1 for floats #3928

vectorize `min/max_element` using SSE4.1 for floats #3928

AlexGuteniev commented Aug 6, 2023 •

edited

Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

StephanTLavavej commented Feb 5, 2024

CaseyCarter Feb 5, 2024

StephanTLavavej Feb 6, 2024

StephanTLavavej commented Feb 6, 2024

jschroedl commented Jun 17, 2024

AlexGuteniev commented Jun 17, 2024

AlexGuteniev commented Jun 18, 2024

frederick-vs-ja commented Jun 18, 2024

AlexGuteniev commented Jun 18, 2024

AlexGuteniev commented Jun 18, 2024

frederick-vs-ja commented Jun 18, 2024

jschroedl commented Jun 18, 2024 •

edited

Loading

AlexGuteniev commented Jun 18, 2024

AlexGuteniev commented Jun 20, 2024

matthiasstraka commented Aug 1, 2024

AlexGuteniev commented Aug 2, 2024

matthiasstraka commented Aug 2, 2024

AlexGuteniev commented Aug 2, 2024

vectorize min/max_element using SSE4.1 for floats #3928

vectorize min/max_element using SSE4.1 for floats #3928

Conversation

AlexGuteniev commented Aug 6, 2023 • edited Loading

🏁 Perf benchmark

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

StephanTLavavej commented Feb 5, 2024

CaseyCarter Feb 5, 2024

Choose a reason for hiding this comment

StephanTLavavej Feb 6, 2024

Choose a reason for hiding this comment

StephanTLavavej commented Feb 6, 2024

jschroedl commented Jun 17, 2024

AlexGuteniev commented Jun 17, 2024

AlexGuteniev commented Jun 18, 2024

frederick-vs-ja commented Jun 18, 2024

AlexGuteniev commented Jun 18, 2024

AlexGuteniev commented Jun 18, 2024

frederick-vs-ja commented Jun 18, 2024

jschroedl commented Jun 18, 2024 • edited Loading

AlexGuteniev commented Jun 18, 2024

AlexGuteniev commented Jun 20, 2024

matthiasstraka commented Aug 1, 2024

AlexGuteniev commented Aug 2, 2024

matthiasstraka commented Aug 2, 2024

AlexGuteniev commented Aug 2, 2024

vectorize `min/max_element` using SSE4.1 for floats #3928

vectorize `min/max_element` using SSE4.1 for floats #3928

AlexGuteniev commented Aug 6, 2023 •

edited

Loading

jschroedl commented Jun 18, 2024 •

edited

Loading