-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<algorithm>: Optimize sample() and shuffle() with Lemire's algorithm
#5735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<algorithm>: Optimize sample() and shuffle() with Lemire's algorithm
#5735
Conversation
|
12th Gen Intel(R) Core(TM) i5-1235U (1.30 GHz) P-cores
E-cores
|
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Benchmark | Before | After | Speedup ------------------------------------------------------|------------|------------|-------- `bm_sample<uint8_t, alg_type::std_fn>/1048576/32768` | 4395155 ns | 3719749 ns | 1.18 `bm_sample<uint16_t, alg_type::std_fn>/1048576/32768` | 4422082 ns | 3740169 ns | 1.18 `bm_sample<uint32_t, alg_type::std_fn>/1048576/32768` | 4429234 ns | 3545233 ns | 1.25 `bm_sample<uint64_t, alg_type::std_fn>/1048576/32768` | 4421113 ns | 3564755 ns | 1.24 `bm_sample<uint8_t, alg_type::rng>/1048576/32768` | 4407065 ns | 3926217 ns | 1.12 `bm_sample<uint16_t, alg_type::rng>/1048576/32768` | 4408346 ns | 3531568 ns | 1.25 `bm_sample<uint32_t, alg_type::rng>/1048576/32768` | 4409194 ns | 3527145 ns | 1.25 `bm_sample<uint64_t, alg_type::rng>/1048576/32768` | 4454799 ns | 3556353 ns | 1.25 ------------------------------------------------------|------------|------------|----- `bm_shuffle<uint8_t, alg_type::std_fn>/1048576` | 5750192 ns | 3733152 ns | 1.54 `bm_shuffle<uint16_t, alg_type::std_fn>/1048576` | 5907846 ns | 4186912 ns | 1.41 `bm_shuffle<uint32_t, alg_type::std_fn>/1048576` | 6043149 ns | 4383343 ns | 1.38 `bm_shuffle<uint64_t, alg_type::std_fn>/1048576` | 6117446 ns | 4582720 ns | 1.33 `bm_shuffle<uint8_t, alg_type::rng>/1048576` | 5701609 ns | 3719451 ns | 1.53 `bm_shuffle<uint16_t, alg_type::rng>/1048576` | 5735311 ns | 4229480 ns | 1.36 `bm_shuffle<uint32_t, alg_type::rng>/1048576` | 5771077 ns | 4391066 ns | 1.31 `bm_shuffle<uint64_t, alg_type::rng>/1048576` | 5794527 ns | 4587342 ns | 1.26
…local constant.
…)`, `shuffle()`, `ranges::sample`, `ranges::shuffle`.
This was comparing `_Rem < _Index`, where `_Rem` is `_Udiff & _Udiff` (almost always `_Udiff`) while we have potentially signed `_Diff _Index`.
… __int64', signed/unsigned mismatch `_Urng::result_type` must be unsigned, but `result_type - result_type` performs the usual arithmetic conversions, promoting tiny types to int. Therefore, we need to `static_cast<_Udiff>`.
…quires a narrowing conversion Except for iterators with exotic difference types: For x64 algorithms, `_Diff` is naturally 64-bit, so `is_same_v<_Udiff, uint64_t>` is naturally taken. But for x86 algorithms, `_Diff` is naturally 32-bit. If the URBG is 32-bit (or smaller), then `_Udiff` will be `uint32_t`, and the `else` branch will be taken. In that case, the error correctly complains that we're using braces to convert from signed `_Diff _Index` to unsigned `_Uprod`, which is narrowing. We should directly `static_cast` instead.
Again, `(_Urng::max) () - (_Urng::min) ()` is `result_type - result_type`, subject to the usual arithmetic conversions. When compared against `_Udiff _Bmask_local`, this can emit signed/unsigned mismatch warnings. (This repro is x86-specific because MSVC emits warning C4018 at level 3 for `int < unsigned int`. `int < unsigned long long` emits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of `_Udiff` causes the warning to be x86-specific.) I'm avoiding extracting the `max - min` as a constant to avoid VSO-2580691 "`/analyze` emits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".
519a98d to
01090a7
Compare
This comment was marked as resolved.
This comment was marked as resolved.
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
davidmrdavid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🦕
Left 2 small comments/questions, non-blocking.
| using result_type = uint16_t; // N5014 [rand.req.urng]/3 | ||
| static constexpr result_type min() { | ||
| return 3; | ||
| } | ||
| static constexpr bool max() { | ||
| return true; | ||
| static constexpr result_type max() { | ||
| return 1729; | ||
| } | ||
| bool operator()() & { | ||
| return false; | ||
| result_type operator()() & { | ||
| return 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be triple sure - these numbers are completely arbitrary, right? I'd be a fan of calling that out, but I won't insist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a compile-time only test, and the actual values involved are not relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a note to add comments to this generator in a followup PR.
🗺️ Overview
Exactly 3 years ago, @MattStephanson's #3012 implemented @lemire's algorithm "Fast Random Integer Generation in an Interval" for TR1
uniform_intand Standarduniform_int_distribution, shipped in VS 2022 17.5.This PR extends the use of the algorithm to
sample(),shuffle(),ranges::sample, andranges::shuffle. It's a behavioral change (the results of sampling and shuffling will be different), but the significant speedup is worth it.We have to fix some warnings in
_Rng_from_urng_v2because it was never previously exposed to signed difference types.My #5712 very recently split TR1
uniform_intfrom Standarduniform_int_distribution. The former still switches between my old_Rng_from_urngand the new Lemire-powered_Rng_from_urng_v2depending on whether it senses the presence ofstatic constexprmin()andmax()from the engine, as Lemire's algorithm greatly benefits from having them as compile-time constants. After #5712, the Standard distribution now assumes it's being used with Standard engines (as it would be highly inconsistent for a program to be mixing the Standard distribution with a TR1 engine; both spellings should be updated simultaneously).As
shuffle()was C++11 andsample()was C++17, it is slightly more conceivable for those algorithms to be used with old TR1 engines. For the moment, to preserve source compatibility, I am using the same "switch between v1 and v2" logic that TR1uniform_intuses (and that Standarduniform_int_distributionindirectly used between #3012 and #5712).In the near future, I plan to purge TR1 entirely, making use of v2 unconditional, but I want to land this improvement separately.
⚙️ Commits
sample()andshuffle()._Rng_from_urng_v2up to<algorithm>, no other changes._Has_static_min_maxup to<algorithm>, no other changes._Rng_from_urng_v1_or_v2._Rng_from_urng_v2:static constexpr=>constexprfor a local constant.min()andmax()must beconstexpr.result_type, must be unsigned integer._Rng_from_urng_v1_or_v2to optimizesample(),shuffle(),ranges::sample,ranges::shuffle._Rem < _Index, where_Remis_Udiff & _Udiff(almost always_Udiff) while we have potentially signed_Diff _Index.'int'to'unsigned __int64', signed/unsigned mismatch_Urng::result_typemust be unsigned, butresult_type - result_typeperforms the usual arithmetic conversions, promoting tiny types toint. Therefore, we need tostatic_cast<_Udiff>.'const int'to'unsigned __int64'requires a narrowing conversion_Diffis naturally 64-bit, sois_same_v<_Udiff, uint64_t>is naturally taken._Diffis naturally 32-bit. If the URBG is 32-bit (or smaller), then_Udiffwill beuint32_t, and theelsebranch will be taken._Diff _Indexto unsigned_Uprod, which is narrowing. We should directlystatic_castinstead.(_Urng::max) () - (_Urng::min) ()isresult_type - result_type, subject to the usual arithmetic conversions. When compared against_Udiff _Bmask_local, this can emit signed/unsigned mismatch warnings.int < unsigned int.int < unsigned long longemits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of_Udiffcauses the warning to be x86-specific.)max - minas a constant to avoid VSO-2580691 "/analyzeemits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".⏱️ Benchmark results
On my 5950X:
bm_sample<uint8_t, alg_type::std_fn>/1048576/32768bm_sample<uint16_t, alg_type::std_fn>/1048576/32768bm_sample<uint32_t, alg_type::std_fn>/1048576/32768bm_sample<uint64_t, alg_type::std_fn>/1048576/32768bm_sample<uint8_t, alg_type::rng>/1048576/32768bm_sample<uint16_t, alg_type::rng>/1048576/32768bm_sample<uint32_t, alg_type::rng>/1048576/32768bm_sample<uint64_t, alg_type::rng>/1048576/32768bm_shuffle<uint8_t, alg_type::std_fn>/1048576bm_shuffle<uint16_t, alg_type::std_fn>/1048576bm_shuffle<uint32_t, alg_type::std_fn>/1048576bm_shuffle<uint64_t, alg_type::std_fn>/1048576bm_shuffle<uint8_t, alg_type::rng>/1048576bm_shuffle<uint16_t, alg_type::rng>/1048576bm_shuffle<uint32_t, alg_type::rng>/1048576bm_shuffle<uint64_t, alg_type::rng>/1048576