<random>: Optimize uniform_int_distribution with multiply-shift #178
I would like to work on this. In the hope of not wasting everyone's time with many needless iterations, I have a few questions.
static constexpr bool _Is_power_of2(_Udiff _Value) {
    return _Value && !(_Value & (_Value - 1));
}

_Diff operator()(_Diff _Index) {
    if (_Is_power_of2(_Index)) {
        ...
        return static_cast<_Diff>(_Reduce_by_power2(_Ret, _Index));
    }
    for (;;) { ...
        if (_Ret / _Index < _Mask / _Index) {
            return static_cast<_Diff>(_Ret % _Index);
        }
    }
}

On a side note, am I correct in assuming that
There will eventually be, but there is none yet. For now all we can say is "try to imitate the style that's already there, and be prepared for us to communicate much of the style to you in code reviews".
It depends on whether hardware 64-bit division is faster than a bespoke 64 x 64 = 128-bit multiply (see
In general, more and faster is preferable to less and slower, but there are no absolutes. (Your example certainly looks reasonable.)
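For readers unfamiliar with the tradeoff being discussed, here is a minimal sketch of the two candidate 64-bit reductions; the names are illustrative only and not part of the library code.

#include <cstdint>

// Division-based reduction: one hardware 64-bit modulo per accepted sample
// (the bias-removing rejection logic is omitted here).
std::uint64_t reduce_by_modulo(std::uint64_t value, std::uint64_t bound) {
    return value % bound;
}

// Multiply-shift reduction: a bespoke 64 x 64 = 128-bit multiply, keeping the
// high half. unsigned __int128 is a GCC/Clang extension on 64-bit targets.
std::uint64_t reduce_by_multiply_shift(std::uint64_t value, std::uint64_t bound) {
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(value) * bound) >> 64);
}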
Thanks for the clarification. I'll check which solution for the 64-bit case is faster and create a PR. With regard to testing, I assume that additional test cases for the changed code, and where to put them, can be discussed in the review and extended once the test suites are also available here?
Thanks for looking into this! Based on my experience with charconv/Ryu, I'm pretty sure that umul128 will be great on x64, and decent on x86 compared to runtime division (which was horrible until very recent architectures, where it is less horrible).
Yeah. We're hopefully about a month away from getting our test suites online. Note that this enhancement is tagged vNext because I believe that it will break ABI - specifically, changing the observable behavior of
(Unless the new technique happens to generate exactly the same values as the current technique, while storing the exact same data members, which seems unlikely.)
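For context on the umul128 mentioned above: on MSVC targeting x64, the 64 x 64 = 128-bit multiply is exposed as the _umul128 intrinsic in <intrin.h>. A hedged sketch of obtaining the high half with it (the wrapper name is illustrative, not the code under review):

#include <cstdint>
#include <intrin.h>

// Returns the high 64 bits of value * bound, i.e. floor(value * bound / 2^64).
// _umul128 is x64-only; other targets need a different path.
std::uint64_t mul_shift_high(std::uint64_t value, std::uint64_t bound) {
    std::uint64_t high;
    (void) _umul128(value, bound, &high); // the low half is discarded here
    return high;
}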
I did some work on the 64-bit path and think I found a pretty neat solution to work around the 128-bit operations. The "divisor" in the multiply-shift operation is a power of two, which allows us to integrate it into the shift value. Determining log2 of a known power of 2 and then shifting is pretty fast compared to 128-bit operations. A rough test showed the log2 approach to be ~60% faster. By the way, are there specific guidelines for comments? I'm a bit unsure how detailed they should be. Below is the adaptation of the multiply-shift operation. Any suggestions for improvement or better style are welcome. :)

// branchless log2 of a value known to be a power of two
static constexpr size_t _Log2OfPwr2(uint64_t __power_two_val) noexcept {
    size_t __log_val = (((__power_two_val & 0x00000000FFFFFFFFULL) != 0) - 1) & 32;
    __log_val += (((__power_two_val & 0x0000FFFF0000FFFFULL) != 0) - 1) & 16;
    __log_val += (((__power_two_val & 0x00FF00FF00FF00FFULL) != 0) - 1) & 8;
    __log_val += (((__power_two_val & 0x0F0F0F0F0F0F0F0FULL) != 0) - 1) & 4;
    __log_val += (((__power_two_val & 0x3333333333333333ULL) != 0) - 1) & 2;
    __log_val += (((__power_two_val & 0x5555555555555555ULL) != 0) - 1) & 1;
    return __log_val;
}
static constexpr _Udiff _Reduce_by_power2(_Udiff _Value, _Udiff _Mod) noexcept {
    constexpr auto _Udiff_bits = CHAR_BIT * sizeof(_Udiff);
    if constexpr (sizeof(_Udiff) < sizeof(unsigned int)) {
        // compute in unsigned int: the promoted product of two 16-bit values
        // could otherwise overflow a signed int
        return static_cast<_Udiff>(
            (static_cast<unsigned int>(_Value) * static_cast<unsigned int>(_Mod)) >> _Udiff_bits);
    } else if constexpr (sizeof(_Udiff) == sizeof(unsigned int)) {
        // ensure promotion to unsigned long long to keep all bits of the product
        return static_cast<_Udiff>((static_cast<unsigned long long>(_Value) * _Mod) >> _Udiff_bits);
    }
    // modification of multiply-shift to avoid costly expansion to 128 bits:
    // for a power-of-two _Mod, multiplying and keeping the high word is the
    // same as shifting right by (_Udiff_bits - log2(_Mod))
    return _Value >> (_Udiff_bits - _Log2OfPwr2(_Mod));
}
static constexpr bool _Is_power_of2(_Udiff _Value) noexcept {
    return _Value && !(_Value & (_Value - 1));
}
_Diff operator()(_Diff _Index) { // adapt _Urng closed range to [0, _Index)
    if (_Is_power_of2(_Index)) {
        _Udiff _Ret  = 0; // random bits
        _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]
        while (_Mask < _Udiff(_Index - 1)) { // need more random bits
            _Ret <<= _Bits - 1; // avoid full shift
            _Ret <<= 1;
            _Ret |= _Get_bits();
            _Mask <<= _Bits - 1; // avoid full shift
            _Mask <<= 1;
            _Mask |= _Bmask;
        }
        // _Ret is [0, _Mask], _Index - 1 <= _Mask
        return static_cast<_Diff>(_Reduce_by_power2(_Ret, _Index));
    }
    for (;;) { // try a sample random value
        _Udiff _Ret  = 0; // random bits
        _Udiff _Mask = 0; // 2^N - 1, _Ret is within [0, _Mask]
        while (_Mask < _Udiff(_Index - 1)) { // need more random bits
            _Ret <<= _Bits - 1; // avoid full shift
            _Ret <<= 1;
            _Ret |= _Get_bits();
            _Mask <<= _Bits - 1; // avoid full shift
            _Mask <<= 1;
            _Mask |= _Bmask;
        }
        // _Ret is [0, _Mask], _Index - 1 <= _Mask, return if unbiased
        if (_Ret / _Index < _Mask / _Index) {
            return static_cast<_Diff>(_Ret % _Index);
        }
    }
}

With regard to breaking ABI, there are currently no changes to data members or additional non-static methods in my change.
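As a quick cross-check of the equivalence the comment above relies on (multiplying by a power of two and keeping the high word equals a right shift), here is a standalone sketch; it assumes GCC/Clang's unsigned __int128 and uses illustrative names unrelated to the proposed change:

#include <cstdint>

// High 64 bits of v * m, i.e. the generic multiply-shift reduction.
constexpr std::uint64_t mul_high(std::uint64_t v, std::uint64_t m) {
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(v) * m) >> 64);
}

// The shortcut for a power-of-two m = 2^k: keep only the top k bits of v.
constexpr std::uint64_t shift_high(std::uint64_t v, unsigned k) {
    return k == 0 ? 0 : v >> (64 - k);
}

static_assert(mul_high(0x123456789ABCDEF0ULL, 1ULL << 20) == shift_high(0x123456789ABCDEF0ULL, 20), "");
static_assert(mul_high(0xFFFFFFFFFFFFFFFFULL, 1ULL << 1) == shift_high(0xFFFFFFFFFFFFFFFFULL, 1), "");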
This doesn't seem more "ABI breaking" than any other behavior change: If we avoid changing the representation of
Ok, I'm convinced. (Note that adding non-static member functions is totally fine for ABI; only
I should clarify this for the lurkers: when we change the behavior of an internal function - something with an ugly name that users don't know about or call directly - we do consider it to be an ABI break. Users can't be expected to know that
With "pretty" functions that users do call, the burden is on them to recompile anything that touches e.g.
Leaving this here to avoid confusion about my previous suggestion. Seems like I took a wrong turn with my approach. The operation for an index that is a power of 2 should just be
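The tail of that comment is cut off above. Purely as general background (not necessarily what the author had in mind): when the bound is a power of two and the accumulated bits are uniform over [0, _Mask] with _Index - 1 <= _Mask, the conventional unbiased reduction is to mask off the low bits, for example:

#include <cstdint>

// Illustrative only: reduce a uniform value to [0, bound) when bound is a
// power of two; the low log2(bound) bits are already uniformly distributed.
std::uint64_t reduce_power_of_two(std::uint64_t value, std::uint64_t bound) {
    return value & (bound - 1);
}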
@CaseyCarter does this mean that a proposal to standardize outputs of distributions in
There are many different kinds of breaking change. Assuming that we could standardize the outputs of distributions without storing additional state, that proposal would be a behavioral change - and therefore source-breaking - but not ABI breaking.
<random>: Optimize uniform_int_distribution with multiply-shift

I reimplemented uniform_int_distribution long ago: STL/stl/inc/xutility, lines 4878 to 4942 in 53cdb9f.
This technique described by Daniel Lemire should be a significant improvement: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
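For context, the core of the linked post is replacing x % N with a multiply and a shift. A minimal 32-bit sketch of that reduction (illustrative only; a real uniform_int_distribution additionally needs a rejection step to remove the slight bias when the bound does not divide 2^32):

#include <cstdint>

// Maps a uniform 32-bit value x into [0, bound) without a division:
// the result is the high 32 bits of the 64-bit product x * bound.
std::uint32_t reduce(std::uint32_t x, std::uint32_t bound) {
    return static_cast<std::uint32_t>((std::uint64_t{x} * bound) >> 32);
}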