Make Uniform<f32/f64> and gen_range<f32/f64> honor limits strictly. Resolves #476 #477
Conversation
I guess travis answered the last paragraph...
Cool. Seems like that takes care of the no_std problems.
I understand your motivation, but I am not sure I agree with the method. It is not impossible to make a half-open range exact, i.e. to guarantee That could be faster for many samples, but slower for few or only one. Can you measure one thing: what do the range benchmarks do with the extra loop?
This feels premature to me since (a) it is difficult in general to ensure strict limits with FP, thus it is often better to design algorithms to accommodate (small) deviations, and (b) not all users care about strict compliance anyway. That said, if there is very little (performance/maintenance) overhead then this may be reasonable, but given all the extra code I doubt that is the case.
Given the low performance cost here I think it's worth providing the guarantee that the generated values are within the requested range. I've implemented @pitdicker's suggestion to adjust I'll pull numbers to see if this is visible in our benchmarks.
So I ran some performance numbers. At first Then I added Also confused why some builds seem unable to find the
Sorry, I've been a bit slow this weekend. Will have a better look tomorrow. We had trouble with confusing benchmarks in combination with The best thing to do then is to try again with
After:
So the difference in both benchmarks is 0.412 ns.
I find it strange also, but would not worry about it. If you rebase on master you'll see all the
Yeah, using Before:
After:
So a 32% slowdown for But for the more performance-optimized API of creating a Definitely more slowdown than desired for
As a test, I tried adding branch-predictor hints to the
Which is a 23% slowdown for
I'm still not sure whether to accept this PR, though it may be a good choice, especially in light of #494. However, a few changes are needed.
src/distributions/uniform.rs
Outdated
offset: offset,
assert!(low.is_finite() && high.is_finite(), "Uniform::new called with non-finite boundaries");

let mut scale = high - low;
Can overflow to infinite even if both inputs are finite. Skip the above check (you already know neither value is a NaN) and just assert scale < INFINITY. Also means you can simplify your Float trait.
I intentionally didn't want to panic when high and low are both finite numbers, even if high - low results in a non-finite number. Mainly because it felt like a lot to ask for users to know when that difference might overflow into infinity. But also a little because the fact that the difference can be expressed by an $ty is technically an implementation detail; another implementation of this function might not have that property.
Either way, the code below deals fine with scale becoming infinite here.
src/distributions/uniform.rs
Outdated
let mut scale = high - low;

let max_rand = (::core::$uty::MAX >> $bits_to_discard)
    .into_float_with_exponent(0) - 1.0;
I believe this is 1-ε/2 (i.e. the largest representable number < 1), but the name makes sense.
src/distributions/uniform.rs
Outdated
}

debug_assert!(scale >= 0.0);
If high is the next representable number above low then scale is effectively zero (although larger values than zero may still give the same rounding). Probably though we should accept zero.
Note that your while loop would take a long time to get to zero, since high - low may be much larger than the minimal representable value, though it likely doesn't matter since it should encounter a value of scale small enough that low + scale rounds to low much sooner.
Yeah, I'm banking on scale * max_rand + low > high only happening in situations with unfortunate rounding, so we should only need to make small adjustments to scale to get it within the acceptable range. Especially since max_rand is already < 1.
src/distributions/uniform.rs
Outdated
    .into_float_with_exponent(0) - 1.0;

while scale * max_rand + low >= high {
    scale = <$ty>::from_bits(scale.to_bits() - 1);
Is this always correct when the fraction part is zero, and does it handle sub-normals correctly? I think so. In any case this feels unsafe to me; better to have a single next_smallest_value() method in your Float trait.
Yeah, this will work as long as we don't cross the line from positive to negative numbers, including when it results in reducing the exponent, and in the sub-normal range.
IEEE numbers were designed specifically such that for non-NaN numbers you can check which of two numbers is bigger by comparing their binary representations as if they were integers. You do have to filter out the sign bit first, but since we're dealing with positive numbers that won't be a problem.
It even works for infinity, which I am relying on here.
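As a standalone illustration of this bit-pattern property (not code from the PR): decrementing the raw bits of a positive float steps to the next smaller representable value, across exponent boundaries, into subnormals, and even down from infinity:

```rust
// Step a positive, non-zero f64 down to the next representable value.
fn next_down(x: f64) -> f64 {
    f64::from_bits(x.to_bits() - 1)
}

fn main() {
    // Crossing an exponent boundary: the value just below 2.0.
    assert_eq!(next_down(2.0), 2.0 - f64::EPSILON);

    // The smallest positive normal steps down into the subnormal range.
    let sub = next_down(f64::MIN_POSITIVE);
    assert!(sub > 0.0 && sub < f64::MIN_POSITIVE);

    // It even works from infinity: one step below is f64::MAX.
    assert_eq!(next_down(f64::INFINITY), f64::MAX);
}
```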
src/distributions/uniform.rs
Outdated
// as it could be, we'll still generate good values.
// But give it one try to increase our chances that
// Uniform::new_inclusive is actually inclusive of the high value.
let new_scale = <$ty>::from_bits(scale.to_bits() + 1);
I don't think there's sufficient justification for this code — it's extra complexity to slightly improve the range in a case where it already meets all required constraints.
I don't have a strong opinion. But without this code it seems like there is very little practical difference between new and new_inclusive, i.e. it's very unlikely that the high value can actually be generated. It's arguably still very unlikely, but at least this code gives it a try.
But again, I'm not super opinionated on this. I'm not sure what the use case is for inclusive ranges for floats, but obviously the API requires it, so we need to stick something in this function.
Are there many cases where this code will do something useful?
I'm not sure how to measure usefulness given that I don't know of use cases for inclusive ranges for floats.
Given that there was a comment in the code which pointed out that new and new_inclusive generated the same values, do we have any use cases in mind where that is a problem?
Another approach here could be to implement new_inclusive as:

let max_rand = (::core::$uty::MAX >> $bits_to_discard)
    .into_float_with_exponent(0) - 1.0;
let mut scale = (high - low) / max_rand;
// Adjust scale lower until the max value we can generate is `high`
while scale * max_rand + low > high {
    scale = <$ty>::from_bits(scale.to_bits() - 1);
}
debug_assert!(scale >= 0.0);
UniformFloat { low, scale }

I don't know if the / max_rand actually makes a difference. But from a mathematical point of view it at least means that we increase the chance of being able to generate high.
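To see why dividing by max_rand can matter, here is a small standalone check (note: max_rand here is a stand-in value just below 1.0, not the exact constant computed in the PR):

```rust
fn main() {
    let (low, high): (f64, f64) = (0.0, 1.0);
    let max_rand = 1.0 - f64::EPSILON; // stand-in for the largest sample < 1.0

    // Without the division, even the largest sample falls short of `high`:
    let naive_scale = high - low;
    assert!(naive_scale * max_rand + low < high);

    // Dividing by max_rand stretches scale just enough that the top
    // sample can round up to `high`:
    let scale = (high - low) / max_rand;
    assert!(scale * max_rand + low >= high);
}
```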
I think 1.0 / max_rand > 1.0, so it probably does do something. Can the compiler optimise this to a multiplication, or would it be better to write (high - low) * (1.0 / max_rand)?
The implementation is simpler at least; unless the current version has a clear advantage this looks preferable.
Let's use this approach rather than the one I'm using right now.
Is the loss of precision from writing (high - low) * (1.0 / max_rand) really worth making the ctor faster? I.e. is ctor performance really a priority?
I don't know. Possibly not, but I think new_inclusive may get used for single-usage objects quite a bit. But anyway, we're not really losing precision since the while loop then reduces scale.
src/distributions/uniform.rs
Outdated
// Get a value in the range [0, 1) in order to avoid
// overflowing into infinity when multiplying with scale
let value0_1 = value1_2 - 1.0;
This loses a bit of precision everywhere just to handle scales close to what would round to infinity. I'm not sure it's worth it; we could instead assert that scale * 2 is finite in the constructor.
It shouldn't lose any precision. Subtracting 1.0 should not result in any rounding or loss of data. In fact, it looks to me like it should gain a tiny amount of precision, since we're directly adding low rather than the calculated (and thus possibly slightly rounded) offset.
But it does lose performance.
Either way, not having this here is certainly possible, but I think it'll require a decent amount more code in the constructor. We'll have to worry about not just adjusting scale, but also adjusting offset such that rounding can't result in values lower than low.
So it's a tradeoff of slower and more code complexity in the constructor, vs. saving around 8% in the sampling code.
Sorry, you're right about the precision.
I don't know what's the right answer regarding performance.
// Get a value in the range [0, 1) in order to avoid
// overflowing into infinity when multiplying with scale
let value0_1 = value1_2 - 1.0;
Again, you're dropping a bit of precision everywhere. So why not do value1_2 * scale + offset as before, then at the end of the loop check whether res is finite and panic if not?
I might be missing something, but I see this only costing performance, not any precision. But do let me know if I'm wrong?
In this case it should really cost very little performance though. While we need an additional subtraction here, we are also saving a subtraction since we're not calculating offset. Though for constant inputs, offset and scale can be computed at compile time. But arguably someone caring about performance and using constant ranges should create a Uniform.
Additionally, using the offset approach means that we have to check the resulting value both against high and against low, so the resulting code is slower. Here are the perf numbers I get:
Using value1_2 * scale + offset
test gen_range_f32 ... bench: 4,835 ns/iter (+/- 857) = 827 MB/s
test gen_range_f64 ... bench: 6,896 ns/iter (+/- 641) = 1160 MB/s
Using value0_1 * scale + low
test gen_range_f32 ... bench: 4,335 ns/iter (+/- 251) = 922 MB/s
test gen_range_f64 ... bench: 5,712 ns/iter (+/- 916) = 1400 MB/s
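The two formulas being compared can be sketched as follows (a standalone illustration, not the PR code; value1_2 stands for the raw sample in [1, 2) built from random mantissa bits, as discussed above):

```rust
// Offset form: precompute offset = low - scale and sample with one
// multiply-add, at the cost of range checks against both bounds.
fn sample_offset(value1_2: f64, scale: f64, offset: f64) -> f64 {
    value1_2 * scale + offset
}

// Low form: shift the sample into [0, 1) first. For values in [1, 2) the
// subtraction of 1.0 is exact (Sterbenz lemma), so no precision is lost,
// and `low` is added directly rather than via the rounded `offset`.
fn sample_low(value1_2: f64, scale: f64, low: f64) -> f64 {
    (value1_2 - 1.0) * scale + low
}

fn main() {
    let (low, high) = (10.0_f64, 20.0_f64);
    let scale = high - low;
    let offset = low - scale;
    for &v in &[1.0, 1.25, 1.5, 1.999] {
        assert!(sample_offset(v, scale, offset) >= low);
        assert!(sample_offset(v, scale, offset) < high);
        assert!(sample_low(v, scale, low) >= low);
        assert!(sample_low(v, scale, low) < high);
    }
}
```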
Here's an alternative approach for implementing new. However it seems to hang for really large values. I'm sure it can be fixed, but ultimately I don't think the complexity and precision loss is worth it. And the algorithm here has effectively unbounded performance cost, which is also a concern. So I think we should stick with the approach in the PR.

fn new(low: Self::X, high: Self::X) -> Self {
    assert!(low < high, "Uniform::new called with `low >= high`");
    assert!(low.is_finite() && high.is_finite(), "Uniform::new called with non-finite boundaries");
fn adjust_down(val: $ty) -> $ty {
if val > 0.0 {
<$ty>::from_bits(val.to_bits() - 1)
} else if val < 0.0 {
<$ty>::from_bits(val.to_bits() + 1)
} else {
-<$ty>::from_bits(1)
}
}
let max_rand = (::core::$uty::MAX >> $bits_to_discard)
.into_float_with_exponent(0);
let min_rand = (0 as $uty >> $bits_to_discard)
.into_float_with_exponent(0);
let mut low_adjusted = low;
let mut high_adjusted = high;
loop {
let scale = high_adjusted - low_adjusted;
let offset = low_adjusted - scale;
if !(scale * max_rand + offset < high) {
high_adjusted = adjust_down(high_adjusted);
} else if !(scale * min_rand + offset >= low) {
low_adjusted = -adjust_down(-low_adjusted);
} else {
return UniformFloat { scale, offset };
}
}
}
src/distributions/uniform.rs
Outdated
scale: scale,
offset: offset,
assert!(low <= high, "Uniform::new_inclusive called with `low > high`");
assert!(low.is_finite() && high.is_finite(), "Uniform::new_inclusive called with non-finite boundaries");
You can reduce the bounds checks a bit: (NEG_INFINITY < low) & (low <= high) & (high < INFINITY). Or maybe &&, but in theory that can short-circuit, which isn't useful here.
Oh, that's a great idea! But I don't understand why we wouldn't want to short-circuit?
Because it might prevent the processor testing multiple things in parallel, and because the common case is that all these tests pass. I don't know; it might be optimisable to the same thing anyway.
So it turns out that the old approach is actually slightly faster:
&
test gen_range_f32 ... bench: 4,459 ns/iter (+/- 169) = 897 MB/s
test gen_range_f64 ... bench: 6,027 ns/iter (+/- 334) = 1327 MB/s
&&
test gen_range_f32 ... bench: 4,464 ns/iter (+/- 163) = 896 MB/s
test gen_range_f64 ... bench: 6,038 ns/iter (+/- 592) = 1324 MB/s
old
test gen_range_f32 ... bench: 4,382 ns/iter (+/- 135) = 912 MB/s
test gen_range_f64 ... bench: 5,665 ns/iter (+/- 682) = 1412 MB/s
Technically none of these differences are statistically significant, however it definitely looks like the last is faster for f64. And it does provide better error messages.
My best guess is that the old approach ends up being implementable using bit logic, whereas the NEG_INF < low check has to be done using slower floating-point instructions.
I've rebased this on top of the newly landed SIMD code, which was definitely non-trivial. The good news is that it's enforcing the limits for both primitives and SIMD types. I also found a way to make I'd really like to get this landed though. Especially since the weighted sampling has landed and currently risks panicking randomly if you use f32/f64 weights.
So I ran some performance numbers, and got quite strange results. Basically if I remove the assert!(low.all_finite() && high.all_finite(),
"Uniform::sample_single called with non-finite boundaries"); then I see no performance degradation for However, if I add back the above code, then I see a 50% slowdown. Even though the above code never runs during the benchmark. In the new implementation the code above is only run if I suspect that the problem is related to when the optimizer decides to inline some critical function, but sprinkling #[inline(always)] everywhere doesn't seem to make a difference as long as the above code is there. But I suspect that this problem is one of the realities of using micro-benchmarks. I.e. I don't think a real-world usage of In theory we could break out some of the code into a separate non-inlined function. But since it'd lead to significant code duplication, I don't think it's worth it. Especially since we don't know how a future version of rustc will optimize these things.
Oh, and I should say that I got these numbers using
Perhaps you could simplify your test? assert!(scale.all_finite(), "Uniform::sample_single called with non-finite boundaries");
Unfortunately
I should also mention that the
But do we really care about "nearly infinite" ranges? If someone really is treading so close to the limits of
I tried just checking I'm also a bit hesitant to change the API to work around optimizer quirks, which could change with so many parameters, like future versions of rustc, what the calling code does, and which rng is used. FWIW, I just tried using
Okay, I think it's time I stopped opposing this based on performance overhead; it's barely above the noise threshold (notice how unrelated numbers change here):
Is this ready? Please rebase, and please separate the new benchmarks into a separate commit (so it's easier to do before/after comparisons).
Only question I had remaining is if we want to merge the
I don't care much about renaming (these are only internal, right?) but it's probably better to keep the two traits separate if there's no reason to merge.
src/distributions/utils.rs
Outdated
#[inline(always)]
fn extract(self, index: usize) -> Self { assert_eq!(index, 0); self }
#[inline(always)]
fn replace(self, index: usize, new_value: Self) -> Self { assert_eq!(index, 0); new_value }
I think it would be more appropriate to use debug_assert here?
I was trying to replicate the behavior of these functions on real SIMD types, and hoping that the compiler would optimize the checks away. Ultimately it doesn't really matter right now; these functions are only used in tests. Happy to go either way.
I did change decrease_masked though, to not cast the bool flag to an int. Instead I made it part of the contract that at least one "lane" must be set, and added debug_asserts to ensure that. That way we can always subtract 1. This might have gotten optimized away anyway, but we might as well be safe.
Yeah, they're only used internally. It's slightly less code to merge them, so I added a commit for that. But I kept the commit separate, so feel free to take it or not.
Sneaky! But it saves more code than I realised due to all the import rules.
Sorry, I intended to provide the option, not to sneak it in.
This adds a small amount of overhead to Uniform<f32/f64> and gen_range<f32/f64>. For Uniform<f32/f64> it adds one subtraction and one branch in the common case. For gen_range<f32/f64> it adds just one branch in the common case. I see no way around this while also being sure to stay in the requested range.
This does use f64::from_bits/to_bits/is_infinite/is_finite, which might only be available through std? Though it looks like core::num also makes them available here? I can't tell, given that the functions are marked as stable but the trait is marked unstable.
I'm not sure what the acceptable build requirements are for no_std builds? And is there any way to run cargo test, or even just cargo build, in no_std mode?