implement SIMD UniformInt #561
Conversation
I don’t know wasm well enough to know why the test is failing
And it worked yesterday 😢. Seems to be rust-lang/rust#52353
I've got a … The performance seems very hardware dependent
This looks great to me FWIW!!
src/distributions/uniform.rs
Outdated
```rust
// The `range == 0` case is complicated with multiple lanes,
// avoid it.
assert!(range.eq($unsigned::splat(0)).none(),
        "Uniform::new_inclusive called with the full range");
```
One way to actually implement this would be to set `zone` to an arbitrary value here for the lanes where `range == 0`. Then, in `sample`, change the return statement to:

```rust
return range.ne($unsigned::splat(0)).select(self.low + $ty::from(hi), v);
```
It obviously adds some overhead, but so does our handling in the non-simd case.
It's possible that the overhead is bigger here, though. In particular, I'm not sure whether the `$unsigned::splat(0)` call will get compiled into something fast. It would for non-SIMD types, but I don't know SIMD instructions well enough to know if that's the case here.
If there's no way to make the splat operation fast, you could also try:

```rust
return range.ne(hi).select(self.low + $ty::from(hi), v);
```
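For intuition, here's a scalar sketch of that lane-wise `select` (the function name is hypothetical and each slice index stands in for one SIMD lane): lanes where `range == 0` keep the raw sample `v` unchanged, while the rest take `low + hi`.

```rust
/// Scalar emulation of `range.ne($unsigned::splat(0)).select(self.low + $ty::from(hi), v)`.
/// `select_nonzero_range` is an illustrative name; each index plays one SIMD lane.
fn select_nonzero_range(range: &[u32], low: &[u32], hi: &[u32], v: &[u32]) -> Vec<u32> {
    range
        .iter()
        .zip(low)
        .zip(hi)
        .zip(v)
        .map(|(((&r, &l), &h), &raw)| {
            if r != 0 {
                // normal case: offset the reduced sample into the target range
                l.wrapping_add(h)
            } else {
                // full range: the raw sample is already uniform
                raw
            }
        })
        .collect()
}
```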
src/distributions/uniform.rs
Outdated
```rust
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X {
    let range = $unsigned::from(self.range);
    // TODO: ensure these casts are a no-op
    let zone = $unsigned::from($signed::from(self.zone));
```
I think simply doing

```rust
$unsigned::from($signed::from(self.zone));
```

should work. We need multiple casts in the non-SIMD case because we're also changing the bit-width, but that's not the case here AFAICT.
Err… I mean

```rust
$unsigned::from(self.zone);
```
src/distributions/uniform.rs
Outdated
```rust
$(
    let v: &[($ty, $ty)] = &[
        ($ty::splat(0), $ty::splat(10)),
        ($ty::splat(10), $ty::splat(127)),
```
Could you fold this into the existing macro by using `FloatSIMDUtils::cast_from_int` and `FloatSIMDUtils::all_le`?
Or rather, replace the existing macro with your new macro, and just add the scalar types to it.
I guess we'd still have to keep the full-range tests separate unless we make the changes above.
I should also add that my approval is entirely non-official. It just means that I think the code looks good 😃
Here's the … Currently investigating the …
Added code to handle the full-range case. Builds fail because of the stdsimd changes; I can fix that when those changes stabilize. There seems to be no significant change in benchmarks:

```text
name no_full_range_support ns/iter full_range_support ns/iter diff ns/iter diff % speedup
+simd::distr_uniform_i16x16 13,230 (2418 MB/s) 12,070 (2651 MB/s) -1,160 -8.77% x 1.10
+simd::distr_uniform_i16x2 8,769 (456 MB/s) 7,062 (566 MB/s) -1,707 -19.47% x 1.24
+simd::distr_uniform_i16x32 43,309 (1477 MB/s) 35,561 (1799 MB/s) -7,748 -17.89% x 1.22
+simd::distr_uniform_i16x4 7,993 (1000 MB/s) 6,250 (1280 MB/s) -1,743 -21.81% x 1.28
+simd::distr_uniform_i16x8 3,695 (4330 MB/s) 3,409 (4693 MB/s) -286 -7.74% x 1.08
-simd::distr_uniform_i32x16 46,434 (1378 MB/s) 54,196 (1180 MB/s) 7,762 16.72% x 0.86
-simd::distr_uniform_i32x2 6,253 (1279 MB/s) 7,861 (1017 MB/s) 1,608 25.72% x 0.80
-simd::distr_uniform_i32x4 7,447 (2148 MB/s) 8,922 (1793 MB/s) 1,475 19.81% x 0.83
+simd::distr_uniform_i32x8 18,559 (1724 MB/s) 17,130 (1868 MB/s) -1,429 -7.70% x 1.08
+simd::distr_uniform_i64x2 6,651 (2405 MB/s) 6,313 (2534 MB/s) -338 -5.08% x 1.05
-simd::distr_uniform_i64x4 15,708 (2037 MB/s) 18,424 (1736 MB/s) 2,716 17.29% x 0.85
+simd::distr_uniform_i64x8 27,984 (2287 MB/s) 27,929 (2291 MB/s) -55 -0.20% x 1.00
+simd::distr_uniform_i8x16 15,296 (1046 MB/s) 12,748 (1255 MB/s) -2,548 -16.66% x 1.20
+simd::distr_uniform_i8x2 10,303 (194 MB/s) 8,399 (238 MB/s) -1,904 -18.48% x 1.23
+simd::distr_uniform_i8x32 27,191 (1176 MB/s) 25,844 (1238 MB/s) -1,347 -4.95% x 1.05
+simd::distr_uniform_i8x4 10,192 (392 MB/s) 10,137 (394 MB/s) -55 -0.54% x 1.01
+simd::distr_uniform_i8x64 146,547 (436 MB/s) 120,560 (530 MB/s) -25,987 -17.73% x 1.22
-simd::distr_uniform_i8x8 11,385 (702 MB/s) 13,822 (578 MB/s) 2,437 21.41% x 0.82
```
This looks great! I'll defer to others for the test stuff.
src/distributions/uniform.rs
Outdated
```rust
let ints_to_reject = (unsigned_max - range + 1) % modulo;
// `(unsigned_max - 0 + 1) % x` => `0 % x` => `0`, so
// the zone for the full range case is `unsigned_max`, which
// means only one sample is needed.
```
Any value is OK, since the `sample` code does `(rand * range).low <= zone`. But since `range` is 0, the `low` part will always be zero, so the test will always pass. But `unsigned_max` here is as good as any other value.
`unsigned_max` is nice as it allows a faster `select`. That, and the "range is 0" behaviour, should probably be documented above this, though.
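To make the accept/reject mechanics concrete, here's a one-lane scalar sketch of the widening-multiply step (the function name and `Option` return are illustrative, not the macro's code): `hi` is the candidate offset and `lo` is the value tested against the zone.

```rust
/// One u32 lane of the widening-multiply rejection step:
/// `hi` lands in [0, range), `lo` decides accept vs reject.
/// Illustrative sketch, not the macro's identifiers.
fn widening_mul_sample(rand: u32, range: u32, zone: u32) -> Option<u32> {
    let wide = rand as u64 * range as u64;
    let hi = (wide >> 32) as u32; // candidate offset
    let lo = wide as u32;         // rejection test value
    if lo <= zone { Some(hi) } else { None }
}
```

For example, with `range = 10` the zone is `u32::MAX - ((u32::MAX - 10 + 1) % 10)`, i.e. 4,294,967,289, and both the smallest and largest `rand` values are accepted.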
```rust
let my_uniform = Uniform::new_inclusive(low, high);
for _ in 0..1000 {
    let v: $ty = rng.sample(my_uniform);
    assert!($le(low, v) && $le(v, high));
```
You could introduce something like `FloatSIMDUtils` and `FloatAsSIMD::splat`, but for integer types. That would allow a lot of this test code to be written more nicely.
But I'll defer to @dhardy and @pitdicker.
It could, but what's written here is sufficient.
Updated for …
If we want to include benchmarks, 16-bit, 32-bit, and 64-bit lane widths will likely have very different performance on most hardware. At the very least, doing benchmarks for …
As has already been said, this looks really good! Thanks @TheIronBorn!
There's no `sample_single` implementation; possibly it could benefit from the same `range << range.leading_zeros()` optimisation, but at this stage it doesn't seem important.
A couple of little points about the arithmetic here: if it's wrapping by default (unlike other Rust numeric types), then there are no issues. (But are there plans to change this in the future?)
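The wrapping question matters most for the full range. A scalar sketch (hypothetical helper, using `i8`/`u8` in place of one signed/unsigned lane pair) of why `(high - low) + 1` must wrap:

```rust
use std::num::Wrapping;

/// For the full i8 range the true span (256) doesn't fit in u8, so
/// `(high - low) + 1` must wrap to 0, the "sample any value" marker.
fn range_u8(low: i8, high: i8) -> u8 {
    (Wrapping(high as u8) - Wrapping(low as u8) + Wrapping(1)).0
}
```

With non-wrapping arithmetic the same expression would panic in debug builds for the full range, which is exactly the concern raised above.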
src/distributions/uniform.rs
Outdated

```rust
#[inline] // if the range is constant, this helps LLVM to do the
          // calculations at compile-time.
          // TODO: test this ^ for SIMD
```
Are you planning on testing? It seems fine. Note that `#[inline]` is needed to allow inlining in external crates, unless using (full) link-time optimisation; so although the compiler might inline this anyway in internal tests, it wouldn't otherwise from a separate lib/bin.
src/distributions/uniform.rs
Outdated
```rust
"Uniform::new_inclusive called with `low > high`");
let unsigned_max = ::core::$u_scalar::MAX;

let range: $unsigned = ((high - low) + 1).cast();
```
Is SIMD arithmetic wrapping by default?
src/distributions/uniform.rs
Outdated
```rust
// replacing 0 with `unsigned_max` allows a faster `select`
// with bitwise OR
let modulo = not_full_range.select(range, $unsigned::splat(unsigned_max));
let ints_to_reject = (unsigned_max - range + 1) % modulo;
```
If range is 0, this will overflow — or wrap to 0?
wrap to 0
src/distributions/uniform.rs
Outdated
```rust
// bulk implementation
($(($unsigned:ident, $signed:ident),)+ $u_scalar:ident) => {
    $(
        uniform_simd_int_impl!($unsigned, $unsigned, $signed, $u_scalar);
```
The `$signed` parameter appears to never be used.
There is no easy SIMD leading zeros. I implemented one on my other branch using a bit-twiddling hack, but its performance seems too hardware dependent
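For reference, the scalar `range << range.leading_zeros()` trick computes a single-sample rejection zone without any division. A one-lane sketch (illustrative name; this mirrors the scalar approach as I understand it, not this PR's code):

```rust
/// Zone for single-sample rejection: shifting the range up to the top
/// bit gives (a power-of-two multiple of `range`) - 1, with no division.
/// Assumes `range != 0`; a zero range would shift by the full bit width.
fn zone_via_clz(range: u32) -> u32 {
    debug_assert!(range != 0);
    (range << range.leading_zeros()).wrapping_sub(1)
}
```

The SIMD difficulty is that `leading_zeros` has no cheap vector equivalent on most targets, hence the bit-twiddling hack mentioned above.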
All …
Could you then add a note where we need to use wrapping? Might help avoid bugs in the future.
Assembly inspection confirms range calculations are done at compile-time.
(512-bit vectors will still need the …)
Turns out a scalar leading-zeros implementation is still fast enough. I'll investigate if there's a similar slowdown for …
Edit: perhaps only because I was using flat vectors, …
Scalar leading zeros is still faster even when the compiler can't (theoretically) optimize it to a single …
I'm currently investigating some of the other methods listed here: http://www.pcg-random.org/posts/bounded-rands.html. SIMD division is extremely slow, so I expect only the bitmask method will be comparable.
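The bitmask method being compared can be sketched on one scalar lane like this (names illustrative; the real SIMD version applies the same mask to every lane at once):

```rust
/// Bitmask rejection: mask random bits down to the smallest 2^k - 1
/// covering the range, then reject candidates past `max` (the
/// inclusive upper bound). Assumes `max > 0`, otherwise the shift
/// count would equal the bit width.
fn bitmask_sample(rand: u32, max: u32) -> Option<u32> {
    let mask = u32::MAX >> max.leading_zeros();
    let candidate = rand & mask;
    if candidate <= max { Some(candidate) } else { None }
}
```

Its expected rejection rate is under 50% but grows as `max` moves away from the next power of two, which matches the slowdown seen in the benchmarks below.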
One optimization that might be worthwhile here is to check for small ranges. I.e. when the range can fit in 16 bits, don't use the full 32 bits and a widening multiply, but just a mask and a normal multiply. I imagine this to be an advantage in the common case, because ranges are rarely the size an integer can hold.
Edit: oh, are you suggesting switching from …
That gives a large speedup in an ideal situation (most likely, though, because it needs only half as many random bits). A proper implementation would be difficult and likely slow, because some lanes may still have a large range. We could apply it only when all the lanes are small. (Less speedup with fewer lanes; the lower bit requirement is trumped by overhead.)
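A scalar sketch of the small-range idea (hypothetical helper): when the range fits in 16 bits, a 16x16 -> 32-bit multiply replaces the widening 32x32 -> 64-bit multiply and consumes only 16 random bits per lane.

```rust
/// Map a uniform u16 into [0, range) with a plain 32-bit multiply.
/// This is the biased version with no rejection step, for illustration
/// of the multiply itself only.
fn small_range_mul(rand16: u16, range: u16) -> u16 {
    ((rand16 as u32 * range as u32) >> 16) as u16
}
```

The catch noted above: in SIMD, every lane shares one code path, so this only pays off when all lanes have small ranges.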
It might be a nice optimization for shuffles or other cases where we know more about the range.
The bitmask method seems better:

```text
name wide_mul ns/iter bitmask_simple ns/iter diff ns/iter diff % speedup
-simd_uniform_i16x16 13,282 (2467 MB/s) 42,491 (771 MB/s) 29,209 219.91% x 0.31
-simd_uniform_i16x2 4,446 (921 MB/s) 18,856 (217 MB/s) 14,410 324.11% x 0.24
-simd_uniform_i16x32 38,002 (1724 MB/s) 87,600 (748 MB/s) 49,598 130.51% x 0.43
-simd_uniform_i16x4 4,656 (1759 MB/s) 20,315 (403 MB/s) 15,659 336.32% x 0.23
-simd_uniform_i16x8 6,408 (2556 MB/s) 24,405 (671 MB/s) 17,997 280.85% x 0.26
+simd_uniform_i32x16 50,561 (1296 MB/s) 26,833 (2442 MB/s) -23,728 -46.93% x 1.88
-simd_uniform_i32x2 5,257 (1558 MB/s) 5,411 (1513 MB/s) 154 2.93% x 0.97
+simd_uniform_i32x4 11,600 (1412 MB/s) 8,789 (1864 MB/s) -2,811 -24.23% x 1.32
+simd_uniform_i32x8 19,774 (1657 MB/s) 14,907 (2198 MB/s) -4,867 -24.61% x 1.33
+simd_uniform_i64x2 9,799 (1672 MB/s) 8,965 (1827 MB/s) -834 -8.51% x 1.09
+simd_uniform_i64x4 16,291 (2011 MB/s) 14,064 (2329 MB/s) -2,227 -13.67% x 1.16
+simd_uniform_i64x8 30,050 (2180 MB/s) 28,011 (2339 MB/s) -2,039 -6.79% x 1.07
+simd_uniform_i8x16 16,832 (973 MB/s) 12,637 (1296 MB/s) -4,195 -24.92% x 1.33
+simd_uniform_i8x2 5,822 (351 MB/s) 5,363 (381 MB/s) -459 -7.88% x 1.09
+simd_uniform_i8x32 24,748 (1324 MB/s) 17,520 (1870 MB/s) -7,228 -29.21% x 1.41
+simd_uniform_i8x4 7,536 (543 MB/s) 6,640 (616 MB/s) -896 -11.89% x 1.13
+simd_uniform_i8x64 147,202 (445 MB/s) 109,336 (599 MB/s) -37,866 -25.72% x 1.35
+simd_uniform_i8x8 9,910 (826 MB/s) 9,794 (836 MB/s) -116 -1.17% x 1.01
+simd_uniform_single_i16x16 82,276 (398 MB/s) 63,487 (516 MB/s) -18,789 -22.84% x 1.30
+simd_uniform_single_i16x2 22,122 (185 MB/s) 16,473 (248 MB/s) -5,649 -25.54% x 1.34
-simd_uniform_single_i16x32 175,026 (374 MB/s) 404,042 (162 MB/s) 229,016 130.85% x 0.43
+simd_uniform_single_i16x4 35,129 (233 MB/s) 21,397 (382 MB/s) -13,732 -39.09% x 1.64
+simd_uniform_single_i16x8 27,776 (589 MB/s) 26,615 (615 MB/s) -1,161 -4.18% x 1.04
+simd_uniform_single_i32x16 81,288 (806 MB/s) 53,903 (1215 MB/s) -27,385 -33.69% x 1.51
+simd_uniform_single_i32x2 9,129 (897 MB/s) 8,398 (975 MB/s) -731 -8.01% x 1.09
+simd_uniform_single_i32x4 17,449 (938 MB/s) 14,151 (1157 MB/s) -3,298 -18.90% x 1.23
+simd_uniform_single_i32x8 40,181 (815 MB/s) 32,625 (1004 MB/s) -7,556 -18.80% x 1.23
+simd_uniform_single_i64x2 24,001 (682 MB/s) 11,544 (1419 MB/s) -12,457 -51.90% x 2.08
+simd_uniform_single_i64x4 40,520 (808 MB/s) 23,478 (1395 MB/s) -17,042 -42.06% x 1.73
+simd_uniform_single_i64x8 85,092 (770 MB/s) 48,127 (1361 MB/s) -36,965 -43.44% x 1.77
+simd_uniform_single_i8x16 66,271 (247 MB/s) 54,895 (298 MB/s) -11,376 -17.17% x 1.21
+simd_uniform_single_i8x2 23,645 (86 MB/s) 18,929 (108 MB/s) -4,716 -19.95% x 1.25
-simd_uniform_single_i8x32 139,309 (235 MB/s) 348,528 (94 MB/s) 209,219 150.18% x 0.40
+simd_uniform_single_i8x4 28,646 (142 MB/s) 23,976 (170 MB/s) -4,670 -16.30% x 1.19
-simd_uniform_single_i8x64 967,238 (67 MB/s) 986,099 (66 MB/s) 18,861 1.95% x 0.98
+simd_uniform_single_i8x8 29,513 (277 MB/s) 27,354 (299 MB/s) -2,159 -7.32% x 1.08
```

Benchmarks here: https://github.com/TheIronBorn/rand/blob/simd/benches/simd_uniform_int.rs. (The single-modulo and double-division methods are too slow.) The bitmask method slows down when the range is far from the next power of two, as is happening with the … Using a better range:

```text
name wide_mul ns/iter bitmask_simple ns/iter diff ns/iter diff % speedup
+simd_uniform_i16x16 17,330 (1890 MB/s) 13,225 (2477 MB/s) -4,105 -23.69% x 1.31
+simd_uniform_i16x2 7,075 (578 MB/s) 5,299 (772 MB/s) -1,776 -25.10% x 1.34
+simd_uniform_i16x32 51,725 (1267 MB/s) 29,134 (2249 MB/s) -22,591 -43.68% x 1.78
+simd_uniform_i16x4 6,961 (1176 MB/s) 5,676 (1443 MB/s) -1,285 -18.46% x 1.23
-simd_uniform_i16x8 9,380 (1746 MB/s) 11,650 (1406 MB/s) 2,270 24.20% x 0.81
```

There is a more complex bitmask method which tries to use the remaining bits of the random integer upon rejection. If it could be optimized better, it may be a further improvement.
The biased widening-multiply method is much faster, so it might be worth providing an extra, faster version.
@TheIronBorn let me know when you think this is ready for merging — I'm not sure if you're still tweaking.
If we provide an API which allows …
Still tweaking, but I think either the widening-multiply method or the bitmask method would work well. The bitmask method seems best for …
I think this is fine to merge now, …
When the range is flat (i.e. …) … If you think a varying range could be used often, do speak up.
@TheIronBorn could you squash some of the commits together please (i.e. eliminate most of the fix/refactor commits)?
Squashed.
Not sure what we should do about benchmarks. We really ought to have some benchmarks for `UniformInt`, but there are 36 different vectors we could benchmark.