Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniform float improvements #1289

Merged
merged 6 commits into from
May 1, 2023
Merged

Uniform float improvements #1289

merged 6 commits into from
May 1, 2023

Conversation

dhardy
Copy link
Member

@dhardy dhardy commented Feb 21, 2023

  • Add a (possibly overcomplicated) benchmark for uniform float sampling
  • Add a couple extra tests
  • Add an explicit impl of sample_single_inclusive (~20% perf increase)

Note: sample_single and sample_single_inclusive use
different code paths for sampling.
Around 20% faster for f32, slightly less for f64.
@dhardy
Copy link
Member Author

dhardy commented Feb 21, 2023

Closes #1270. This branch was created to evaluate the mentioned paper: https://hal.science/hal-03282794/document

The paper ignores possible underflow, but notes the following potential issues for uniform FP sampling from a defined range:

  • Available precision may vary throughout the given range due to differing exponent
  • The method we use to sample from [0, 1) can only sample half the possible values just below 1
  • Translating from x in [0,1) to a range [a, b) via common methods (such as a + (b-a)*x) can yield value b due to loss of precision. We have special measures for dealing with this.
  • b - a may overflow. (Note: I personally do not think this a big deal since it is rare to use FP values near the limits of their range.)
  • Most transformations do not produce equidistant FP values. In some cases such as Monte Carlo estimation of PI this can impact the result, but in practice I believe it is rarely an issue.

GSL's method, a(1-x) + bx, translates to a(2-z) + b(z-1) where our samples are z in [1, 2) and x = z - 1. This method was benchmarked, with approx 12% performance loss for f32 single-sampling, 2% loss for f64, 10% loss for f32 distribution sampling, and 19% loss for f64. Only 3 of 24 benchmarks showed a performance gain. An attraction of this method is that it avoids the possible overflow of calculating b - a.

The method using y = 2(a/2 + (b/2 - a/2)x) has accuracy issues on sub-normals. In particular, something like f32::from_bits(3) / 2.0 rounds up, thus the result may exceed the upper bound. This is a more significant issue than overflow, hence this method is disqualified from further analysis.

The paper defines method SESS (Same Exponent Same Sign) which we shall ignore due to the narrow circumstances in which it is applicable.

The paper also defines method γsection for general applicability. This is used for sample_single_inclusive, translated to Rust from the Julia snippet γsectionCC as follows:

let (a, b) = (low, high);

let g = (a.next_up() - a).max(b - b.next_down());
let s = b/g - a/g;
let eps = if a.abs() <= b.abs() {
    -a/g - (s - b/g)
} else {
    b /g - (s + a/g)
};
let si = s.ceil() as $uty;
let hi = if s != si as $ty {
    si
} else {
    si + (eps > 0.0) as $uty
};

let k = (0..=hi).sample_single(rng).unwrap();
Ok(if a.abs() <= b.abs() {
    if k == hi {
        a
    } else {
        b - (k as $ty) *g
    }
} else {
    if k == hi {
        b
    } else {
        a + (k as $ty) *g
    }
})

Results are around 10 times slower for f32 and 6 times for f64:

$ cargo bench --bench uniform_float --features small_rng -- -b baseline inclusive
   Compiling rand v0.9.0 (/home/dhardy/projects/rand/rand)
    Finished bench [optimized] target(s) in 1.47s
     Running benches/uniform_float.rs (target/release/deps/uniform_float-177b17dd872a054a)
single_random_f32/SmallRng/sample_inclusive
                        time:   [17.293 ns 17.337 ns 17.377 ns]
                        change: [+928.42% +930.48% +934.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7868 outliers among 100000 measurements (7.87%)
  7868 (7.87%) high mild
single_random_f32/ChaCha8Rng/sample_inclusive
                        time:   [18.027 ns 18.072 ns 18.118 ns]
                        change: [+832.83% +835.21% +837.39%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5987 outliers among 100000 measurements (5.99%)
  5987 (5.99%) high mild
single_random_f32/Pcg32/sample_inclusive
                        time:   [17.056 ns 17.096 ns 17.139 ns]
                        change: [+913.05% +916.00% +918.69%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7752 outliers among 100000 measurements (7.75%)
  7752 (7.75%) high mild
single_random_f32/Pcg64/sample_inclusive
                        time:   [17.796 ns 17.838 ns 17.880 ns]
                        change: [+893.31% +895.96% +898.01%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5205 outliers among 100000 measurements (5.21%)
  5205 (5.21%) high mild

single_random_f64/SmallRng/sample_inclusive
                        time:   [22.094 ns 22.143 ns 22.189 ns]
                        change: [+571.29% +572.87% +574.19%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6735 outliers among 100000 measurements (6.74%)
  6735 (6.74%) high mild
single_random_f64/ChaCha8Rng/sample_inclusive
                        time:   [23.074 ns 23.124 ns 23.173 ns]
                        change: [+445.36% +446.46% +447.50%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3917 outliers among 100000 measurements (3.92%)
  3917 (3.92%) high mild
single_random_f64/Pcg32/sample_inclusive
                        time:   [22.403 ns 22.448 ns 22.493 ns]
                        change: [+570.28% +571.57% +573.10%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4182 outliers among 100000 measurements (4.18%)
  4182 (4.18%) high mild
single_random_f64/Pcg64/sample_inclusive
                        time:   [22.518 ns 22.565 ns 22.611 ns]
                        change: [+518.80% +520.10% +521.35%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5656 outliers among 100000 measurements (5.66%)
  5656 (5.66%) high mild

In comparison, our current implementations benchmark at 1-1.5ns for distribution sampling, ~2ns for single f32 sampling and ~4ns for single f64 sampling on an AMD 5800X at 3.8GHz.

The paper's implementation of SESS benchmarked at 10ns per value for both f32 and f64, while most other methods benchmarked at approx 15ns per value on an Intel 6700HQ at 2.6GHz. It would appear that there is room for optimisation both in the SESS implementation above and in the other implementations as used in the paper, however I am doubtful that the method could perform competitively with our current method.

Note also that #531 should (if I remember correctly) be able to offer more precise sampling and with similar performance, thus I do not see much benefit in considering the γsection method further.

@dhardy
Copy link
Member Author

dhardy commented Feb 21, 2023

Test failure should be resolved by #1287 (use of feature small_rng for testing benchmarks).

@vks could you review please?

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Feb 22, 2023

Are there any significant optimizations we can make for the [0,1) case?

@dhardy
Copy link
Member Author

dhardy commented Feb 23, 2023

I didn't look at that here, but I don't think so (we essentially just mask and bit-cast an integer, then subtract 1).

benches/uniform_float.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@vks vks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I only have some small questions about the benchmarks.

benches/uniform_float.rs Show resolved Hide resolved
benches/uniform_float.rs Show resolved Hide resolved
src/distributions/uniform.rs Outdated Show resolved Hide resolved
dhardy added 2 commits March 18, 2023 09:54
Also avoid using internal API of Uniform distribution.
@vks vks merged commit d4a2945 into rust-random:master May 1, 2023
@dhardy dhardy deleted the uniform-float branch July 8, 2024 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants