Add KS tests for weighted sampling #1530

Merged: 21 commits into rust-random:master, Nov 26, 2024

Conversation

dhardy (Member) commented Nov 18, 2024

  • Added a CHANGELOG.md entry

Motivation

Some of these are non-trivial distributions we didn't really test before.

To validate the solution of #1476.

Details

Single-element weighted sampling is simple enough.

fn choose_two_iterator is also simple enough: there are no weights, so we can just assign each pair of results a unique index in the list of 100 * 99 / 2 possibilities (noting that we sort pairs, since the order of chosen elements is not specified).
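
A minimal sketch of that flattening (hypothetical helper, not the test's actual code): for n elements, a sorted pair (a, b) with a < b gets a unique index in 0..n*(n-1)/2.

```rust
/// Map a sorted pair (a, b) with a < b < n to a unique index in 0..n*(n-1)/2.
/// Illustrative helper only; the PR's test code may differ.
fn pair_index(n: usize, a: usize, b: usize) -> usize {
    debug_assert!(a < b && b < n);
    // Pairs starting with 0 occupy the first n-1 slots, pairs starting with 1
    // the next n-2, and so on; within a group, b - a - 1 is the offset.
    a * (2 * n - a - 1) / 2 + (b - a - 1)
}
```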

fn choose_two_weighted_indexed gets a bit more complicated; I chose to approach it by building a table for the CDF of size num*num, including impossible variants. Most of the tests don't pass, so there must be a mistake here.
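
One way such a table could be built (an illustrative sketch only, assuming f64 weights and the flat index i * num + j for the pair (i, j); impossible cells simply repeat the running total):

```rust
/// Build a CDF over all num*num ordered cells, where only cells with i < j
/// (the sorted pair {i, j}) carry probability. Illustrative sketch only.
fn pair_cdf(weights: &[f64]) -> Vec<f64> {
    let num = weights.len();
    let total: f64 = weights.iter().sum();
    let mut cdf = Vec::with_capacity(num * num);
    let mut acc = 0.0;
    for i in 0..num {
        for j in 0..num {
            if i < j {
                // Probability of drawing {i, j} in either order when sampling
                // two elements without replacement.
                let p_ij = weights[i] / total * (weights[j] / (total - weights[i]));
                let p_ji = weights[j] / total * (weights[i] / (total - weights[j]));
                acc += p_ij + p_ji;
            }
            // Cells with i >= j are impossible once pairs are sorted.
            cdf.push(acc);
        }
    }
    cdf
}
```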

Aside: using let key = rng.random::<f64>().ln() / weight; (src/seq/index.rs:392) may help with #1476 but does not fix the above.

dhardy (Member, Author) commented Nov 19, 2024

I can confirm that choose_multiple_weighted has a significant problem, since sampling two elements from 0, 1, 2 with weights 1, 1/2, 1/3 a million times and sorting yields 532298 counts of (0, 1), 338524 counts of (0, 2) and 129178 counts of (1, 2). (Unlike #1476, this example does not require very small weights.)

This is sampling without replacement, so the expected counts (a quick calculation follows the list) are:

  • (0, 1) or (1, 0): 531818
  • (0, 2) or (2, 0): 339393
  • (1, 2) or (2, 1): 128788
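
For a quick check: the total weight is W = 1 + 1/2 + 1/3 = 11/6, and the probability of the sorted pair (0, 1) is (1/W)·((1/2)/(W − 1)) + ((1/2)/W)·(1/(W − 1/2)) = 18/55 + 9/44 ≈ 0.531818, i.e. about 531818 per million; the same computation gives 56/165 ≈ 0.33939 for (0, 2) and 17/132 ≈ 0.12879 for (1, 2).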

dhardy marked this pull request as ready for review November 19, 2024 11:02
dhardy (Member, Author) commented Nov 19, 2024

I fixed my calculation of the CDF, found a variant which failed like #1476, fixed this by taking the logarithm of keys, and applied some optimisation to the Efraimidis-Spirakis algorithm.
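
For reference, a minimal sketch of the log-key form (illustrative only; the actual fn sample_efraimidis_spirakis in src/seq/index.rs differs): the classic Efraimidis-Spirakis key u^(1/w) underflows to zero for very small weights (the #1476 failure mode), while ln(u)/w is a monotone transform of it and therefore selects the same elements.

```rust
use rand::Rng;

/// Log-form Efraimidis-Spirakis key: orders candidates the same way as
/// u.powf(1.0 / weight) but stays representable for tiny weights.
/// Hypothetical helper for illustration.
fn es_log_key<R: Rng + ?Sized>(rng: &mut R, weight: f64) -> f64 {
    let u: f64 = rng.random(); // u ~ U(0, 1)
    u.ln() / weight
}
```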

dhardy mentioned this pull request Nov 23, 2024
benjamin-lieser (Collaborator) left a comment:

Looks correct. We might want to test the performance for large amounts.

distr_test/tests/weighted.rs:
}

#[test]
fn choose_two_weighted_indexed() {
benjamin-lieser (Collaborator):

This is probably more complex than needed, but looks correct.
It's probably worth implementing chi squared at some point, but this should also be quite sensitive.

dhardy (Member, Author):

> This is probably more complex than needed, but looks correct.

You mean the use of an Adapter? Yes, but I'd sooner do this than revise the KS test API (which is well adapted for other usages).

> It's probably worth implementing chi squared at some point, but this should also be quite sensitive.

A fair point.

benjamin-lieser (Collaborator):

I mean using KS for these distributions (chi squared would be more straightforward); the Adapter I think is fine.

src/seq/index.rs: three resolved review threads (outdated)
dhardy (Member, Author) commented Nov 26, 2024

@benjamin-lieser I addressed most of your feedback, using BinaryHeap as suggested (a sketch of the idea follows the timings below). Best-of-three runs before this change (a039a7f):

running 3 tests
test choose_weighted_indexed ... ok
test choose_one_weighted_indexed ... ok
test choose_two_weighted_indexed ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 3.69s

Best-of-three after (831353f):

running 3 tests
test choose_weighted_indexed ... ok
test choose_one_weighted_indexed ... ok
test choose_two_weighted_indexed ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 3.98s

Yes, that's a regression, though given that we're using amount ∊ {1, 2} these tests aren't very representative benchmarks. I agree that in theory this approach should scale better to large amounts.
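
A minimal sketch of the BinaryHeap idea (hypothetical helper names; the real fn sample_efraimidis_spirakis differs): a min-heap of capacity amount keeps only the amount largest keys seen so far, so memory use is proportional to amount rather than length.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

// f64 is not Ord, so wrap (key, index) with a total order on the key.
#[derive(PartialEq)]
struct Candidate(f64, usize);
impl Eq for Candidate {}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// Retain the `amount` candidates with the largest keys using O(amount) memory.
fn top_amount(keys: impl Iterator<Item = (f64, usize)>, amount: usize) -> Vec<usize> {
    // Min-heap via Reverse: the root holds the smallest retained key.
    let mut heap: BinaryHeap<Reverse<Candidate>> = BinaryHeap::with_capacity(amount);
    for (key, index) in keys {
        if heap.len() < amount {
            heap.push(Reverse(Candidate(key, index)));
        } else if heap.peek().map_or(false, |Reverse(min)| key > min.0) {
            // Evict the smallest retained key in favour of the larger one.
            heap.pop();
            heap.push(Reverse(Candidate(key, index)));
        }
    }
    heap.into_iter().map(|Reverse(Candidate(_, i))| i).collect()
}
```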

dhardy (Member, Author) commented Nov 26, 2024

I added a benchmark (using weights of 1.0 everywhere, which is probably the worst case for the jumps in the A-ExpJ algorithm).

Edit: I removed the outlier reports (4-19% outliers; clearly my machine is not ideally configured for benchmarks, but it is still good enough for our conclusions).

Before this PR (0ff946c; 554d331 is similar):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [2.8688 µs 2.8842 µs 2.9014 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [5.0881 µs 5.1248 µs 5.1672 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [611.72 ns 614.87 ns 618.17 ns]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [695.34 ns 699.24 ns 703.74 ns]

At d645952 (the first change to fn sample_efraimidis_spirakis, which is just taking the logarithm of the key to fix #1476):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [6.7046 µs 6.7319 µs 6.7612 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [8.9848 µs 9.0293 µs 9.0784 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [1.0019 µs 1.0062 µs 1.0108 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [1.0887 µs 1.0953 µs 1.1020 µs]

At b806b29 ("keep at most amount candidates", using sort_unstable at each addition beyond the first amount):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [5.8209 µs 5.8382 µs 5.8574 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [4.7545 ms 4.7821 ms 4.8133 ms]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [2.0914 µs 2.0966 µs 2.1028 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [33.773 µs 33.887 µs 34.015 µs]

At a039a7f (0f662b1 is similar):

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [1.0601 µs 1.0626 µs 1.0652 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [450.93 µs 451.58 µs 452.26 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [1.7520 µs 1.7559 µs 1.7602 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [7.6607 µs 7.6817 µs 7.7069 µs]

At 831353f:

seq_slice_choose_multiple_weighted_1_of_1000
                        time:   [1.0498 µs 1.0509 µs 1.0523 µs]
seq_slice_choose_multiple_weighted_950_of_1000
                        time:   [29.491 µs 29.545 µs 29.618 µs]
seq_slice_choose_multiple_weighted_10_of_100
                        time:   [1.7660 µs 1.7695 µs 1.7738 µs]
seq_slice_choose_multiple_weighted_90_of_100
                        time:   [2.9548 µs 2.9593 µs 2.9649 µs]

That's not exactly what I expected (my previous changes were guided by the time to run the KS test).

According to the benchmarks we would be better off with the code of d645952; however, one thing not represented here is memory usage: that old version of fn sample_efraimidis_spirakis stores length elements, while more recent commits store only amount elements.

I therefore think we should keep using this latest version.

benjamin-lieser (Collaborator) left a comment:

I agree that we should use this version. The other one might be faster in our benchmarks where length and amount are not too large, but it is not true to the idea of the algorithm and users might expect it to have the correct memory footprint.

dhardy merged commit bf9d429 into rust-random:master on Nov 26, 2024 (16 checks passed).
dhardy added commits to dhardy/rand referencing this pull request on Nov 26, 2024.