Implement weighted sampling #72

rstub · 2023-10-07T20:23:03Z

~~w/o replacement is currently implemented in R~~
w/ replacement uses either probabilistic sampling or the alias method

fixes #18
fixes #45
fixes #52

(more) checks / assertions
documentation
implement w/o replacement in C++
special case n = 2 (weighted coin)
validate the n < 1000 * size cut-over point between bitset and hashset
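
The n = 2 special case from the list above can be illustrated with a weighted coin. This is only a minimal sketch under assumptions: `weighted_coin` is a hypothetical name, not part of dqrng's API, and it uses the standard library's generators rather than dqrng's.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not dqrng's API): draw `size` values from {0, 1}
// with replacement, where index 0 has weight w0 and index 1 has weight w1.
template <typename Rng>
std::vector<int> weighted_coin(double w0, double w1, std::size_t size, Rng& rng) {
  assert(w0 >= 0.0 && w1 >= 0.0 && w0 + w1 > 0.0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  const double total = w0 + w1;
  std::vector<int> out(size);
  for (auto& x : out) {
    // u * total is uniform on [0, total), so it falls below w0
    // with probability w0 / total.
    x = (unif(rng) * total < w0) ? 0 : 1;
  }
  return out;
}
```

Calling `weighted_coin(3.0, 1.0, 10, rng)` with, for example, a `std::mt19937_64 rng` should return index 0 about three times as often as index 1. No alias table or acceptance loop is needed, which is presumably why this case is worth handling separately.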

* w/o replacement is currently implemented in R
* w/ replacement uses either probabilistic sampling or the alias method
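
For readers unfamiliar with the alias method mentioned above, here is a rough sketch of the textbook construction (Vose's variant). It is an illustration under assumptions, not the code from this PR: `AliasTable`, `make_alias_table` and `alias_draw` are hypothetical names, and the standard library's generators stand in for dqrng's.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical alias table: column i is kept with probability accept[i],
// otherwise the draw is redirected to alias[i].
struct AliasTable {
  std::vector<double> accept;
  std::vector<std::size_t> alias;
};

AliasTable make_alias_table(const std::vector<double>& prob) {
  const std::size_t n = prob.size();
  assert(n > 0);
  double total = 0.0;
  for (double p : prob) { assert(p >= 0.0); total += p; }
  assert(total > 0.0);

  AliasTable t{std::vector<double>(n), std::vector<std::size_t>(n)};
  std::vector<double> scaled(n);
  std::vector<std::size_t> small, large;
  for (std::size_t i = 0; i < n; ++i) {
    scaled[i] = prob[i] * n / total;  // rescale so the mean weight is 1
    t.alias[i] = i;
    (scaled[i] < 1.0 ? small : large).push_back(i);
  }
  while (!small.empty() && !large.empty()) {
    const std::size_t s = small.back(); small.pop_back();
    const std::size_t l = large.back(); large.pop_back();
    t.accept[s] = scaled[s];
    t.alias[s] = l;
    scaled[l] -= 1.0 - scaled[s];  // l fills the rest of column s
    (scaled[l] < 1.0 ? small : large).push_back(l);
  }
  // Whatever is left has scaled weight 1 up to rounding error.
  for (std::size_t i : small) t.accept[i] = 1.0;
  for (std::size_t i : large) t.accept[i] = 1.0;
  return t;
}

// One draw with replacement: pick a column uniformly, then keep it or
// follow its alias.
template <typename Rng>
std::size_t alias_draw(const AliasTable& t, Rng& rng) {
  std::uniform_int_distribution<std::size_t> column(0, t.accept.size() - 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  const std::size_t i = column(rng);
  return unif(rng) < t.accept[i] ? i : t.alias[i];
}
```

Building the table costs O(n) once, after which every draw is O(1) regardless of how uneven the weights are. That is the usual argument for the alias method when the weight distribution is skewed.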

rstub · 2023-10-07T20:54:52Z

Problematic benchmark from #52 looks much better now:

library(dqrng)

m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)    22.42ms   25.5ms      38.3
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)   7.96ms   8.78ms     114.


m <- 1e1
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)      227µs    245µs     3976.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)    113µs    125µs     7508.

Created on 2023-10-07 with reprex v2.0.2

However, there is still some potential for improvement in the case of uneven weight distribution:

library(dqrng)

m <- 1e6
n <- 1e4
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)     18.3ms   20.5ms      47.5
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)   21.7ms   22.5ms      43.0


m <- 1e1
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)      161µs    189µs     4914.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)    122µs    135µs     7011.

Created on 2023-10-07 with reprex v2.0.2

The set-based no-replacement methods for weighted sampling are similar to the unweighted case. There are two variants: one using stochastic acceptance (fast for an even weight distribution) and one using the alias method. These methods seem to be interesting for selection ratios < 0.5 (again similar to the unweighted case).
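
The stochastic acceptance variant can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `sample_wo_replacement` is a hypothetical name, a `std::unordered_set` stands in for whatever set type the package uses, and the standard library's generators replace dqrng's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: weighted sampling without replacement by drawing
// with replacement and rejecting indices that were already selected.
template <typename Rng>
std::vector<std::size_t> sample_wo_replacement(const std::vector<double>& prob,
                                               std::size_t size, Rng& rng) {
  const std::size_t n = prob.size();
  // At least `size` positive weights are needed, or the loop cannot finish.
  const std::size_t positive = static_cast<std::size_t>(
      std::count_if(prob.begin(), prob.end(), [](double p) { return p > 0.0; }));
  assert(size <= n && positive >= size);
  if (size == 0) return {};

  const double max_prob = *std::max_element(prob.begin(), prob.end());
  std::uniform_int_distribution<std::size_t> index(0, n - 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::unordered_set<std::size_t> seen;
  std::vector<std::size_t> out;
  out.reserve(size);
  while (out.size() < size) {
    const std::size_t i = index(rng);
    // Stochastic acceptance: keep candidate i with probability prob[i] / max_prob.
    if (unif(rng) * max_prob >= prob[i]) continue;
    // Set-based rejection: skip indices that are already part of the sample.
    if (!seen.insert(i).second) continue;
    out.push_back(i);
  }
  return out;
}
```

With even weights almost every candidate is accepted, so the loop is cheap. With one dominant weight, most candidates are rejected by the acceptance test, which matches the observation that this variant is fast mainly for an even weight distribution.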

rstub · 2023-10-12T14:41:11Z

Interestingly, the set-based rejection sampling methods from the last commit perform better than the exponential rank method, at least when size < n/2. As in the w/ replacement case, it is a bit messy to decide whether to use stochastic acceptance or the alias method.
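
For comparison, here is a minimal sketch of what "exponential rank" is assumed to mean here: each item gets the key Exp(1) / weight, and the `size` items with the smallest keys are selected. The function name `sample_exp_rank` is hypothetical and the code is not taken from this PR or from wrswoR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch: weighted sampling without replacement by ranking
// exponentially distributed keys.
template <typename Rng>
std::vector<std::size_t> sample_exp_rank(const std::vector<double>& prob,
                                         std::size_t size, Rng& rng) {
  const std::size_t n = prob.size();
  assert(size <= n);
  std::exponential_distribution<double> expo(1.0);
  std::vector<double> key(n);
  for (std::size_t i = 0; i < n; ++i) {
    assert(prob[i] >= 0.0);
    // Exp(1) / w is exponential with rate w; zero weights get an infinite
    // key so they are ranked last.
    key[i] = prob[i] > 0.0 ? expo(rng) / prob[i]
                           : std::numeric_limits<double>::infinity();
  }
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  // Only the `size` smallest keys have to be in order.
  std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(size),
                    idx.end(),
                    [&key](std::size_t a, std::size_t b) { return key[a] < key[b]; });
  idx.resize(size);
  return idx;
}
```

This approach always generates n keys, however small `size` is. The set-based methods only touch as many items as they draw, which is one plausible reason for them being faster when size < n/2.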

rstub · 2023-10-24T12:46:29Z

For unweighted sampling, the n < 1000 * size cut-over point between bitset and hashset is (still) valid, cf. https://stubner.me/2023/10/algorithms-for-unweighted-sampling-without-replacement/
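
The cut-over can be sketched as follows. This assumes the bitset is used below the threshold and the hashset above it. `BitSet`, `HashSet`, `rejection_sample` and `sample_unweighted` are hypothetical names for this illustration, not dqrng's internal types.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Two minimal set types with the same interface: insert() returns true
// if the value was not yet present.
struct BitSet {
  std::vector<bool> bits;  // one bit per possible value, n bits in total
  explicit BitSet(std::size_t n) : bits(n, false) {}
  bool insert(std::size_t i) {
    if (bits[i]) return false;
    bits[i] = true;
    return true;
  }
};

struct HashSet {
  std::unordered_set<std::size_t> items;  // memory grows with the sample only
  explicit HashSet(std::size_t) {}
  bool insert(std::size_t i) { return items.insert(i).second; }
};

// Draw uniformly from 0..n-1 and reject values that were already drawn.
template <typename Set, typename Rng>
std::vector<std::size_t> rejection_sample(std::size_t n, std::size_t size, Rng& rng) {
  Set seen(n);
  std::uniform_int_distribution<std::size_t> index(0, n - 1);
  std::vector<std::size_t> out;
  out.reserve(size);
  while (out.size() < size) {
    const std::size_t i = index(rng);
    if (seen.insert(i)) out.push_back(i);
  }
  return out;
}

template <typename Rng>
std::vector<std::size_t> sample_unweighted(std::size_t n, std::size_t size, Rng& rng) {
  assert(size <= n);
  if (size == 0) return {};
  // Assumed rule: dense bitset while n < 1000 * size, sparse hashset otherwise.
  if (n < 1000 * size) return rejection_sample<BitSet>(n, size, rng);
  return rejection_sample<HashSet>(n, size, rng);
}
```

The trade-off is memory versus lookup cost: the bitset needs n bits no matter how few items are drawn, while the hashset only stores the drawn items but pays for hashing on every insert.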

Recreate RcppExports.cpp with current development version of Rcpp to fix WARN on CRAN

Merge branch 'master' into feature/weighted-sampling-2 # Conflicts: # DESCRIPTION # NEWS.md

github-actions · 2024-04-22T20:15:36Z

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718d is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions · 2024-04-23T17:05:25Z

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718d is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions · 2024-04-23T17:17:42Z

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 128a3cd is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions · 2024-04-27T19:35:59Z

This is how benchmark results would change (along with a 95% confidence interval in relative change) if c5c07e5 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

rstub · 2024-08-26T05:34:47Z

Something to consider here as well: https://notstatschat.rbind.io/2024/08/26/another-way-to-not-sample-with-replacement/

Implement weighted sampling

b2c96e0

rstub added 8 commits October 7, 2023 23:15

additional test

9a436a7

Reduce false positive misses in test coverage

e7407e2

Implement weighted sampling w/o replacement in C++

dbe3072

Remove unnecessary struct

9c9402b

Replace int with INT

29ce7b4

Add some checks

8414aec

Add fair and biased coin for n == 2

42814ff

Documentation

60cdf8f

rstub mentioned this pull request Oct 11, 2023

References and a memory friendly cccrank method krlmlr/wrswoR#7

Open

rstub added 4 commits October 11, 2023 15:30

Allow for large output size to trigger dqsample_num

ce16f22

Factor out creation of alias table

8ed7d33

Remove a compiler warning

41498c4

Add set-based no-replacement methods for weighted sampling

8c7683c

rstub added 9 commits October 12, 2023 21:20

Add messages to static_assert to not force usage of C++17

54aa1fd

Initial rules when to use which algorithm

36b01c9

Draft documentation

66c7bb7

Test both weighted coin options

60a5303

Add references

06e2ced

Use n/size instead of m/n as arguments

ed142a3

Add more formal references

22c5570

Document n=2 case

17f6b94

Add news and bump version

cd6285f

rstub mentioned this pull request Oct 13, 2023

Improve sample performance for small sets #53

Closed

Fix off-by-one error

f736a66

rstub added 2 commits January 20, 2024 18:57

C++11 does not allow auto with arguments

d7dcac3

Changes from version 0.3.2

386558a

rstub added 6 commits January 27, 2024 21:36

Merge changes from master

92be4ec

Compare the two weights directly for n=2

499ab57

Merge branch 'main' into feature/weighted-sampling-2

81fc04d

Update sampling code to not use deprecated dqrng::uniform01()

d489a1b

Get closer to original touchstone config

557bf06

Use GHA files from styler package

74ef89a

Update performance testing

482687a

Merge branch 'main' into feature/weighted-sampling-2

1162bf0

Add tests with uneven weight distribution

60cb414
