Implement weighted sampling #72

Open
wants to merge 34 commits into
base: main

@rstub rstub commented Oct 7, 2023

  • w/o replacement is currently implemented in R
  • w/ replacement uses either probabilistic sampling or the alias method
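For readers unfamiliar with the second strategy, here is a minimal sketch of the alias method (Vose's variant) in plain C++. It is not the code from this PR: `AliasTable`, `build_alias_table` and `alias_draw` are hypothetical names, indices are zero-based, and `<random>` stands in for the package's own generators.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Slot i yields i with probability prob[i], otherwise alias[i].
struct AliasTable {
  std::vector<double> prob;
  std::vector<std::size_t> alias;
};

// Build the table in O(m). Assumes non-negative weights with a positive sum.
inline AliasTable build_alias_table(const std::vector<double>& weights) {
  const std::size_t m = weights.size();
  assert(m > 0);
  double sum = 0.0;
  for (double w : weights) sum += w;
  assert(sum > 0.0);

  AliasTable t{std::vector<double>(m), std::vector<std::size_t>(m)};
  std::vector<std::size_t> small, large;
  for (std::size_t i = 0; i < m; ++i) {
    t.prob[i] = weights[i] * m / sum;  // rescale so that the mean is 1
    t.alias[i] = i;
    (t.prob[i] < 1.0 ? small : large).push_back(i);
  }
  // Pair each under-full slot with an over-full one that donates the rest.
  while (!small.empty() && !large.empty()) {
    const std::size_t s = small.back(); small.pop_back();
    const std::size_t l = large.back(); large.pop_back();
    t.alias[s] = l;
    t.prob[l] = (t.prob[l] + t.prob[s]) - 1.0;
    (t.prob[l] < 1.0 ? small : large).push_back(l);
  }
  // Whatever is left is 1 up to rounding error.
  for (std::size_t i : small) t.prob[i] = 1.0;
  for (std::size_t i : large) t.prob[i] = 1.0;
  return t;
}

// One draw costs one uniform integer and one uniform real,
// independent of how the weights are distributed.
template <class Rng>
std::size_t alias_draw(const AliasTable& t, Rng& rng) {
  std::uniform_int_distribution<std::size_t> slot(0, t.prob.size() - 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  const std::size_t i = slot(rng);
  return unif(rng) < t.prob[i] ? i : t.alias[i];
}
```

The table costs O(m) time and memory to build, so it pays off mainly when many draws are taken from the same weights.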

fixes #18
fixes #45
fixes #52

  • (more) checks / assertions
  • documentation
  • implement w/o replacement in C++
  • special case n = 2 (weighted coin)
  • validate the n < 1000 * size cut-over point between bitset and hashset

rstub commented Oct 7, 2023

The problematic benchmark from #52 looks much better now:

library(dqrng)

m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)    22.42ms   25.5ms      38.3
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)   7.96ms   8.78ms     114.


m <- 1e1
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)      227µs    245µs     3976.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)    113µs    125µs     7508.

Created on 2023-10-07 with reprex v2.0.2

However, there is still some potential for improvement in the case of uneven weight distribution:

library(dqrng)

m <- 1e6
n <- 1e4
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)     18.3ms   20.5ms      47.5
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)   21.7ms   22.5ms      43.0


m <- 1e1
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)      161µs    189µs     4914.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)    122µs    135µs     7011.

Created on 2023-10-07 with reprex v2.0.2
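To illustrate why uneven weights are harder, here is a rough C++ sketch of sampling with replacement via stochastic acceptance, assuming that this is what "probabilistic sampling" refers to. A uniformly chosen candidate `i` is accepted with probability `w[i] / max(w)`, so the expected number of candidates per accepted draw is `max(w) / mean(w)`. The function names are hypothetical, indices are zero-based, and this is not the PR's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Expected number of candidates per accepted draw: max(w) / mean(w).
inline double expected_trials(const std::vector<double>& w) {
  assert(!w.empty());
  const double max_w = *std::max_element(w.begin(), w.end());
  const double mean_w = std::accumulate(w.begin(), w.end(), 0.0) / w.size();
  return max_w / mean_w;
}

// Weighted sampling with replacement by stochastic acceptance.
// Assumes non-negative weights with at least one positive entry.
template <class Rng>
std::vector<std::size_t> sample_stochastic_acceptance(
    const std::vector<double>& w, std::size_t size, Rng& rng) {
  assert(!w.empty());
  const double max_w = *std::max_element(w.begin(), w.end());
  assert(max_w > 0.0);
  std::uniform_int_distribution<std::size_t> pick(0, w.size() - 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::vector<std::size_t> out;
  out.reserve(size);
  while (out.size() < size) {
    const std::size_t i = pick(rng);                    // uniform candidate
    if (unif(rng) * max_w < w[i]) out.push_back(i);     // accept w.p. w[i]/max_w
  }
  return out;
}
```

For the uneven weights used above (a permutation of 1..m with the largest entry replaced by m * m) this ratio is about 6.9 for m = 10 and about 6.7e5 for m = 1e6, whereas it is close to 2 for uniform random weights. A method whose per-draw cost does not depend on the weight spread, such as the alias method, is therefore the more plausible choice for large, uneven inputs.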

Similar to the unweighted case. There are two variants, one using stochastic acceptance (fast for an even weight distribution) and one using the alias method. These methods seem to be interesting for selection ratios < 0.5 (again similar to the unweighted case).
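As a rough illustration, assuming these variants are the set-based rejection samplers mentioned in the following comment, the stochastic-acceptance flavour could look like the C++ sketch below. Candidates that were already selected are rejected via a hash set, so each accepted element is drawn proportionally to its weight among the remaining ones. `sample_wor_rejection` is a hypothetical name, indices are zero-based, and the real code may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Weighted sampling without replacement: stochastic acceptance plus a set
// of already selected indices. Needs at least `size` positive weights,
// otherwise the loop could not terminate.
template <class Rng>
std::vector<std::size_t> sample_wor_rejection(
    const std::vector<double>& w, std::size_t size, Rng& rng) {
  const auto positive = static_cast<std::size_t>(
      std::count_if(w.begin(), w.end(), [](double x) { return x > 0.0; }));
  assert(size <= positive);
  std::vector<std::size_t> out;
  if (size == 0) return out;
  out.reserve(size);

  const double max_w = *std::max_element(w.begin(), w.end());
  std::uniform_int_distribution<std::size_t> pick(0, w.size() - 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::unordered_set<std::size_t> seen;
  while (out.size() < size) {
    const std::size_t i = pick(rng);
    if (seen.count(i) != 0) continue;   // already selected: reject
    if (unif(rng) * max_w < w[i]) {     // accept with probability w[i]/max_w
      seen.insert(i);
      out.push_back(i);
    }
  }
  return out;
}
```

The rejection rate grows as more of the population is used up, which fits the observation that these methods are mainly interesting for small selection ratios.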

rstub commented Oct 12, 2023

Interestingly, the methods doing set-based rejection sampling from the last commit perform better than the exponential rank method, at least when size < n/2. Similar to the w/ replacement case, it is not clear-cut whether to use stochastic acceptance or the alias method.
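For comparison, a minimal C++ sketch of the exponential rank idea, assuming the term refers to ranking elements by an Exp(1) variate divided by their weight and keeping the `size` smallest keys. `sample_wor_exp_rank` is a hypothetical name and indices are zero-based.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Weighted sampling without replacement via exponential ranks:
// key_i = Exp(1) / w_i, the `size` smallest keys form the sample.
template <class Rng>
std::vector<std::size_t> sample_wor_exp_rank(
    const std::vector<double>& w, std::size_t size, Rng& rng) {
  std::exponential_distribution<double> expo(1.0);
  std::vector<std::pair<double, std::size_t>> keyed;
  keyed.reserve(w.size());
  for (std::size_t i = 0; i < w.size(); ++i) {
    // Zero-weight elements can never be selected, so they get no key.
    if (w[i] > 0.0) keyed.emplace_back(expo(rng) / w[i], i);
  }
  assert(size <= keyed.size());
  // Only the first `size` ranks are needed.
  std::partial_sort(keyed.begin(),
                    keyed.begin() + static_cast<std::ptrdiff_t>(size),
                    keyed.end());
  std::vector<std::size_t> out;
  out.reserve(size);
  for (std::size_t k = 0; k < size; ++k) out.push_back(keyed[k].second);
  return out;
}
```

This always generates one variate per element with positive weight, regardless of `size`, which is one plausible reason why rejection-based methods win for small selection ratios.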

rstub commented Oct 24, 2023

For unweighted sampling, the n < 1000 * size cut-over point between bitset and hashset is (still) valid, cf. https://stubner.me/2023/10/algorithms-for-unweighted-sampling-without-replacement/
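The cut-over rule can be sketched in C++ as follows. This assumes that the bitset is used below the threshold (dense case, memory proportional to n) and the hash set above it; the names are hypothetical and `std::vector<bool>` merely stands in for a bitset.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

enum class SeenSet { Bitset, Hashset };

// Assumed direction of the rule: bitset when n < 1000 * size.
inline SeenSet choose_seen_set(std::size_t n, std::size_t size) {
  return n < 1000 * size ? SeenSet::Bitset : SeenSet::Hashset;
}

// Unweighted rejection sampling without replacement from 0..n-1.
template <class Rng>
std::vector<std::size_t> sample_unweighted_wor(std::size_t n, std::size_t size,
                                               Rng& rng) {
  assert(size <= n);
  std::vector<std::size_t> out;
  if (size == 0) return out;
  out.reserve(size);
  std::uniform_int_distribution<std::size_t> pick(0, n - 1);
  if (choose_seen_set(n, size) == SeenSet::Bitset) {
    std::vector<bool> seen(n, false);  // one bit per population element
    while (out.size() < size) {
      const std::size_t i = pick(rng);
      if (!seen[i]) { seen[i] = true; out.push_back(i); }
    }
  } else {
    std::unordered_set<std::size_t> seen;  // memory grows with size, not n
    while (out.size() < size) {
      const std::size_t i = pick(rng);
      if (seen.insert(i).second) out.push_back(i);
    }
  }
  return out;
}
```

Both branches draw from the same distribution; they differ only in how already selected values are tracked.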

Recreate RcppExports.cpp with current development version of Rcpp to fix WARN on CRAN

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718d is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 128a3cd is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

This is how benchmark results would change (along with a 95% confidence interval in relative change) if c5c07e5 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation.

rstub commented Aug 26, 2024
