
References and a memory-friendly cccrank method #7

Open · wants to merge 19 commits into `main`
Conversation

vgherard
Copy link

@vgherard vgherard commented Feb 9, 2021

Hi, sorry for the long delay.

I've added a few references for the Gumbel-Max stuff I was mentioning (Yellott appears to be the first one discussing the method).

I also propose a third version of the *rank() algorithms, which uses only O(size) extra space by keeping a small min-heap of the `size` best elements seen so far. Even though this version has worse asymptotic time complexity than the other implementations, in practice it turns out to be faster (at least for the few benchmark points I've tried).

Finally, note that the test matching the error message of base `sample.int()` will fail if the locale is not English. I left a small comment about this.

Hope any of this is useful,
Cheers,

Valerio.

EDIT: I just noticed that the test-coverage workflow has failed, as in your last commit on the master branch. The error occurred during the installation of curl, which in turn tries to install multtest, which was removed from CRAN.

@codecov
Copy link

codecov bot commented Oct 18, 2022

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2022

Thanks, I apologize for my slow response.

Checks pass now.

Kudos on finding the Yellott reference. Do you remember which section(s) would be most relevant to understand the application?

I switched to testthat 3, this should make the tests locale-invariant. Would you mind adding a few tests for the new implementation?

@rstub
Copy link

rstub commented Oct 11, 2023

Hi @vgherard, hi @krlmlr!
Over in daqana/dqrng#72 I am implementing weighted sampling for dqrng. For the w/o replacement case I have at the moment implemented a version of crank, but the other algorithms you have implemented here are interesting as well. Before I adapt them and add you as cph (copyright holders): would you be interested in contributing directly?

BTW, what is the best way to cite "Accelerating weighted random sampling without replacement"? Pre-print with GitHub URL? Or https://ethz.ch/content/dam/ethz/special-interest/baug/ivt/ivt-dam/vpl/reports/1101-1200/ab1141.pdf?

@vgherard
Copy link
Author

vgherard commented Oct 16, 2023

@krlmlr I completely forgot about this merge request, I'm sorry.

In case this is still useful: I went quickly through the reference; the relevant part appears to be the proof of Lemma 6, page 123 of the journal. It is an elementary proof of the fact that taking the max of Gumbel-perturbed log-weights is equivalent to a (single) weighted draw, which is the essence of the sampling algorithm.
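The equivalence in that lemma can be illustrated numerically (a sketch with a hypothetical helper `gumbel_argmax`, not code from the package): the argmax of `log(w_i) + G_i`, with i.i.d. Gumbel(0,1) noise `G_i`, selects index `i` with probability `w_i / sum_j w_j`.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// Illustration of the Gumbel-Max trick: a single weighted draw obtained
// as the argmax of Gumbel-perturbed log-weights.
int gumbel_argmax(const std::vector<double>& w, std::mt19937& rng) {
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  int best = 0;
  double best_key = -std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < w.size(); ++i) {
    double g = -std::log(-std::log(unif(rng)));  // Gumbel(0,1) via inverse CDF
    double key = std::log(w[i]) + g;
    if (key > best_key) {
      best_key = key;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

Repeating the draw many times, the empirical frequencies converge to the normalized weights.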

Regarding testing: I see that the current version of wrswoR automatically tests all the implemented algorithms, and I see no algorithm-specific tests. Did you have something specific in mind?

Thanks and sorry again for the long silence.


@rstub thank you, but from my side I'm quite short of time as you can see from the slow ping-pong here 👀

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

Thanks @rstub and @vgherard.

I don't mind moving the code to dqrng, but you could also forward to this package. The link you shared is as good as it gets; it has also made it into a chapter of my thesis: https://www.research-collection.ethz.ch/handle/20.500.11850/183170 . I haven't tried hard enough to publish this.

I don't have the capacity to contribute here in the foreseeable future, but I can commit to keeping this alive and on CRAN.

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

I force-pushed. @vgherard: can you please also add documentation on how cccrank is different from the other options? Something along the lines of what you have written in the OP should be good enough.

Can we somehow test that the new implementation is correct? I burned several CPU years to get the numbers in my paper...

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

The results in the paper were computed with the help of https://github.com/krlmlr/wrswoR.benchmark.

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

Now looking into my old disk for wrswoR.correctness ...

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

Found them! Let me know if you're interested in the data, it's rather large. The code is smaller.

@vgherard
Copy link
Author

vgherard commented Oct 19, 2023

Hi @krlmlr , I added a few words on the space complexity of CCCrank in the documentation as you suggested.

I didn't mention time complexity, because I realized that you have used std::partial_sort, which according to the documentation has $N \log (\text{size})$ time complexity, in comparison to the $\text{size}\log (N)$ that could be achieved in principle, as you mention in your thesis.

If I'm not confused, this makes the time complexity of all _rank() functions equal (to $N \log (\text{size})$ ), which would explain my observations in the OP.
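For reference, the `std::partial_sort`-based selection under discussion boils down to something like the following (a sketch with an illustrative name `top_indices`, not the package's code); `std::partial_sort` guarantees O(N log size) comparisons, which matches the heap variant rather than the O(size log N) bound achievable in principle.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Select the indexes of the `size` largest keys, ordered largest-first,
// using std::partial_sort (O(N log size) comparisons).
std::vector<int> top_indices(const std::vector<double>& key, std::size_t size) {
  std::vector<int> idx(key.size());
  std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., N-1
  std::partial_sort(idx.begin(), idx.begin() + size, idx.end(),
                    [&](int a, int b) { return key[a] > key[b]; });
  idx.resize(size);
  return idx;
}
```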

I'm looking into validity tests and benchmarks right now - we may also need to fix the statement "*_crank() seems to be faster on most inputs" in the documentation.

@vgherard
Copy link
Author

Concerning testing algorithm correctness: I set up a simple script that simulates hypothesis testing on the equality of sampling distributions (cccrank vs. crank) - I believe this is more or less similar to the approach you used in your thesis.

I did some basic exploration and (thankfully) there's no signal of a striking difference between sampling distributions, although in order to be quantitative we should analyse power.

If you think it's necessary, we can try to run a systematic analysis, although honestly it seems a bit overkill, since the C++ codes defining these functions are very similar.

# ------------------------------------------------ Testing algorithm correctness
test_sample_int <- function(
  f1 = wrswoR::sample_int_cccrank,
  f2 = wrswoR::sample_int_crank,
  
  n = 50,
  size = 25,
  reps = 1e3,
  
  prob = pexp(3 * ppoints(n)),
  delta_prob = 0,
  
  test_stat = ks.test,
  simulate.p.value = TRUE,
  B = 1e3
) {
  # Each replicate() call returns a size x reps matrix of sampled indexes;
  # the two-sample test below treats them as flat vectors.
  s1 <- replicate(reps, f1(n, size, prob))
  s2 <- replicate(reps, f2(n, size, prob + delta_prob))
  
  test <- test_stat(s1, s2, simulate.p.value = simulate.p.value, B = B)
  return(test$p.value)
}

n <- 50

replicate(100, test_sample_int(reps = 1e3)) -> dd
mean(dd < 0.05)  # FPR is actually conservative (less than .05)
summary(dd)

delta <- 1:n / n
replicate(100, test_sample_int(reps = 1e3, delta_prob = delta)) -> dd1
mean(dd1 < 0.05)  # FNR is very small with such a big effect
summary(dd1)

@rstub
Copy link

rstub commented Oct 21, 2023

@krlmlr I cannot just forward to wrswoR, because the main feature of dqrng is the faster RNGs it provides. With the current development release it is possible to register them as "user-supplied RNG", so that e.g. wrswoR would also use them. But this indirection does cost some performance - at least this is the case for other sampling methods; I haven't done any comparisons here yet. Anyway, I will adapt the code and run some benchmarks.

@krlmlr
Copy link
Owner

krlmlr commented Oct 22, 2023

Thanks. Just performing a small sanity check:

set.seed(20231022)
wrswoR::sample_int_rank(10, 10, 3^(10:1))
#>  [1]  1  2  3  4  5  6  8  7 10  9
wrswoR::sample_int_crank(10, 10, 3^(10:1))
#>  [1]  1  2  3  4  5  6  7  8  9 10
wrswoR::sample_int_ccrank(10, 10, 3^(10:1))
#>  [1]  3  1  4  2  5  7  6  8 10  9
wrswoR::sample_int_cccrank(10, 10, 3^(10:1))
#>  [1] 10  9  8  6  5  7  4  3  1  2

Created on 2023-10-22 with reprex v2.0.2

Does cccrank have a different interpretation of `prob`?

@rstub
Copy link

rstub commented Oct 22, 2023

I think the min-heap is filled correctly, but it is then walked from the minimum to the maximum, i.e. the output vector should be filled in reverse order, e.g.

  // Fill result with 'size' indexes k with largest g[k] - equivalent to
  // sampling w/o replacement (Gumbel-Max trick).
  // T ~ O(size * log(size))
  Rcpp::IntegerVector res(size);
  for (size_t k = 0; k < size; ++k) {
    res[size - k - 1] = H.top().index + 1;
    H.pop();
  }
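The pop-order issue can be seen in isolation (a minimal sketch with a hypothetical helper `drain_descending`, not the package's code): a min-heap pops its elements in ascending order, so emitting largest-first requires filling the result back to front, as in the patch above.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Drain a min-heap into a vector ordered largest-first by filling the
// output in reverse (mirrors the reverse-fill fix suggested above).
std::vector<int> drain_descending(
    std::priority_queue<int, std::vector<int>, std::greater<int>> h) {
  std::vector<int> res(h.size());
  for (std::size_t k = 0, n = res.size(); k < n; ++k) {
    res[n - k - 1] = h.top();  // smallest element goes to the back
    h.pop();
  }
  return res;
}
```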

@vgherard
Copy link
Author

😱 Thanks for spotting this. It's as you say: the min-heap was correctly populated with the sampled elements, and the set of returned indexes was correctly distributed - that's why it escaped the test above.

I have made the correction you suggested above, now the sequences look fine:

set.seed(20231023)
wrswoR::sample_int_rank(10, 10, 3^(10:1))
#>  [1]  1  2  3  4  5  6  7  8  9 10
wrswoR::sample_int_crank(10, 10, 3^(10:1))
#>  [1]  3  2  4  1  7  5  6  8  9 10
wrswoR::sample_int_ccrank(10, 10, 3^(10:1))
#>  [1]  2  1  3  4  6  7  5  8  9 10
wrswoR::sample_int_cccrank(10, 10, 3^(10:1))
#>  [1]  1  2  5  3  4  6  7  8  9 10

Created on 2023-10-23 with reprex v2.0.2
