Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] slice_sample does not sample randomly #45139

Open
blongworth opened this issue Dec 31, 2024 · 0 comments
Open

[R] slice_sample does not sample randomly #45139

blongworth opened this issue Dec 31, 2024 · 0 comments

Comments

@blongworth
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

slice_sample() in Arrow 18.1.0 does not randomly sample rows from an arrow dataset. Reprex follows. I'd expect a random sample of rows, so x distributed in 1:5000.

library(dplyr)
library(arrow)

df <- data.frame(x = 1:5000)

ds <- arrow_table(df)

slice_sample(ds, n = 10) |> 
  collect()
#> # A tibble: 10 × 1
#>        x
#>    <int>
#>  1     3
#>  2     4
#>  3    16
#>  4    17
#>  5    65
#>  6    66
#>  7    74
#>  8    93
#>  9   123
#> 10   129

Created on 2024-12-31 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       macOS Sonoma 14.7
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-12-31
#>  pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  arrow       * 18.1.0  2024-12-05 [1] CRAN (R 4.4.1)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.4.0)
#>  bit           4.0.5   2022-11-15 [1] CRAN (R 4.4.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.4.0)
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
#>  digest        0.6.35  2024-03-11 [1] CRAN (R 4.4.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.4.0)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  knitr         1.46    2024-04-06 [1] CRAN (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.0)
#>  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
#>  rmarkdown     2.26    2024-03-05 [1] CRAN (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>  xfun          0.49    2024-10-31 [1] CRAN (R 4.4.1)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

As this issue could be dangerous for someone assuming a random sample, should there be a note in the docs or slice_sample() removed until it's fixed?

Component(s)

R

@kou kou changed the title slice_sample does not sample randomly [R] slice_sample does not sample randomly Jan 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant