Description
Currently, filter():
- Retains TRUE
- Drops FALSE and NA
- This matches subset()
A number of requests have come up in the past desiring:
- Retains TRUE and NA
- Drops FALSE
- This matches [
Here are a few:
- https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr (17k views)
- https://stackoverflow.com/questions/32908589/why-does-dplyrs-filter-drop-na-values-from-a-factor-variable (8k views)
- filter option to keep NAs #6432 (August this year)
- Filtering != treatment of NAs is unintuitive #6013
- highlight that filter removes NA values by default in documentation. #4478 (requesting more docs about this)
- filter not retaining rows with NA values #3196
- Enhancement proposition for filter #812
- Consider adding a ``drop'' rows operation #1527 (more about drop(), but still related to NAs)
- My wife has brought this up to me at least 3 times
This is most noticeably annoying when you have multiple columns to filter by:
library(dplyr)
df <- tibble(
  x = c(TRUE, FALSE, NA, NA, NA),
  y = c(NA, TRUE, NA, NA, NA),
  z = c(TRUE, TRUE, TRUE, FALSE, NA)
)
df
#> # A tibble: 5 × 3
#>   x     y     z
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE
#> 2 FALSE TRUE  TRUE
#> 3 NA    NA    TRUE
#> 4 NA    NA    FALSE
#> 5 NA    NA    NA
filter(df, x, y, z)
#> # A tibble: 0 × 3
#> # … with 3 variables: x <lgl>, y <lgl>, z <lgl>
filter(df, x | is.na(x), y | is.na(y), z | is.na(z))
#> # A tibble: 3 × 3
#>   x     y     z
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE
#> 2 NA    NA    TRUE
#> 3 NA    NA    NA

I propose a .missing = c("drop", "keep", "error") argument to filter() that would allow you to optionally keep rows with NA.
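To make the three proposed modes concrete, here is a minimal base R sketch that works on plain logical vectors. `resolve_missing()` is a hypothetical helper invented for illustration; it is not part of dplyr, and `.missing` here is only the argument proposed in this issue, not an existing filter() argument.

```r
# Hypothetical helper (not dplyr API): resolve NA in a logical condition
# according to the proposed `.missing` modes.
resolve_missing <- function(cond, .missing = c("drop", "keep", "error")) {
  .missing <- match.arg(.missing)
  stopifnot(is.logical(cond))
  na <- is.na(cond)
  if (.missing == "error" && any(na)) {
    stop("Condition contains ", sum(na), " missing value(s).", call. = FALSE)
  }
  # "drop" turns NA into FALSE (today's behaviour); "keep" turns NA into TRUE
  cond[na] <- .missing == "keep"
  cond
}

df <- data.frame(
  x = c(TRUE, FALSE, NA, NA, NA),
  y = c(NA, TRUE, NA, NA, NA),
  z = c(TRUE, TRUE, TRUE, FALSE, NA)
)

# Resolve each condition separately, then combine, as filter(df, x, y, z) would
keep <- resolve_missing(df$x, "keep") &
  resolve_missing(df$y, "keep") &
  resolve_missing(df$z, "keep")
kept <- df[keep, ]

# Today's behaviour: any NA in the combined condition drops the row
dropped <- df[resolve_missing(df$x & df$y & df$z, "drop"), ]
```

Under these assumptions, `kept` holds rows 1, 3, and 5 of `df`, the same rows as the `x | is.na(x)` workaround, and `dropped` has zero rows.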
We'd have to carefully analyze the boolean algebra here to make sure we are being consistent. In particular, I think we want each of the following pairs to return the same result, and I believe they do:
# these should be the same
filter(df, x, y, .missing = "drop")
filter(df, x & y, .missing = "drop")
# these should be the same
filter(df, x, y, .missing = "keep")
filter(df, x & y, .missing = "keep")

The "drop" case is probably already consistent because that is what we do today, and the "keep" case is probably like this, which seems consistent:
na_to_true <- function(x) {
  x[is.na(x)] <- TRUE
  x
}
na_to_true(TRUE & NA)
#> [1] TRUE
na_to_true(TRUE) & na_to_true(NA)
#> [1] TRUE

When we do this, we should also think about whether vec_pall() or vec_pany() could be used in filter() in any way, since they are heavily optimized for performance.
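The consistency question can be checked exhaustively rather than on a single pair. The sketch below runs through all nine combinations of TRUE, FALSE, and NA for two conditions. `na_to_false()` is a name made up here to stand in for the "drop" behaviour. The check only covers `&`, which is how comma-separated conditions are combined.

```r
na_to_true <- function(x) {
  x[is.na(x)] <- TRUE
  x
}

# Made-up counterpart modelling "drop": NA behaves like FALSE
na_to_false <- function(x) {
  x[is.na(x)] <- FALSE
  x
}

# Every pairing of TRUE, FALSE, and NA (9 rows)
grid <- expand.grid(x = c(TRUE, FALSE, NA), y = c(TRUE, FALSE, NA))

# "keep": resolve NA after combining vs. before combining
keep_combined <- na_to_true(grid$x & grid$y)
keep_separate <- na_to_true(grid$x) & na_to_true(grid$y)

# "drop": same comparison
drop_combined <- na_to_false(grid$x & grid$y)
drop_separate <- na_to_false(grid$x) & na_to_false(grid$y)

identical(keep_combined, keep_separate)
#> [1] TRUE
identical(drop_combined, drop_separate)
#> [1] TRUE
```

Both comparisons agree for every combination, so `filter(df, x, y)` and `filter(df, x & y)` would select the same rows under either mode, at least for this simple model of the semantics.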