`filter(.missing = )` option to optionally retain missing values

Currently, `filter()`:
- Retains `TRUE`
- Drops `FALSE` and `NA`
- This matches `subset()`

A number of requests have come up in the past asking for a variant that instead:
- Retains `TRUE` and `NA`
- Drops `FALSE`
- This matches `[`

Here are a few:
* https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr (17k views)
* https://stackoverflow.com/questions/32908589/why-does-dplyrs-filter-drop-na-values-from-a-factor-variable (8k views)
* https://github.com/tidyverse/dplyr/issues/6432 (August this year)
* https://github.com/tidyverse/dplyr/issues/6013
* https://github.com/tidyverse/dplyr/issues/4478 (requesting more docs about this)
* https://github.com/tidyverse/dplyr/issues/3196
* https://github.com/tidyverse/dplyr/issues/812
* https://github.com/tidyverse/dplyr/issues/1527 (more about drop(), but still related to NAs)
* My wife has brought this up to me at least 3 times

This is most noticeably annoying when you have multiple columns to filter by:

``` r
library(dplyr)

df <- tibble(
  x = c(TRUE, FALSE, NA, NA, NA),
  y = c(NA, TRUE, NA, NA, NA),
  z = c(TRUE, TRUE, TRUE, FALSE, NA)
)
df
#> # A tibble: 5 × 3
#>   x     y     z    
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 FALSE TRUE  TRUE 
#> 3 NA    NA    TRUE 
#> 4 NA    NA    FALSE
#> 5 NA    NA    NA

filter(df, x, y, z)
#> # A tibble: 0 × 3
#> # … with 3 variables: x <lgl>, y <lgl>, z <lgl>

filter(df, x | is.na(x), y | is.na(y), z | is.na(z))
#> # A tibble: 3 × 3
#>   x     y     z    
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 NA    NA    TRUE 
#> 3 NA    NA    NA
```

I propose a `.missing = c("drop", "keep", "error")` argument to `filter()` that would allow you to optionally keep rows with `NA`.
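To make the three options concrete, here is a rough sketch of the semantics as a standalone wrapper. `filter_missing()` is a hypothetical name used only for illustration, it is not part of dplyr, and it ignores grouping and most of the input checking that `filter()` performs.

``` r
library(dplyr)

# Hypothetical helper illustrating the proposed `.missing` semantics.
# Not dplyr's implementation; grouped data frames are not handled.
filter_missing <- function(.data, ..., .missing = c("drop", "keep", "error")) {
  .missing <- match.arg(.missing)
  conditions <- rlang::enquos(...)

  keep <- rep(TRUE, nrow(.data))
  for (condition in conditions) {
    # Evaluate each condition with the columns of `.data` in scope
    value <- rlang::eval_tidy(condition, data = .data)
    stopifnot(is.logical(value))

    if (.missing == "error" && anyNA(value)) {
      stop("A filter condition returned `NA`.", call. = FALSE)
    }
    if (.missing == "keep") {
      value[is.na(value)] <- TRUE
    }
    keep <- keep & value
  }

  # `which()` drops any remaining `NA`, which gives the "drop" behavior
  .data[which(keep), , drop = FALSE]
}

df <- tibble(
  x = c(TRUE, FALSE, NA, NA, NA),
  y = c(NA, TRUE, NA, NA, NA),
  z = c(TRUE, TRUE, TRUE, FALSE, NA)
)

nrow(filter_missing(df, x, y, z, .missing = "drop"))
#> [1] 0
nrow(filter_missing(df, x, y, z, .missing = "keep"))
#> [1] 3
```

In this sketch, `"keep"` converts `NA` to `TRUE` per condition before the conditions are combined, which is the interpretation examined further down.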

We'd have to analyze the boolean algebra carefully to make sure this is consistent. In particular, passing conditions as separate arguments should give the same result as combining them with `&`, and I think it does:

``` r
# these should be the same
filter(df, x, y, .missing = "drop")
filter(df, x & y, .missing = "drop")

# these should be the same
filter(df, x, y, .missing = "keep")
filter(df, x & y, .missing = "keep")
```

The `"drop"` case is probably already consistent because that is what we do today. The `"keep"` case probably amounts to converting `NA` to `TRUE`, as below, which also seems consistent:

``` r
na_to_true <- function(x) {
  x[is.na(x)] <- TRUE
  x
}

na_to_true(TRUE & NA)
#> [1] TRUE
na_to_true(TRUE) & na_to_true(NA)
#> [1] TRUE
```
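The single case above can be extended to every pairing of `TRUE`, `FALSE`, and `NA`. This is a minimal base R sketch, assuming `"keep"` really does amount to `na_to_true()` and `"drop"` to a matching `na_to_false()` (a hypothetical helper introduced here for the check):

``` r
na_to_true <- function(x) {
  x[is.na(x)] <- TRUE
  x
}

# Hypothetical counterpart modelling the "drop" case
na_to_false <- function(x) {
  x[is.na(x)] <- FALSE
  x
}

# All 9 combinations of TRUE / FALSE / NA for two conditions
values <- c(TRUE, FALSE, NA)
x <- rep(values, times = 3)
y <- rep(values, each = 3)

# "keep": combining first and converting first agree in every case
identical(na_to_true(x & y), na_to_true(x) & na_to_true(y))
#> [1] TRUE

# "drop": same check with NA treated as FALSE
identical(na_to_false(x & y), na_to_false(x) & na_to_false(y))
#> [1] TRUE
```

This only covers conditions joined with `&`, which is how separate `filter()` arguments are combined; it says nothing about `|` or `!` inside a single condition.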

---

When we do this, we should also think about whether `vec_pall()` or `vec_pany()` could be used in `filter()` in any way, since they are heavily optimized for performance.
