Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidyselect + where VS if_any/if_all, and error messages #5732

Closed
jthomasmock opened this issue Feb 3, 2021 · 4 comments · Fixed by #5847
Closed

tidyselect + where VS if_any/if_all, and error messages #5732

jthomasmock opened this issue Feb 3, 2021 · 4 comments · Fixed by #5847
Labels
feature a feature request or enhancement selection 🧺 tidyselect, scoped verbs, etc.

Comments

@jthomasmock
Copy link

Number one, extremely happy to have the new features for filter and/or mutate() with if_any/if_all!!! This feature was not something I expected but am so excited to have!!!

🥳

There is one problem I'd like to raise and one question:

Problem: The error message for using if_any() without a column selector results in some potentially bad recommendations (related to my question as well).

Question: What is the "correct" way of select-ing columns based on values within that column?


While I understand the use of if_any with filter(), I'm also interested in the similar option for select(). when cleaning data or working with "wide" data that I'm putting into a table I fairly often want to select() specific columns that contain a specific value/string/range/etc. IE just knowing it's numeric or text is not sufficient, but rather I'm interested in specifics of the column's contents.

I have a reprex below that shows some of my testing steps in trying to answer this question.

library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)
#> Warning: package 'palmerpenguins' was built under R version 4.0.2


# Filter specific columns 

ex1 <- penguins %>% 
  filter(if_any(contains("bill"), ~.x == 39.1| .x == 18.7))

ex1
#> # A tibble: 6 x 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           39.1          18.7              181        3750 male 
#> 2 Adelie  Biscoe           37.7          18.7              180        3600 male 
#> 3 Adelie  Dream            39            18.7              185        3650 male 
#> 4 Chinst… Dream            45.4          18.7              188        3525 fema…
#> 5 Chinst… Dream            51.5          18.7              187        3250 male 
#> 6 Chinst… Dream            50.2          18.7              198        3775 fema…
#> # … with 1 more variable: year <int>

# I realize this is a toy example 
# that could be solved by normal filter
ex2 <- penguins %>% 
  filter(bill_length_mm == 39.1| bill_depth_mm == 18.7)

all.equal(ex1, ex2)
#> [1] TRUE

# Skip tidyselect ---------------------------------------------------------

# What happens when you don't indicate tidyselect or colname?
# Throws a useful error but in this case leads down
# a not very useful path
penguins %>% 
  filter(if_any(~.x == 39.1))
#> Error: Problem with `filter()` input `..1`.
#> x Formula shorthand must be wrapped in `where()`.
#> 
#>   # Bad
#>   data %>% select(~.x == 39.1)
#> 
#>   # Good
#>   data %>% select(where(~.x == 39.1))
#> ℹ Input `..1` is `if_any(~.x == 39.1)`.

# So then let's follow the recommendation of:
# data %>% select(where(~.x == 39.1))
# this fails because it's not vectorized I believe?

penguins %>% 
  select(where(~.x == 39.1))
#> Error: `where()` must be used with functions that return `TRUE` or `FALSE`.

# Still fails with everything() indicated
penguins %>% 
  select(where(if_any(everything(), ~.x == 39.1)))
#> Error: `across()` must only be used inside dplyr verbs.

# Use base::any ---------------------------------------------------------

# So for this behavior, I want to select a specific column
# by values present in that column
# this WILL work if you use base::any() without any extra work!!!
# But I wanted to make sure this is a good idea?

penguins %>% 
  select(where(~any(.x == 36.7, na.rm = TRUE)))
#> # A tibble: 344 x 1
#>    bill_length_mm
#>             <dbl>
#>  1           39.1
#>  2           39.5
#>  3           40.3
#>  4           NA  
#>  5           36.7
#>  6           39.3
#>  7           38.9
#>  8           39.2
#>  9           34.1
#> 10           42  
#> # … with 334 more rows

Created on 2021-02-03 by the reprex package (v0.3.0)

@romainfrancois
Copy link
Member

🤔 I believe you are led on a wrong paths, and that filter(if_any(~.x == 39.1)) should straight up be illegal. Trying to go in this direction in #5752.

We tried to have .cols = everything() be the second argument instead of the first, but ruled it out because it was confusing wrt across() ...

@romainfrancois romainfrancois added this to the 1.0.5 milestone Feb 11, 2021
@romainfrancois
Copy link
Member

This is a tricky one, perhaps if_any() / if_all() need to go back on the chalkboard for a bit.

Perhaps the default for .cols= should be everything() which is what it is in across() so that at least you would be able to do:

library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)

penguins %>% 
  filter(if_any(.fns = ~.x == 39.1))
#> # A tibble: 1 x 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           39.1          18.7              181        3750 male 
#> # … with 1 more variable: year <int>

Created on 2021-02-17 by the reprex package (v0.3.0)

Perhaps .fns = is not the right name then, perhaps it should only be allowed to be a single function instead of potentially a list of functions.

Perhaps the call to across() should be instrumented so that we catch the error and make it more useful in this context.

Or perhaps the documentation should include some examples to explain that the initial code is off label ?

@romainfrancois romainfrancois removed this from the 1.0.5 milestone Feb 18, 2021
@jthomasmock
Copy link
Author

I agree that it's slightly confusing that the error suggests select() when the user is trying to use filter(), and that not having an indicated column should probably error.

🤔 I believe you are led on a wrong paths, and that filter(if_any(~.x == 39.1)) should straight up be illegal. Trying to go in this direction in #5752.

We tried to have .cols = everything() be the second argument instead of the first, but ruled it out because it was confusing wrt across() ...

Perhaps if a column is not indicated, the error could suggest something like you mentioned above:

#> Error: Problem with filter() input ..1.
No columns indicated, please indicate columns with tidyselect, or use .cols = everything() to filter across all columns

@hadley
Copy link
Member

hadley commented Mar 31, 2021

It's going to be hard to give a more specific error message for this case, but I wonder if in this error:

library(dplyr)

mtcars %>% 
  filter(if_any(~.x == 39.1))
#> Error: Problem with `filter()` input `..1`.
#> x Formula shorthand must be wrapped in `where()`.
#> 
#>   # Bad
#>   data %>% select(~.x == 39.1)
#> 
#>   # Good
#>   data %>% select(where(~.x == 39.1))
#> ℹ Input `..1` is `if_any(~.x == 39.1)`.

We could at least not mention select(). So I think the place to start would be to create a reprex that only uses tidyselect, and then consider how we might parameterise eval_select() to get a better error here:

#>   # Bad
#>   if_any(~.x == 39.1)
#> 
#>   # Good
#>   if_any(where(~.x == 39.1))

Or even just suppressing the error would be better than the current behaviour.

That at least gets you to:

mtcars %>% 
   filter(if_any(where(~.x == 39.1)))
#> Error: Problem with `filter()` input `..1`.
#> x `where()` must be used with functions that return `TRUE` or `FALSE`.
#> ℹ Input `..1` is `if_any(where(~.x == 39.1))`.

And that message really needs to emphasise a "single" TRUE or FALSE.

I think that's the best we can do; this is the downside NSE.

romainfrancois added a commit that referenced this issue Apr 9, 2021
@hadley hadley added feature a feature request or enhancement selection 🧺 tidyselect, scoped verbs, etc. labels Apr 19, 2021
romainfrancois added a commit that referenced this issue Apr 19, 2021
romainfrancois added a commit that referenced this issue Apr 20, 2021
romainfrancois added a commit that referenced this issue Apr 21, 2021
romainfrancois added a commit that referenced this issue Apr 22, 2021
* try an instrumenting approach

closes #5732

* try not to instrument error but detect misplaced ~

* only do it in if_any() and if_all()

* across() also aborts on formula .cols

* Update R/across.R

Co-authored-by: Lionel Henry <[email protected]>

* mention next step

* simplify error

* + test

* rebase hickup

* use context_local() / context_peek()

* using expect_snapshot() + expect_error() combo

* full message with expanded if_any()/if_all()

Co-authored-by: Lionel Henry <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement selection 🧺 tidyselect, scoped verbs, etc.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants