Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dplyr and tibble collision possible bug #1936

Closed
ghost opened this issue Jun 17, 2016 · 18 comments
Closed

dplyr and tibble collision possible bug #1936

ghost opened this issue Jun 17, 2016 · 18 comments

Comments

@ghost
Copy link

ghost commented Jun 17, 2016

Using R 3.3.0, dplyr 0.4.3, tibble 1.0

x <- dplyr::data_frame(foo = "bar")
x$qux

This returns NULL, as expected because the column 'qux' doesn't exits.

Now run this:

y <- tibble::data_frame(fizz = "buzz")
x$qux

This tells me "Error: Unknown column 'qux'" ...which doesn't make any sense because I haven't changed or touched x.

Note that this bug was discovered during our migration from 3.2 to 3.3

@hadley
Copy link
Member

hadley commented Jun 18, 2016

That will be the default as soon as dplyr 0.5 is out next week

@hadley hadley closed this as completed Jun 18, 2016
@kevinushey
Copy link
Contributor

kevinushey commented Jun 18, 2016

This seems like a dangerous change. Won't it break any packages that do something like is.null(df$foo) to test for the existence of foo in df?

@ghost
Copy link
Author

ghost commented Jun 18, 2016

yeah this is a huge change. This is how it should have worked from day 1, but it's going to be hard when data.frame still returns NULL

@krlmlr
Copy link
Member

krlmlr commented Jun 18, 2016

I'm seeing:

> x <- dplyr::data_frame(foo = "bar")
> x$qux
Error in stopc("Unknown column '", i, "'") : Unknown column 'qux'
5: traceback(1)
4: stop(..., domain = NA) at utils.r#53
3: stopc("Unknown column '", i, "'") at tbl-df.r#37
2: `$.tbl_df`(x, qux)
1: x$qux

We should do tidyverse/tibble#102 as an alternative.

@hadley
Copy link
Member

hadley commented Jun 18, 2016

We'll see. Having [ always return a data frame caused some problems in the early days too. But I'm confident that in the long run it will fix more problems than it creates.

@ghost
Copy link
Author

ghost commented Jun 19, 2016

has_column() should be included when this change goes live, that's a really good idea

But also, don't you think that this change will cause a lot of confusion considering that base::data.frame returns NULL when a column isn't found? I agree that this should never have been how this works (should have always been an error), but isn't going going to break a lot of legacy code?

@kevinushey
Copy link
Contributor

Perhaps tbls should stop inheriting from data.frame since they don't preserve the semantics of the underlying interface. (he said, snarkily)

To be honest, though, I don't like this change at all -- I think it's well understood that $ returns NULL when the element does not exist. This is how $ generally works across all R types -- why start changing those semantics now?

@krlmlr
Copy link
Member

krlmlr commented Jun 20, 2016

I got hurt badly once when trying to replace all data frames by data tables in existing code. Still, data tables are data frames.

I think we should be fine if dplyr makes sure that all operations return data frames from df input, and tibbles from tibble input. Switching to tibble will then be a choice made consciously, with all its alleged pros and cons.

Are there currently operations that return a tibble when passed a df?

@hadley
Copy link
Member

hadley commented Jun 20, 2016

Hmmmmm, maybe I didn't think this fully through - I didn't think about the use case of is.null(df$x). Personally, I never use this pattern (as I think it's relying on a subtle bug in R), but I agree that it is common in practice. This will become more problematic since more packages are importing tibble, not just suggesting it.

OTOH, there are lots of subtle bugs that people are probably currently missing:

df <- data.frame(xyz =1:3)
is.null(df$x)
# [1] FALSE

And the current behaviour of $ is particularly annoying when used interactively as recycling will mean that the error you get is typically not the root cause.

@hadley
Copy link
Member

hadley commented Jun 20, 2016

In my mind, this change might potentially require some changes to user code, but it forces an early error in many cases where you currently get a late error.

Somewhat unusually (compared to the rest of tibble), this change is most important when working interactively, and less important when programming.

@hadley
Copy link
Member

hadley commented Jun 20, 2016

I had thought that this was only in the dev version of tibble, but it is in the released version, which has been available for ~2 months. This is the first complaint I've heard, but that might be because most people are exposed to tibbles through dplyr.

Maybe we should not throw an error for [[ as that's a more "programmatic" interface, and might be more used by packages. $ would continue to protect users from inadvertent interactive errors.

@hadley hadley reopened this Jun 20, 2016
@hadley
Copy link
Member

hadley commented Jun 20, 2016

Then we'd have:

  • [: use when you want a smaller tibble
  • [[: use when you want a single column (returns null)
  • $: use when you want a single column (throws error)

That seems like a reasonable set of principles, and a reasonable compromise with the current behaviour.

@krlmlr
Copy link
Member

krlmlr commented Jun 20, 2016

Not the other way round -- [[ errs and $ gives NULL?

@krlmlr
Copy link
Member

krlmlr commented Jun 20, 2016

Ah, I see. The original complaint was about that $ throws an error, though.

@krlmlr
Copy link
Member

krlmlr commented Jun 20, 2016

I'd argue that if an operator is used in packages, we want to have an error, just like the current behavior of [ and [[. Errors in packages will be detected with R CMD check, too. The situation may be different with analysis scripts.

@hadley
Copy link
Member

hadley commented Jun 20, 2016

I think we do want to have an error but we need some easy way that people can preserve existing behaviour - relying on a function in tibble isn't sufficient, because the package might not use tibble. We could encourage people to use "x" %in% names(df) but switching from is.null(df$x) to is.null(df[["x"]]) is easier (and for regular data frames also protects against partial matching).

I've been thinking about framing tibbles as lazy, curmudgeonly data frames, i.e. they only do the minimal amount of work (they never coerce) and they complain whenever something isn't as they expect.

@cboettig
Copy link

I agree with @kevinushey that this change is surprising given the usual behavior of $ and the expectation that tibbles inherit data.frame behavior. I've had some code break in a subtle way due to the last tidyrrelease causing this behavior to be adopted in output data.frames, which is particularly confusing since dplyr still has the old behavior, making for conflicting behavior of tbl_df (though glad this week's release of dplyr should fix that).

Still, an opt-out option could at least ease the transition.

@hadley
Copy link
Member

hadley commented Jun 27, 2016

Closing here since implementation needs to happen in tibble.

@hadley hadley closed this as completed Jun 27, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants