Skip to content

Conversation

@paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Dec 9, 2022

Reprex using CRAN arrow:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()
#> # A tibble: 32 × 4
#> # Groups:   cyl, value [3]
#>      mpg   cyl value `"foo"`
#>    <dbl> <dbl> <dbl> <chr>  
#>  1  21       6     6 foo    
#>  2  21       6     6 foo    
#>  3  22.8     4     4 foo    
#>  4  21.4     6     6 foo    
#>  5  18.7     8     8 foo    
#>  6  18.1     6     6 foo    
#>  7  14.3     8     8 foo    
#>  8  24.4     4     4 foo    
#>  9  22.8     4     4 foo    
#> 10  19.2     6     6 foo    
#> # … with 22 more rows

Created on 2022-12-09 with reprex v2.0.2

After this PR:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()
#> # A tibble: 32 × 3
#> # Groups:   cyl, value [3]
#>      mpg   cyl value
#>    <dbl> <dbl> <chr>
#>  1  21       6 foo  
#>  2  21       6 foo  
#>  3  22.8     4 foo  
#>  4  21.4     6 foo  
#>  5  18.7     8 foo  
#>  6  18.1     6 foo  
#>  7  14.3     8 foo  
#>  8  24.4     4 foo  
#>  9  22.8     4 foo  
#> 10  19.2     6 foo  
#> # … with 22 more rows

Created on 2022-12-09 with reprex v2.0.2

@github-actions
Copy link

github-actions bot commented Dec 9, 2022

@github-actions
Copy link

github-actions bot commented Dec 9, 2022

⚠️ GitHub issue #14872 has been automatically assigned in GitHub to PR creator.

@eitsupi
Copy link
Contributor

eitsupi commented Dec 10, 2022

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()

Wouldn't it be better to add this example to the test?

@paleolimbot
Copy link
Member Author

I did add a test that covers that case (and made sure it failed on current master); however, the test I added matches the style and content the tests around it (e.g., they all use example_data and compare_dplyr_binding()). Is there something that the example is testing that wasn't covered by the test that I added that I missed?

@eitsupi
Copy link
Contributor

eitsupi commented Dec 12, 2022

Is there something that the example is testing that wasn't covered by the test that I added that I missed?

I thought an example of defining a new column in group_by would that case.
It seems quite surprising that both value and "foo" columns are generated in the following example.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()

@thisisnic
Copy link
Member

I thought an example of defining a new column in group_by would that case. It seems quite surprising that both value and "foo" columns are generated in the following example.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()

Good catch there, though that looks like a bug which exists separately from this PR as I was able to reproduce this on the tip of the master branch. @paleolimbot - wanna take a look at this here, or shall we open a separate issue and come back to this?

@paleolimbot
Copy link
Member Author

It's definitely worth adding a test! Thanks @eitsupi for the catch. It's definitely related to this PR...the overlapping groups thing results in some very odd arguments to the mutate() embedded in group_by(), which results in all sorts of crazy things happening.

@paleolimbot
Copy link
Member Author

Is there any opposition to merging this?

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see now that @eitsupi 's comment was referring to the behaviour before these changes giving the surprising outcome, which has now been fixed by the changes here.

LGTM!

@paleolimbot paleolimbot merged commit 9753a67 into apache:master Dec 13, 2022
@paleolimbot paleolimbot deleted the fix-group-by branch December 13, 2022 13:26
@ursabot
Copy link

ursabot commented Dec 14, 2022

Benchmark runs are scheduled for baseline = 16d0eb4 and contender = 9753a67. 9753a67 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️25.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.5% ⬆️0.23%] test-mac-arm
[Finished ⬇️12.5% ⬆️4.35%] ursa-i9-9960x
[Finished ⬇️11.16% ⬆️6.79%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9753a672 ec2-t3-xlarge-us-east-2
[Finished] 9753a672 test-mac-arm
[Finished] 9753a672 ursa-i9-9960x
[Finished] 9753a672 ursa-thinkcentre-m75q
[Finished] 16d0eb4d ec2-t3-xlarge-us-east-2
[Finished] 16d0eb4d test-mac-arm
[Finished] 16d0eb4d ursa-i9-9960x
[Finished] 16d0eb4d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Dec 14, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[R] arrow returns wrong variable content when multiple group_by/summarise statements are used

4 participants