-
Notifications
You must be signed in to change notification settings - Fork 72
[DF] Add support for filtered groupby aggregations #760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DF] Add support for filtered groupby aggregations #760
Conversation
| input_col = input_expr.column_name(input_rel) | ||
| if input_col in cc._frontend_backend_mapping: | ||
| continue | ||
| random_name = new_temporary_column(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential issue with new_temporary_column that came to mind while working on this (I say potential because I'm unsure of the behavior of uuid.uuid4()):
Since we're using the table's columns attribute to check that a random column name hasn't been used yet, and we don't actually assign any of these random columns names until several of them have been generated, it is technically possible (though rare) to accidentally assign multiple input / filter columns to the same random backend name, which will certainly cause issues.
Since assign calls are expensive and we ideally want to be adding all required backend columns in a single go, it might make sense to refactor new_temporary_column to instead look at some attribute of the DataContainer or ColumnContainer to check for duplicates, which are both cheaper to update on the fly.
Don't intend to block this PR, but could be worthwhile to open an issue / TODO to handle this down the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that moving the check to dc or cc should make things significantly cheaper.
Based on the article here probability of name collision is almost negligible, especially at the scale at which we generate new columns
Codecov Report
@@ Coverage Diff @@
## datafusion-sql-planner #760 +/- ##
=========================================================
Coverage ? 75.55%
=========================================================
Files ? 73
Lines ? 3682
Branches ? 767
=========================================================
Hits ? 2782
Misses ? 766
Partials ? 134 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Adds back support for filtered aggregations with
FILTER (WHERE ...), which allows us to unxfailtest_group_by_filteredand unblocks several related tests in #746 and #759.