@charlesbluca charlesbluca commented Sep 13, 2022

Adds back support for filtered aggregations with FILTER (WHERE ...), which allows us to unxfail test_group_by_filtered and unblocks several related tests in #746 and #759.
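For anyone unfamiliar with the clause, here is a minimal sketch of what a filtered aggregation computes. It is plain Python and is not how dask-sql implements it; the rows, the column names, and the `filtered_sum` helper are all hypothetical and exist only to illustrate the semantics.

```python
# Hypothetical rows for a table with columns (user_id, b, c).
rows = [
    {"user_id": 1, "b": 10, "c": 1},
    {"user_id": 1, "b": 20, "c": 3},
    {"user_id": 2, "b": 30, "c": 3},
    {"user_id": 2, "b": 40, "c": 5},
    {"user_id": 3, "b": 50, "c": 1},
]


def filtered_sum(rows, key, value, predicate):
    """Emulate: SELECT key, SUM(value) FILTER (WHERE predicate) FROM t GROUP BY key."""
    totals = {}
    for row in rows:
        group = row[key]
        # Every group appears in the output, even if none of its rows pass the filter.
        totals.setdefault(group, None)
        if predicate(row):
            current = totals[group]
            totals[group] = row[value] if current is None else current + row[value]
    return totals


# Equivalent in spirit to: SELECT user_id, SUM(b) FILTER (WHERE c = 3) FROM t GROUP BY user_id
result = filtered_sum(rows, "user_id", "b", lambda r: r["c"] == 3)
```

The point to notice is that the filter only restricts which rows feed the aggregate; a group with no passing rows still shows up, with a null (`None`) result.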

input_col = input_expr.column_name(input_rel)
if input_col in cc._frontend_backend_mapping:
    continue
random_name = new_temporary_column(df)
Collaborator Author

A potential issue with new_temporary_column that came to mind while working on this (I say potential because I'm unsure of the behavior of uuid.uuid4()):

Since we check the table's columns attribute to confirm that a random column name hasn't been used yet, and we don't actually assign any of these random column names until several of them have been generated, it is technically possible (though rare) to accidentally assign multiple input / filter columns to the same random backend name, which would certainly cause issues.

Since assign calls are expensive and we ideally want to add all required backend columns in a single go, it might make sense to refactor new_temporary_column to instead check some attribute of the DataContainer or ColumnContainer for duplicates, both of which are cheaper to update on the fly.
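A rough sketch of the idea, assuming we keep a set of reserved names somewhere cheap to update. The `new_temporary_name` helper and the `reserved` set are hypothetical and are not the existing `new_temporary_column` or any current attribute of `DataContainer` / `ColumnContainer`:

```python
import uuid


def new_temporary_name(reserved):
    """Return a random name that is not in `reserved`, and reserve it immediately.

    Hypothetical helper: `reserved` stands in for whatever bookkeeping the
    container would keep, so the check does not depend on the columns having
    been assigned to the dataframe yet.
    """
    while True:
        name = str(uuid.uuid4())
        if name not in reserved:
            reserved.add(name)
            return name


# Seed the set with the columns that already exist, then generate several
# temporary names before any of them is actually assigned.
existing_columns = ["a", "b"]
reserved = set(existing_columns)
temp_names = [new_temporary_name(reserved) for _ in range(5)]
```

Because each name is recorded the moment it is generated, two names handed out before a single assign call can never coincide, and the assign itself can still happen once at the end.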

Don't intend to block this PR, but could be worthwhile to open an issue / TODO to handle this down the line.

Collaborator

Agreed that moving the check to dc or cc should make things significantly cheaper.
Based on the article here, the probability of a name collision is almost negligible, especially at the scale at which we generate new columns.
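As a back-of-the-envelope check (not taken from the article), the standard birthday approximation gives a feel for the scale. This assumes a version 4 UUID carries 122 random bits drawn from a good random source; `collision_probability` is a hypothetical helper name:

```python
import math

# A version 4 UUID has 128 bits, 6 of which are fixed (version and variant),
# leaving 122 random bits. This assumes the underlying random source is sound.
UUID4_SPACE = 2**122


def collision_probability(n, space=UUID4_SPACE):
    """Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * space))."""
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-n * (n - 1) / (2 * space))


# Probability that any two of 1,000 generated names collide.
p_thousand = collision_probability(1_000)
```

Even for a thousand temporary columns in one query the estimate is far below anything that would show up in practice, which matches the point that this is a correctness nicety rather than an urgent fix.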

@charlesbluca charlesbluca marked this pull request as ready for review September 20, 2022 19:39
@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (datafusion-sql-planner@528108c).
The diff coverage is n/a.

@@                    Coverage Diff                    @@
##             datafusion-sql-planner     #760   +/-   ##
=========================================================
  Coverage                          ?   75.55%           
=========================================================
  Files                             ?       73           
  Lines                             ?     3682           
  Branches                          ?      767           
=========================================================
  Hits                              ?     2782           
  Misses                            ?      766           
  Partials                          ?      134           

@ayushdg ayushdg merged commit 2af0b06 into dask-contrib:datafusion-sql-planner Sep 21, 2022
@charlesbluca charlesbluca deleted the df-filter-where branch February 5, 2024 19:43