-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-41112][SQL] RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter #38619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case Join(_, agg: Aggregate, LeftSemi, _, _) => agg | ||
| } | ||
| assert(agg.size == 1) | ||
| assert(agg.head.fastEquals(ColumnPruning(agg.head))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test can pass without this pr because
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
Lines 68 to 79 in 38897b1
| Batch("Extract Python UDFs", Once, | |
| ExtractPythonUDFFromJoinCondition, | |
| // `ExtractPythonUDFFromJoinCondition` can convert a join to a cartesian product. | |
| // Here, we rerun cartesian product check. | |
| CheckCartesianProducts, | |
| ExtractPythonUDFFromAggregate, | |
| // This must be executed after `ExtractPythonUDFFromAggregate` and before `ExtractPythonUDFs`. | |
| ExtractGroupingPythonUDFFromAggregate, | |
| ExtractPythonUDFs, | |
| // The eval-python node may be between Project/Filter and the scan node, which breaks | |
| // column pruning and filter push-down. Here we rerun the related optimizer rules. | |
| ColumnPruning, |
I think it is just a coincidence since we converted the subquery to left semi join ..
|
cc @wangyum @cloud-fan @sigmod thank you |
| val actualFilterKeyExpr = mayWrapWithHash(filterCreationSideExp) | ||
| val alias = Alias(actualFilterKeyExpr, actualFilterKeyExpr.toString)() | ||
| val aggregate = Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan) | ||
| val aggregate = ColumnPruning(Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit hacky. Can we use the catalyst framework to optimize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that will be overkill to apply the whole optimizer rules. The filterCreationSidePlan is simple enough, it can only contain project and filter. Apply ColumnPruning here seems more safer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about the filter push rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filterCreationSidePlan has been optimized , so it should be done if there is a filter can be pushed.
I think one more useful rule is CollapseProject, but it should be fine not to apply here since PhysicalOperation support collect adjacent projects.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
thanks, merging to master! |
|
thank you @cloud-fan @dongjoon-hyun |
…ith in-subquery filter ### What changes were proposed in this pull request? Apply ColumnPruning for in subquery filter. Note that, the bloom filter side has already fixed by apache#36047 ### Why are the changes needed? The inferred in-subquery filter should apply ColumnPruning before get plan statistics and check if can be broadcasted. Otherwise, the final physical plan will be different from expected. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? add test Closes apache#38619 from ulysses-you/SPARK-41112. Authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Apply ColumnPruning for in subquery filter.
Note that, the bloom filter side has already fixed by #36047
Why are the changes needed?
The inferred in-subquery filter should apply ColumnPruning before get plan statistics and check if can be broadcasted. Otherwise, the final physical plan will be different from expected.
Does this PR introduce any user-facing change?
no
How was this patch tested?
add test