-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-38162][SQL] Optimize one row plan in normal and AQE Optimizer #35473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @maryannxue FYI |
5db385c to
530aec8
Compare
530aec8 to
0734a0e
Compare
|
previous test failed caused by bug, now is ok. cc @HyukjinKwon @cloud-fan @maryannxue |
|
A weird case is see #35250 |
| * - if the child of aggregate max rows less than or equal to 1, set distinct to false in all | ||
| * aggregate expression | ||
| */ | ||
| object OptimizeOneRowPlan extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put it in a new file?
|
|
||
| /** | ||
| * The rule is applied both normal and AQE Optimizer. It optimizes plan using max rows: | ||
| * - if the child of sort max rows less than or equal to 1, remove the sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * - if the child of sort max rows less than or equal to 1, remove the sort | |
| * - if the max rows of the child of sort less than or equal to 1, remove the sort |
| case Sort(_, _, child) if maxRowNotLargerThanOne(child) => child | ||
| case Sort(_, false, child) if maxRowPerPartitionNotLargerThanOne(child) => child | ||
| case agg @ Aggregate(_, _, child) if agg.groupOnly && | ||
| agg.outputSet.subsetOf(child.outputSet) && maxRowNotLargerThanOne(child) => child |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't this change the query plan output columns? I think a clear idea is: if child outputs at most one row, we can turn group-only aggregate into a project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's more general. updated
| } | ||
|
|
||
| override def apply(plan: LogicalPlan): LogicalPlan = { | ||
| plan.transformDownWithPruning(_.containsAnyPattern(SORT, AGGREGATE), ruleId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this rule removes node, I think transform up should be more efficient.
| _.containsPattern(SORT))(applyLocally) | ||
|
|
||
| private val applyLocally: PartialFunction[LogicalPlan, LogicalPlan] = { | ||
| case Sort(_, _, child) if child.maxRows.exists(_ <= 1L) => recursiveRemoveSort(child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, are you sure the new rule can fully cover this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is. The EliminateLimits only run Once , and the added rule run fixedPoint. It's no harmful since we have transformWithPruning
| val aliasedExprs = aggregateExprs.map { | ||
| case ne: NamedExpression => ne | ||
| case e => Alias(e, e.toString)() | ||
| case e => UnresolvedAlias(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems a small bug in test, the name will be an unresolved string if there is no alias specified.
|
|
||
| /** | ||
| * The rule is applied both normal and AQE Optimizer. It optimizes plan using max rows: | ||
| * - if the max rows of the child of sort less than or equal to 1, remove the sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * - if the max rows of the child of sort less than or equal to 1, remove the sort | |
| * - if the max rows of the child of sort is less than or equal to 1, remove the sort |
| */ | ||
| object OptimizeOneRowPlan extends Rule[LogicalPlan] { | ||
| private def maxRowNotLargerThanOne(plan: LogicalPlan): Boolean = { | ||
| plan.maxRows.exists(_ <= 1L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the code itself is simple and clean, we don't need to create a method for it.
| test("SPARK-38162: Optimize one row plan in AQE Optimizer") { | ||
| withTempView("v") { | ||
| spark.sparkContext.parallelize( | ||
| (1 to 4).map(i => TestData( i, i.toString)), 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (1 to 4).map(i => TestData( i, i.toString)), 2) | |
| (1 to 4).map(i => TestData(i, i.toString)), 2) |
| // convert group only aggregate to project | ||
| val (origin2, adaptive2) = runAdaptiveAndVerifyResult( | ||
| """ | ||
| |SELECT distinct c1 FROM (SELECT /*+ repartition(c1) */ * FROM v where c1 = 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if there is no /*+ repartition(c1) */?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing happens, the aggregate node is inside the logical query stage, so we can not optimize it at logical side:
LogicalQueryStage(logicalAgg: Aggregate, physicalAgg: BaseAggregateExec)
And the plan inside physicalAgg:
BaseAggregateExec final
ShuffleQueryStage
Exchange
BaseAggregateExec partial| // remove distinct in aggregate | ||
| val (origin3, adaptive3) = runAdaptiveAndVerifyResult( | ||
| """ | ||
| |SELECT sum(distinct c1) FROM (SELECT /*+ repartition(c1) */ * FROM v where c1 = 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question
|
thanks, merging to master! |
What changes were proposed in this pull request?
OptimizeOneMaxRowPlanin normal Optimizer and AQE Optimizer.EliminateSortsintoOptimizeOneMaxRowPlan, also update its comment and testWhy are the changes needed?
Optimize the plan if its max row is equal to or less than 1 in these cases:
remove the local sort
it's grouping only(include the rewritten distinct plan), convert aggregate to project
set distinct to false in all aggregate expression
Does this PR introduce any user-facing change?
no, only change the plan
How was this patch tested?
OptimizeOneMaxRowPlanSuitefor normal optimizerAdaptiveQueryExecSuitefor AQE optimizer