[SPARK-38162][SQL] Optimize one row plan in normal and AQE Optimizer #35473

Closed

ulysses-you wants to merge 5 commits into apache:master from ulysses-you:SPARK-38162

Contributor

ulysses-you commented Feb 10, 2022

What changes were proposed in this pull request?

Add a new rule OptimizeOneMaxRowPlan to the normal Optimizer and the AQE Optimizer.
Move the similar optimization from EliminateSorts into OptimizeOneMaxRowPlan, and update its comments and tests.

Why are the changes needed?

Optimize the plan if its max rows is less than or equal to 1, in these cases:

if the max rows of the child of a sort is less than or equal to 1, remove the sort
if the max rows per partition of the child of a local sort is less than or equal to 1, remove the local sort
if the max rows of the child of an aggregate is less than or equal to 1 and the aggregate is grouping only (including the rewritten distinct plan), convert the aggregate to a project
if the max rows of the child of an aggregate is less than or equal to 1, set distinct to false in all aggregate expressions
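
A minimal sketch of the cases listed above, written against a toy plan model rather than Catalyst. Every class and method name here (`Plan`, `AggCall`, `optimize`, and so on) is hypothetical and exists only for illustration; "grouping only" is simplified to "no aggregate calls", and the per-partition (local sort) case is left out.

```scala
// Toy model only: these classes are hypothetical and are NOT Catalyst's LogicalPlan API.
object OneRowPlanSketch {
  sealed trait Plan { def maxRows: Option[Long] }

  case class Relation(name: String) extends Plan {
    def maxRows: Option[Long] = None // row count unknown
  }
  case class Limit(n: Long, child: Plan) extends Plan {
    def maxRows: Option[Long] = Some(child.maxRows.fold(n)(m => math.min(m, n)))
  }
  case class Sort(order: Seq[String], child: Plan) extends Plan {
    def maxRows: Option[Long] = child.maxRows
  }
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def maxRows: Option[Long] = child.maxRows
  }
  case class AggCall(function: String, column: String, isDistinct: Boolean)
  case class Aggregate(grouping: Seq[String], aggCalls: Seq[AggCall], child: Plan) extends Plan {
    // A global aggregate yields one row; a grouped one yields at most its input size.
    def maxRows: Option[Long] = if (grouping.isEmpty) Some(1L) else child.maxRows
  }

  private def atMostOneRow(plan: Plan): Boolean = plan.maxRows.exists(_ <= 1L)

  def optimize(plan: Plan): Plan = {
    // Rewrite the children first (bottom-up), then the node itself.
    val rewritten = plan match {
      case Limit(n, c)        => Limit(n, optimize(c))
      case Sort(o, c)         => Sort(o, optimize(c))
      case Project(cs, c)     => Project(cs, optimize(c))
      case Aggregate(g, a, c) => Aggregate(g, a, optimize(c))
      case leaf               => leaf
    }
    rewritten match {
      // Sorting at most one row is a no-op.
      case Sort(_, child) if atMostOneRow(child) => child
      // Grouping at most one row cannot merge anything, so a projection is enough.
      case Aggregate(g, aggCalls, child) if aggCalls.isEmpty && atMostOneRow(child) =>
        Project(g, child)
      // De-duplicating at most one row is a no-op, so DISTINCT can be dropped.
      case Aggregate(g, aggCalls, child) if atMostOneRow(child) && aggCalls.exists(_.isDistinct) =>
        Aggregate(g, aggCalls.map(_.copy(isDistinct = false)), child)
      case other => other
    }
  }
}
```

In this model a `Limit(1, ...)` child is what makes `maxRows` known; a plain `Relation` reports `None`, so none of the rewrites fire on it.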

Does this PR introduce any user-facing change?

No, it only changes the plan.

How was this patch tested?

Add a new test suite, OptimizeOneMaxRowPlanSuite, for the normal optimizer
Add a test in AdaptiveQueryExecSuite for the AQE optimizer

github-actions bot added the SQL label

Member

HyukjinKwon commented Feb 11, 2022

cc @maryannxue FYI

ulysses-you force-pushed the SPARK-38162 branch 2 times, most recently from 5db385c to 530aec8 Compare

February 11, 2022 03:57


          Optimize one row plan in normal and AQE Optimizer

0734a0e

ulysses-you force-pushed the SPARK-38162 branch from 530aec8 to 0734a0e Compare

February 12, 2022 01:41

Contributor Author

ulysses-you commented Feb 14, 2022

The previous test failure was caused by a bug; it is fixed now. cc @HyukjinKwon @cloud-fan @maryannxue

Contributor

zhengruifeng commented Feb 15, 2022

A weird case is Sample with withReplacement=true. The underlying SampleExec may output more rows than maxRows.
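
To illustrate why sampling with replacement can break a row bound, here is a toy sketch that assumes a Poisson-style sampler in which each input row is emitted a Poisson-distributed number of times. The object and method names are hypothetical, and this is not the code of SampleExec.

```scala
import scala.util.Random

// Toy sketch only: a hypothetical Poisson sampler, not Spark's SampleExec.
object SampleWithReplacementSketch {
  // Knuth's method for drawing a Poisson-distributed count with the given mean.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // Emit each input row a Poisson(fraction)-distributed number of times.
  def sample[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    rows.flatMap(row => Seq.fill(poisson(fraction, rng))(row))
}
```

With a single input row and a fraction above 1, the output frequently holds several copies of that row, so a `maxRows` value inherited from the child would understate the real output size.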

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala Outdated

    
     *   - if the child of aggregate max rows less than or equal to 1, set distinct to false in all
     *     aggregate expression
     */
    object OptimizeOneRowPlan extends Rule[LogicalPlan] {

Contributor

cloud-fan Feb 21, 2022

can we put it in a new file?

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala Outdated

    
    /**
     * The rule is applied both normal and AQE Optimizer. It optimizes plan using max rows:
     *   - if the child of sort max rows less than or equal to 1, remove the sort

Contributor

cloud-fan Feb 21, 2022

Suggested change

      
-    *   - if the child of sort max rows less than or equal to 1, remove the sort
+    *   - if the max rows of the child of sort less than or equal to 1, remove the sort

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala Outdated

    
    case Sort(_, _, child) if maxRowNotLargerThanOne(child) => child
    case Sort(_, false, child) if maxRowPerPartitionNotLargerThanOne(child) => child
    case agg @ Aggregate(_, _, child) if agg.groupOnly &&
      agg.outputSet.subsetOf(child.outputSet) && maxRowNotLargerThanOne(child) => child

Contributor

cloud-fan Feb 21, 2022

Won't this change the query plan's output columns? I think a clearer idea is: if the child outputs at most one row, we can turn a group-only aggregate into a project.
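
The output-column concern can be shown with a toy model in which every node reports its output column names. All names below are hypothetical and unrelated to Catalyst's classes; the sketch only contrasts replacing the aggregate with its child against replacing it with a projection.

```scala
// Toy model only: hypothetical classes, not Catalyst's API.
object OutputColumnsSketch {
  sealed trait Plan { def output: Seq[String] }

  case class OneRowRelation(output: Seq[String]) extends Plan
  case class Aggregate(grouping: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = grouping // group-only: outputs just the grouping columns
  }
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }

  // Dropping the aggregate exposes every column of the child.
  def replaceWithChild(agg: Aggregate): Plan = agg.child

  // Projecting the grouping columns keeps the aggregate's output unchanged.
  def replaceWithProject(agg: Aggregate): Plan = Project(agg.grouping, agg.child)
}
```

For an aggregate grouping on `a` over a child with columns `a` and `b`, the first rewrite widens the output to both columns, while the second keeps it at `a`.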

Contributor Author

ulysses-you Feb 21, 2022

Yes, that's more general. Updated.

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala Outdated

    
    }
    override def apply(plan: LogicalPlan): LogicalPlan = {
      plan.transformDownWithPruning(_.containsAnyPattern(SORT, AGGREGATE), ruleId) {

Contributor

cloud-fan Feb 21, 2022

Since this rule removes nodes, I think transform up should be more efficient.
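
One plausible reading of this remark, sketched on a toy tree: a top-down pass applies the rule to a node and then descends into the children of the result, so when a removed sort is replaced by another removable sort, that replacement is not revisited in the same pass. A bottom-up pass removes the inner sort first and clears both in one go. The traversal helpers below are hypothetical simplifications, not Catalyst's TreeNode methods.

```scala
// Toy model only: hypothetical tree and traversals, not Catalyst's TreeNode.
object TraversalSketch {
  sealed trait Node
  case object OneRow extends Node // leaf known to produce at most one row
  case class Sort(child: Node) extends Node

  type Rule = PartialFunction[Node, Node]

  private def mapChildren(node: Node)(f: Node => Node): Node = node match {
    case Sort(child) => Sort(f(child))
    case leaf        => leaf
  }

  // Apply the rule to the node, then recurse into the children of the result.
  def transformDown(node: Node)(rule: Rule): Node = {
    val afterRule = rule.applyOrElse(node, (n: Node) => n)
    mapChildren(afterRule)(child => transformDown(child)(rule))
  }

  // Recurse into the children first, then apply the rule to the rebuilt node.
  def transformUp(node: Node)(rule: Rule): Node = {
    val afterChildren = mapChildren(node)(child => transformUp(child)(rule))
    rule.applyOrElse(afterChildren, (n: Node) => n)
  }

  def atMostOneRow(node: Node): Boolean = node match {
    case OneRow      => true
    case Sort(child) => atMostOneRow(child)
  }

  val removeSort: Rule = { case Sort(child) if atMostOneRow(child) => child }
}
```

On `Sort(Sort(OneRow))`, a single `transformDown` pass leaves `Sort(OneRow)` behind and needs a second pass, whereas a single `transformUp` pass returns `OneRow`.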

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

    
      _.containsPattern(SORT))(applyLocally)
    private val applyLocally: PartialFunction[LogicalPlan, LogicalPlan] = {
      case Sort(_, _, child) if child.maxRows.exists(_ <= 1L) => recursiveRemoveSort(child)

Contributor

cloud-fan Feb 21, 2022

hmm, are you sure the new rule can fully cover this?

Contributor Author

ulysses-you Feb 21, 2022

I think it does. EliminateLimits only runs Once, while the added rule runs with fixedPoint. It does no harm, since we have transformWithPruning.
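
The difference between the two strategies mentioned here can be sketched generically: a batch that runs once applies its rule a single time, while a fixed-point batch re-applies it until the plan stops changing or an iteration cap is hit. The helper names below are hypothetical, and the "plan" is just a list of operator names.

```scala
// Toy sketch only: hypothetical helpers, not Spark's RuleExecutor.
object BatchStrategySketch {
  def runOnce[A](plan: A)(rule: A => A): A = rule(plan)

  def runToFixedPoint[A](plan: A, maxIterations: Int)(rule: A => A): A = {
    var current = plan
    var iteration = 0
    var changed = true
    while (changed && iteration < maxIterations) {
      val next = rule(current)
      changed = next != current
      current = next
      iteration += 1
    }
    current
  }

  // Example rule: strip at most one leading "Sort" per application.
  val dropOneLeadingSort: List[String] => List[String] = {
    case "Sort" :: rest => rest
    case other          => other
  }
}
```

Starting from `List("Sort", "Sort", "Scan")`, the one-shot run still leaves a leading sort, while the fixed-point run ends at `List("Scan")`.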


          address comment

25bf931

ulysses-you commented

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala

    
     val aliasedExprs = aggregateExprs.map {
       case ne: NamedExpression => ne
-      case e => Alias(e, e.toString)()
+      case e => UnresolvedAlias(e)

Contributor Author

ulysses-you Feb 21, 2022

This seems to be a small bug in the test: the name will be an unresolved string if no alias is specified.

ulysses-you added 2 commits

February 22, 2022 09:55

nit

9f75df5

fix

22096cd

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowPlan.scala Outdated

    
    /**
     * The rule is applied both normal and AQE Optimizer. It optimizes plan using max rows:
     *   - if the max rows of the child of sort less than or equal to 1, remove the sort

Contributor

cloud-fan Feb 22, 2022

Suggested change

      
-    *   - if the max rows of the child of sort less than or equal to 1, remove the sort
+    *   - if the max rows of the child of sort is less than or equal to 1, remove the sort

cloud-fan reviewed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowPlan.scala Outdated

    
     */
    object OptimizeOneRowPlan extends Rule[LogicalPlan] {
      private def maxRowNotLargerThanOne(plan: LogicalPlan): Boolean = {
        plan.maxRows.exists(_ <= 1L)

Contributor

cloud-fan Feb 22, 2022

The code itself seems simple and clean; we don't need to create a method for it.
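
For reference, the one-liner relies on `Option.exists`, which returns false for `None`. A small standalone sketch (the object and method names are hypothetical) shows that an unknown row bound never satisfies the check:

```scala
// Standalone illustration of the Option-based check; names are hypothetical.
object MaxRowsCheckSketch {
  def atMostOneRow(maxRows: Option[Long]): Boolean = maxRows.exists(_ <= 1L)
}
```

Here `None` stands for "bound unknown", so the check is true only when a bound is present and is 0 or 1.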

cloud-fan reviewed

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala Outdated

    
    test("SPARK-38162: Optimize one row plan in AQE Optimizer") {
      withTempView("v") {
        spark.sparkContext.parallelize(
          (1 to 4).map(i => TestData( i, i.toString)), 2)

Contributor

cloud-fan Feb 22, 2022

Suggested change

      
-    (1 to 4).map(i => TestData( i, i.toString)), 2)
+    (1 to 4).map(i => TestData(i, i.toString)), 2)

cloud-fan reviewed

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

    
    // convert group only aggregate to project
    val (origin2, adaptive2) = runAdaptiveAndVerifyResult(
      """
        |SELECT distinct c1 FROM (SELECT /*+ repartition(c1) */ * FROM v where c1 = 1)

Contributor

cloud-fan Feb 22, 2022

what happens if there is no /*+ repartition(c1) */?

Contributor Author

ulysses-you Feb 22, 2022

Nothing happens: the aggregate node is inside the logical query stage, so we cannot optimize it on the logical side:

LogicalQueryStage(logicalAgg: Aggregate, physicalAgg: BaseAggregateExec)

And the plan inside physicalAgg:

BaseAggregateExec final
  ShuffleQueryStage
    Exchange
      BaseAggregateExec partial

cloud-fan reviewed

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

    
    // remove distinct in aggregate
    val (origin3, adaptive3) = runAdaptiveAndVerifyResult(
      """
        |SELECT sum(distinct c1) FROM (SELECT /*+ repartition(c1) */ * FROM v where c1 = 1)

Contributor

cloud-fan Feb 22, 2022

same question


          address comment

7eb41f6

Contributor

cloud-fan commented Feb 23, 2022

thanks, merging to master!

cloud-fan closed this in

b425156

ulysses-you deleted the SPARK-38162 branch

February 23, 2022 13:21

Labels

SQL