[SPARK-44040][SQL] Fix compute stats when AggregateExec node above QueryStageExec #41576

wangyum · 2023-06-13T16:35:04Z

What changes were proposed in this pull request?

This PR fixes computeStats when a BaseAggregateExec node sits above a QueryStageExec.

For an aggregation, the final result may contain 1 row even when the number of shuffle output rows is 0. For example:

SELECT count(*) FROM tbl WHERE false;

Here the number of shuffle output rows is 0, but the final result has 1 row. Please see the UI (https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891).
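The row-count mismatch can be illustrated without Spark. The following is a minimal sketch in plain Scala, not Spark's implementation; `GlobalAggregateSketch`, `globalCount`, and `groupedCount` are hypothetical names introduced only for this illustration. It contrasts an aggregate without grouping keys, which emits one row even for empty input, with a grouped aggregate, which emits nothing for empty input.

```scala
object GlobalAggregateSketch {
  // Aggregate without grouping keys (like SELECT count(*) ...):
  // always emits exactly one row, even when the input is empty.
  def globalCount(rows: Seq[Long]): Seq[Long] =
    Seq(rows.size.toLong)

  // Aggregate with grouping keys (like SELECT id, count(*) ... GROUP BY id):
  // emits one row per distinct key, so empty input gives empty output.
  def groupedCount(rows: Seq[Long]): Map[Long, Long] =
    rows.groupBy(identity).map { case (key, group) => key -> group.size.toLong }
}
```

With an empty input, `globalCount` still returns a single row holding the value 0, which is the case the stats have to account for.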

Why are the changes needed?

This fixes a data correctness issue. OptimizeOneRowPlan uses the stats to remove the Aggregate:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan ===
!Aggregate [id#5L], [id#5L]                                                                                   Project [id#5L]
 +- Union false, false                                                                                        +- Union false, false
    :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])         :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])
    +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])      +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])
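The rewrite above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the code from this PR: `StageStats`, `reportedRows`, `unionMaxRows`, and `canDropAggregate` are hypothetical names, and the sketch assumes the rule drops a group-only Aggregate once its child is known to produce at most one row, with a Union's bound being the sum of its children's bounds.

```scala
object OneRowPlanSketch {
  // Hypothetical stand-in for the runtime statistics of a finished query stage.
  final case class StageStats(rowCount: Option[BigInt])

  // Row count reported for an aggregate sitting above a finished stage.
  // Assumption: an aggregate without grouping keys over 0 input rows
  // still produces 1 row, so 0 is bumped to 1 in that case only.
  def reportedRows(hasGroupingKeys: Boolean, stage: StageStats): Option[BigInt] =
    stage.rowCount.map { n =>
      if (!hasGroupingKeys && n == BigInt(0)) BigInt(1) else n
    }

  // Upper bound for a union: the sum of the children's bounds, if all are known.
  def unionMaxRows(children: Seq[Option[BigInt]]): Option[BigInt] =
    if (children.forall(_.isDefined)) Some(children.flatten.sum) else None

  // The deduplicating Aggregate is redundant only if at most one row can arrive.
  def canDropAggregate(childMaxRows: Option[BigInt]): Boolean =
    childMaxRows.exists(_ <= BigInt(1))
}
```

In this model, two empty stages reporting 0 rows each give a union bound of 0, so the Aggregate is dropped and both result rows survive. With the adjusted count of 1 per branch the bound becomes 2 and the Aggregate is kept.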

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

wangyum · 2023-06-14T02:32:28Z

@maryannxue @cloud-fan @dongjoon-hyun

dongjoon-hyun · 2023-06-14T03:00:13Z

Thank you for pinging me, @wangyum .

dongjoon-hyun · 2023-06-14T04:06:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala

  }

  override def computeStats(): Statistics = {
    // TODO this is not accurate when there is other physical nodes above QueryStageExec.


According to the PR description, do we still have issues with the other physical nodes?

Fix data issue. OptimizeOneRowPlan will use stats to remove Aggregate:

I've checked other physical nodes and none of them seem to have this issue.

Then, shall we remove this comment?

I think we should keep it; new physical nodes may be added in the future.

This comment should still be valid for aggregation over a non-empty query stage, as this PR only changes the empty case.

ulysses-you

Thank you @wangyum for the fix, looks correct to me.

wangyum · 2023-06-16T03:10:25Z

@dongjoon-hyun @HyukjinKwon Shall we merge this fix for the Spark 3.4.1 release?

dongjoon-hyun

+1, LGTM. Thank you, @wangyum .

dongjoon-hyun · 2023-06-16T03:12:00Z

Feel free to merge this PR, @wangyum .

…eryStageExec

### What changes were proposed in this pull request?

This PR fixes compute stats when `BaseAggregateExec` nodes above `QueryStageExec`.

For aggregation, when the number of shuffle output rows is 0, the final result may be 1. For example:

```sql
SELECT count(*) FROM tbl WHERE false;
```

The number of shuffle output rows is 0, and the final result is 1. Please see the [UI](https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891).

### Why are the changes needed?

Fix data issue. `OptimizeOneRowPlan` will use stats to remove `Aggregate`:

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan ===
!Aggregate [id#5L], [id#5L]                                                                                   Project [id#5L]
 +- Union false, false                                                                                        +- Union false, false
    :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])         :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])
    +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])      +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #41576 from wangyum/SPARK-44040.

Authored-by: Yuming Wang
Signed-off-by: Yuming Wang
(cherry picked from commit 55ba63c)
Signed-off-by: Yuming Wang

wangyum · 2023-06-16T04:08:21Z

Merged to master, branch-3.4 and branch-3.3.

HyukjinKwon

LGTM2

github-actions bot added the SQL label Jun 13, 2023

wangyum marked this pull request as draft June 13, 2023 16:35

Fix compute stats when AggregateExec nodes above QueryStageExec

d0551db

wangyum force-pushed the SPARK-44040 branch from 6d94c36 to d0551db Compare June 13, 2023 16:39

wangyum marked this pull request as ready for review June 13, 2023 16:59

wangyum changed the title ~~[SPARK-44040][SQL] Fix compute stats when AggregateExec nodes above QueryStageExec~~ [SPARK-44040][SQL] Fix compute stats when AggregateExec node above QueryStageExec Jun 13, 2023

dongjoon-hyun reviewed Jun 14, 2023

fix

91fceea

ulysses-you reviewed Jun 14, 2023

dongjoon-hyun approved these changes Jun 16, 2023

wangyum closed this in 55ba63c Jun 16, 2023

wangyum deleted the SPARK-44040 branch June 16, 2023 04:08

HyukjinKwon reviewed Jun 19, 2023
