@wangyum (Member) commented Jun 13, 2023

What changes were proposed in this pull request?

This PR fixes the statistics computation (computeStats) for the case where a BaseAggregateExec node sits above a QueryStageExec.

For an aggregation, the final result may contain 1 row even when the shuffle outputs 0 rows. For example:

SELECT count(*) FROM tbl WHERE false;

Here the shuffle outputs 0 rows, yet the final result contains 1 row. Please see the UI.
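
To make the row-count mismatch concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark code: the helper names `global_count` and `grouped_count` are made up for illustration. It shows that an aggregate without grouping keys emits one row even for empty input, while a grouped aggregate emits none.

```python
from collections import defaultdict


def global_count(rows):
    """Aggregate with no grouping keys: always emits exactly one row."""
    return [(sum(1 for _ in rows),)]


def grouped_count(rows, key):
    """Aggregate with grouping keys: emits one row per distinct key."""
    counts = defaultdict(int)
    for row in rows:
        counts[key(row)] += 1
    return sorted(counts.items())


# Models a query stage whose shuffle produced 0 rows (hypothetical data).
shuffle_output = []

print(global_count(shuffle_output))                       # [(0,)]
print(grouped_count(shuffle_output, key=lambda r: r[0]))  # []
```

So a row count taken from the shuffle alone (0) under-reports the output of a global aggregate placed above it (1).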

Why are the changes needed?

This fixes a data correctness issue. OptimizeOneRowPlan uses the stats to remove the Aggregate:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan ===
!Aggregate [id#5L], [id#5L]                                                                                   Project [id#5L]
 +- Union false, false                                                                                        +- Union false, false
    :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])         :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])
    +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])      +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])
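
Why this rewrite can return wrong data is easier to see in a toy model. The sketch below is plain Python, not Spark's optimizer: `dedup`, `optimize_one_row`, and the way the Union estimate is formed by adding the branch estimates are all assumptions made for illustration. It mirrors the plan above: two global `sum` branches over empty shuffles, a Union, and a de-duplicating Aggregate on top.

```python
def dedup(rows):
    """Models `Aggregate [id], [id]`: group by id, i.e. keep distinct values."""
    return sorted(set(rows), key=repr)


def optimize_one_row(estimated_rows, rows):
    """Toy rule: skip the de-duplicating aggregate when the estimate says <= 1 row."""
    if estimated_rows <= 1:
        return list(rows)  # the Aggregate is replaced by a plain Project
    return dedup(rows)


# Each Union branch is a global sum over an empty shuffle: one row holding NULL.
branch_rows = [None]
union_rows = branch_rows + branch_rows  # 2 rows in reality

shuffle_based_estimate = 0 + 0  # each branch reports its shuffle's 0 output rows
accurate_estimate = 1 + 1       # each global aggregate really emits 1 row

print(optimize_one_row(shuffle_based_estimate, union_rows))  # [None, None]
print(optimize_one_row(accurate_estimate, union_rows))       # [None]
```

With the shuffle-based estimate the duplicate NULL row survives, which is the kind of incorrect result the stats fix is meant to prevent.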

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Jun 13, 2023
@wangyum wangyum marked this pull request as draft June 13, 2023 16:35
@wangyum wangyum marked this pull request as ready for review June 13, 2023 16:59
@wangyum wangyum changed the title from "[SPARK-44040][SQL] Fix compute stats when AggregateExec nodes above QueryStageExec" to "[SPARK-44040][SQL] Fix compute stats when AggregateExec node above QueryStageExec" Jun 13, 2023
@wangyum (Member Author) commented Jun 14, 2023

@dongjoon-hyun (Member)

Thank you for pinging me, @wangyum .

}

override def computeStats(): Statistics = {
// TODO this is not accurate when there is other physical nodes above QueryStageExec.
Member

According to the PR description, do we still have issues with the other physical nodes?

Fix data issue. OptimizeOneRowPlan will use stats to remove Aggregate:

Member Author

I've checked other physical nodes and none of them seem to have this issue.

Member

Then, shall we remove this comment?

Member Author

I think we should keep it, since new physical nodes may be added in the future.

Contributor

This comment should still be valid for an aggregation over a non-empty query stage, as this PR only changes the empty case.

@ulysses-you (Contributor) left a comment

Thank you @wangyum for the fix; it looks correct to me.

@wangyum (Member Author) commented Jun 16, 2023

@dongjoon-hyun @HyukjinKwon Shall we merge this fix for the Spark 3.4.1 release?

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @wangyum .

@dongjoon-hyun (Member)

Feel free to merge this PR, @wangyum .

@wangyum wangyum closed this in 55ba63c Jun 16, 2023
wangyum added a commit that referenced this pull request Jun 16, 2023
…eryStageExec

### What changes were proposed in this pull request?

This PR fixes compute stats when `BaseAggregateExec` nodes above `QueryStageExec`.

For aggregation, when the number of shuffle output rows is 0, the final result may be 1. For example:
```sql
SELECT count(*) FROM tbl WHERE false;
```

The number of shuffle output rows is 0, and the final result is 1. Please see the [UI](https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891).

### Why are the changes needed?

Fix data issue. `OptimizeOneRowPlan` will use stats to remove `Aggregate`:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan ===
!Aggregate [id#5L], [id#5L]                                                                                   Project [id#5L]
 +- Union false, false                                                                                        +- Union false, false
    :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])         :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)])
    +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])      +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)])
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #41576 from wangyum/SPARK-44040.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 55ba63c)
Signed-off-by: Yuming Wang <[email protected]>
wangyum added a commit that referenced this pull request Jun 16, 2023
…eryStageExec

@wangyum (Member Author) commented Jun 16, 2023

Merged to master, branch-3.4 and branch-3.3.

@wangyum wangyum deleted the SPARK-44040 branch June 16, 2023 04:08
@HyukjinKwon (Member) left a comment

LGTM2

czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
…eryStageExec

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…eryStageExec

GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…eryStageExec

catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…eryStageExec
