Commit e00b81e
[SPARK-34079][SQL] Merge non-correlated scalar subqueries
### What changes were proposed in this pull request?
This PR adds a new optimizer rule `MergeScalarSubqueries` to merge multiple non-correlated `ScalarSubquery`s to compute multiple scalar values once.
E.g. the following query:
```
SELECT
(SELECT avg(a) FROM t),
(SELECT sum(b) FROM t)
```
is optimized from:
```
== Optimized Logical Plan ==
Project [scalar-subquery#242 [] AS scalarsubquery()#253, scalar-subquery#243 [] AS scalarsubquery()#254L]
: :- Aggregate [avg(a#244) AS avg(a)#247]
: : +- Project [a#244]
: : +- Relation default.t[a#244,b#245] parquet
: +- Aggregate [sum(a#251) AS sum(a)#250L]
: +- Project [a#251]
: +- Relation default.t[a#251,b#252] parquet
+- OneRowRelation
```
to:
```
== Optimized Logical Plan ==
Project [scalar-subquery#242 [].avg(a) AS scalarsubquery()#253, scalar-subquery#243 [].sum(a) AS scalarsubquery()#254L]
: :- Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
: : +- Aggregate [avg(a#244) AS avg(a)#247, sum(a#244) AS sum(a)#250L]
: : +- Project [a#244]
: : +- Relation default.t[a#244,b#245] parquet
: +- Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
: +- Aggregate [avg(a#244) AS avg(a)#247, sum(a#244) AS sum(a)#250L]
: +- Project [a#244]
: +- Relation default.t[a#244,b#245] parquet
+- OneRowRelation
```
and in the physical plan subqueries are reused:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
*(1) Project [Subquery subquery#242, [id=#113].avg(a) AS scalarsubquery()#253, ReusedSubquery Subquery subquery#242, [id=#113].sum(a) AS scalarsubquery()#254L]
: :- Subquery subquery#242, [id=#113]
: : +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
*(2) Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
+- *(2) HashAggregate(keys=[], functions=[avg(a#244), sum(a#244)], output=[avg(a)#247, sum(a)#250L])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#158]
+- *(1) HashAggregate(keys=[], functions=[partial_avg(a#244), partial_sum(a#244)], output=[sum#262, count#263L, sum#264L])
+- *(1) ColumnarToRow
+- FileScan parquet default.t[a#244] Batched: true, DataFilters: [], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
+- == Initial Plan ==
Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
+- HashAggregate(keys=[], functions=[avg(a#244), sum(a#244)], output=[avg(a)#247, sum(a)#250L])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#110]
+- HashAggregate(keys=[], functions=[partial_avg(a#244), partial_sum(a#244)], output=[sum#262, count#263L, sum#264L])
+- FileScan parquet default.t[a#244] Batched: true, DataFilters: [], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
: +- ReusedSubquery Subquery subquery#242, [id=#113]
+- *(1) Scan OneRowRelation[]
+- == Initial Plan ==
...
```
Please note that this simple example could easily be optimized into a common select expression without a reuse node, but the rule can handle more complex queries as well.
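The equivalence the rule exploits can be illustrated outside of Spark. The following is a minimal sketch using Python's `sqlite3` (not Spark code): two non-correlated scalar subqueries over the same table yield the same values as a single aggregation that computes both at once, which is why merging them is safe.

```python
import sqlite3

# Illustration only (not Spark code): two scalar subqueries over the same
# table are equivalent to one aggregation pass returning both values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# Original form: two independent scalar subqueries -> two scans of t.
two_scans = conn.execute(
    "SELECT (SELECT avg(a) FROM t), (SELECT sum(b) FROM t)").fetchone()

# Merged form: one aggregation computes both values in a single scan.
one_scan = conn.execute("SELECT avg(a), sum(b) FROM t").fetchone()

print(two_scans)  # (2.0, 60)
assert two_scans == one_scan
```

In Spark the merged subquery additionally wraps the two values in a `named_struct` (the `mergedValue` column above), so each original subquery reference can extract its own field from the shared result.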
### Why are the changes needed?
Performance improvement.
```
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q9 - MergeScalarSubqueries off 50798 52521 1423 0.0 Infinity 1.0X
[info] q9 - MergeScalarSubqueries on 19484 19675 226 0.0 Infinity 2.6X
[info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q9b - MergeScalarSubqueries off 15430 17803 NaN 0.0 Infinity 1.0X
[info] q9b - MergeScalarSubqueries on 3862 4002 196 0.0 Infinity 4.0X
```
Please find `q9b` in the description of SPARK-34079. It is a variant of [q9.sql](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q9.sql) using CTE.
The performance improvement for `q9` comes from merging 15 subqueries into 5, and for `q9b` from merging 5 subqueries into 1.
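Why `q9` collapses from 15 subqueries to 5 can be sketched schematically (this is not the Catalyst rule itself): subqueries whose plans differ only in the aggregate they compute can share one merged plan, so grouping them by their underlying scan-plus-filter shows the expected merge count. The bucket predicates and aggregate list below follow q9's structure of 5 quantity ranges with 3 aggregates each.

```python
from collections import defaultdict

def merge_groups(subqueries):
    """Group subqueries that share the same scan and filter; each group
    becomes one merged subquery computing all of its aggregates at once."""
    merged = defaultdict(list)
    for source, predicate, agg_expr in subqueries:
        merged[(source, predicate)].append(agg_expr)
    return merged

# q9-style workload: 5 quantity buckets x 3 aggregates = 15 scalar subqueries.
subqueries = [
    ("store_sales", f"ss_quantity BETWEEN {lo} AND {lo + 19}", agg)
    for lo in (1, 21, 41, 61, 81)
    for agg in ("count(*)", "avg(ss_ext_discount_amt)", "avg(ss_net_paid)")
]

groups = merge_groups(subqueries)
print(len(subqueries), "->", len(groups))  # 15 -> 5
```

Each of the 5 merged subqueries then computes its 3 aggregates in a single scan, matching the roughly 2.6x speedup reported above.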
### Does this PR introduce _any_ user-facing change?
No, but the optimization can be disabled via the `spark.sql.optimizer.excludedRules` config.
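As a sketch, excluding the rule would look like the following; the fully qualified class name is an assumption inferred from the `catalyst/optimizer` directory in this commit's file tree:

```sql
-- Exclude the new rule from the optimizer (class name assumed from the
-- commit's file layout; verify against your Spark version).
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.MergeScalarSubqueries;
```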
### How was this patch tested?
Existing and new UTs.
Closes #32298 from peter-toth/SPARK-34079-multi-column-scalar-subquery.
Lead-authored-by: Peter Toth <[email protected]>
Co-authored-by: attilapiros <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Parent: 21e48b7
File tree (19 files changed, +1706, -1600 lines):
- sql
  - catalyst/src
    - main/scala/org/apache/spark/sql/catalyst
      - expressions
      - optimizer
      - plans/logical
      - trees
    - test/scala/org/apache/spark/sql/catalyst/optimizer
  - core/src
    - main
      - java/org/apache/spark/sql/execution
      - scala/org/apache/spark/sql/execution
        - aggregate
    - test
      - resources/tpcds-plan-stability/approved-plans-v1_4
        - q9.sf100
        - q9
      - scala/org/apache/spark/sql
        - execution