[SPARK-34031][SQL] Union operator missing rowCount when CBO enabled #31068

wangyum · 2021-01-06T13:27:56Z

What changes were proposed in this pull request?

This PR adds a row count to the Union operator's statistics when CBO is enabled.

spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")

Before this PR:

== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B)
:- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)

After this PR:

== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
:- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)

Why are the changes needed?

To improve query performance: JoinEstimation.estimateInnerOuterJoin needs the row count.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

wangyum · 2021-01-06T14:15:32Z

...cala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala

+  override def visitUnion(p: Union): Statistics = {
+    val stats = p.children.map(_.stats)
+    val rowCount = if (stats.exists(_.rowCount.isEmpty)) {
+      None
+    } else {
+      Some(stats.map(_.rowCount.get).sum)
+    }
+    Statistics(sizeInBytes = stats.map(_.sizeInBytes).sum, rowCount = rowCount)
+  }


This is the same logic as the existing implementation below, with the row count added:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala

Lines 148 to 150 in 6c5ba81

override def visitUnion(p: Union): Statistics = {
  Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).sum)
}
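The row-count rule in the diff above can be tried outside Spark with a minimal sketch. It assumes a hypothetical `SimpleStats` case class standing in for Spark's `Statistics`, and the names `UnionStatsSketch` and `unionStats` are illustrative only, not part of the patch. It shows the same idea: child sizes are always summed, while a row count is reported only when every child provides one.

```scala
// Hypothetical stand-in for Spark's Statistics class (assumption, not the real API).
final case class SimpleStats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object UnionStatsSketch {
  // Sum the children's sizes; emit a row count only if every child has one.
  def unionStats(children: Seq[SimpleStats]): SimpleStats = {
    val rowCount =
      if (children.forall(_.rowCount.isDefined)) Some(children.flatMap(_.rowCount).sum)
      else None
    SimpleStats(children.map(_.sizeInBytes).sum, rowCount)
  }
}
```

With two children of 160 bytes and 10 rows each, as in the example plan in the description, `unionStats` gives 320 bytes and 20 rows. If either child lacks a row count, the result carries only the size, which appears to be why the diff falls back to `None` rather than summing a partial count.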

SparkQA · 2021-01-06T15:01:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38334/

SparkQA · 2021-01-06T15:07:34Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38334/

SparkQA · 2021-01-06T17:17:50Z

Test build #133746 has finished for PR 31068 at commit c0dbbe4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-01-06T18:06:03Z

It seems the query plan files need to be updated.

maropu · 2021-01-06T23:22:20Z

Looks fine except for @viirya's comment.

SparkQA · 2021-01-07T05:02:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38358/

HyukjinKwon · 2021-01-07T05:31:58Z

@wangyum, sorry but can you push an empty commit to retrigger the GA build? I would like to keep the result of the test failure because it looks like a flaky test.

SparkQA · 2021-01-07T05:33:08Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38358/

HyukjinKwon · 2021-01-07T05:37:17Z

The flaky tests are known and are being fixed at https://github.com/apache/spark/pull/31076. Let me just merge this in.

HyukjinKwon · 2021-01-07T05:41:04Z

Merged to master.

SparkQA · 2021-01-07T08:23:21Z

Test build #133770 has finished for PR 31068 at commit 3c5af90.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Union operator missing rowCount

c0dbbe4

wangyum commented Jan 6, 2021

viirya approved these changes Jan 6, 2021

github-actions bot added the SQL label Jan 6, 2021

Fix

3c5af90

HyukjinKwon approved these changes Jan 7, 2021

Fix a nit and retrigger GA build

7521f42

HyukjinKwon closed this in aa509c1 Jan 7, 2021

wangyum deleted the SPARK-34031 branch January 7, 2021 06:17
