Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Jan 6, 2021

What changes were proposed in this pull request?

This pr add row count to Union operator when CBO enabled.

spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")

Before this pr:

== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B)
:- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)

After this pr:

== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
:- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)

Why are the changes needed?

Improve query performance, JoinEstimation.estimateInnerOuterJoin need the row count.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

Comment on lines +82 to +90
override def visitUnion(p: Union): Statistics = {
val stats = p.children.map(_.stats)
val rowCount = if (stats.exists(_.rowCount.isEmpty)) {
None
} else {
Some(stats.map(_.rowCount.get).sum)
}
Statistics(sizeInBytes = stats.map(_.sizeInBytes).sum, rowCount = rowCount)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same logic, just add row count:

override def visitUnion(p: Union): Statistics = {
Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).sum)
}

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38334/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38334/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Test build #133746 has finished for PR 31068 at commit c0dbbe4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Jan 6, 2021

Seems needing to update query plan files.

@github-actions github-actions bot added the SQL label Jan 6, 2021
@maropu
Copy link
Member

maropu commented Jan 6, 2021

Looks fine except for the @viirya comment.

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38358/

@HyukjinKwon
Copy link
Member

@wangyum, sorry but can you push an empty commit to retrigger the GA build? I would like to keep the result of the test failure because it looks like a flaky test.

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38358/

@HyukjinKwon
Copy link
Member

The flaky tests are known and being fixed at https://github.com/apache/spark/pull/31076. Let me just merge this in

@HyukjinKwon
Copy link
Member

Merged to master.

@wangyum wangyum deleted the SPARK-34031 branch January 7, 2021 06:17
@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Test build #133770 has finished for PR 31068 at commit 3c5af90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants