[SPARK-32986][SQL] Add bucketed scan info in query plan of data source v1#33698

Closed
c21 wants to merge 2 commits into apache:master from c21:scan-v1

Conversation

@c21 c21 commented Aug 11, 2021

What changes were proposed in this pull request?

As a follow-up to the discussion in #29804 (comment): currently the query plan for the data source v1 scan operator, FileSourceScanExec, has no information indicating whether the table is read as a bucketed table, and if not, what the reason is. Adding this info to the FileSourceScanExec physical query plan output helps users and developers understand query plans more easily, without spending a lot of time debugging why a table is not read as a bucketed table.

Why are the changes needed?

Helps users and developers debug query plans for bucketed tables.

Does this PR introduce any user-facing change?

Yes. Bucketed information is added to the physical query plan when reading a bucketed table.
Note that when reading a non-bucketed table, the query plan stays the same and nothing is changed.

Example:

Seq((1, 2), (2, 3)).toDF("i", "j").write.bucketBy(8, "i").saveAsTable("t1")
Seq(2, 3).toDF("i").write.bucketBy(8, "i").saveAsTable("t2")
val df1 = spark.table("t1")
val df2 = spark.table("t2")
df1.join(df2, df1("i") === df2("i")).explain()
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [i#20], [i#24], Inner
   :- Sort [i#20 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(i#20)
   :     +- FileScan parquet default.t1[i#20,j#21] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#20)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8
   +- Sort [i#24 ASC NULLS FIRST], false, 0
      +- Filter isnotnull(i#24)
         +- FileScan parquet default.t2[i#24] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#24)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
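The reason strings asserted in the review threads below ("disabled by configuration", "disabled by query planner", "bucket column(s) not read") suggest the scan node renders its bucketing state as a single metadata entry. A minimal, hypothetical Scala sketch of that rendering, for illustration only — the real logic lives inside FileSourceScanExec, and the names here are not Spark's API:

```scala
// Hypothetical model of a scan's bucketing state (illustrative, not Spark code).
sealed trait BucketedScanState
case object BucketedScan extends BucketedScanState
case class NonBucketedScan(reason: Option[String]) extends BucketedScanState

object BucketedScanInfo {
  // Render the state the way the "Bucketed" entry appears in the explain output above.
  def render(state: BucketedScanState): String = state match {
    case BucketedScan                  => "Bucketed: true"
    case NonBucketedScan(Some(reason)) => s"Bucketed: false ($reason)"
    case NonBucketedScan(None)         => "Bucketed: false"
  }
}
```

For example, `BucketedScanInfo.render(NonBucketedScan(Some("disabled by configuration")))` produces `Bucketed: false (disabled by configuration)`, matching the string checked in the unit tests quoted in the review comments.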

How was this patch tested?

Added a unit test in ExplainSuite.scala.

@github-actions github-actions bot added the SQL label Aug 11, 2021
Contributor Author

c21 commented Aug 11, 2021

@cloud-fan and @maropu could you help take a look when you have time? Thanks!

SparkQA commented Aug 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46813/

SparkQA commented Aug 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46813/

@dongjoon-hyun dongjoon-hyun (Member) left a comment

It looks good to me.

withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  checkKeywordsExistsInExplain(
    df1.join(df2, df1("i") === df2("i")),
    "Bucketed: true" :: Nil: _*)
Contributor

nit: "Bucketed: true" :: Nil: _* -> "Bucketed: true"?

Contributor Author

yeah, updated for this and other places.

withSQLConf(SQLConf.BUCKETING_ENABLED.key -> "false") {
  checkKeywordsExistsInExplain(
    df1.join(df2, df1("i") === df2("i")),
    "Bucketed: false (disabled by configuration)" :: Nil: _*)
Contributor

ditto

"Bucketed: false (disabled by configuration)" :: Nil: _*)
}

checkKeywordsExistsInExplain(df1, "Bucketed: false (disabled by query planner)" :: Nil: _*)
Contributor

ditto


checkKeywordsExistsInExplain(
  df1.select("j"),
  "Bucketed: false (bucket column(s) not read)" :: Nil: _*)
Contributor

ditto

SparkQA commented Aug 11, 2021

Test build #142306 has finished for PR 33698 at commit b087cd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46856/

SparkQA commented Aug 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46856/

SparkQA commented Aug 12, 2021

Test build #142348 has finished for PR 33698 at commit 2d42d37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member) commented

Merged to master for Apache Spark 3.3.0 according to the issue type, Improvement.

dongjoon-hyun (Member) commented

Thank you, @c21 and @cloud-fan .

Contributor Author

c21 commented Aug 12, 2021

Thank you @dongjoon-hyun and @cloud-fan for review!

@c21 c21 deleted the scan-v1 branch August 12, 2021 19:37