
Conversation

@tejasapatil
Contributor

@tejasapatil tejasapatil commented Aug 29, 2016

What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-15453

Extracting sort ordering information in FileSourceScanExec so that the planner can make use of it. My motivation for this change was to get Sort-Merge join on par with Hive's Sort-Merge-Bucket join when the source tables are bucketed + sorted.
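As a rough, illustrative sketch of the idea (not the actual patch; SortOrder/Ascending come from org.apache.spark.sql.catalyst.expressions, and the exact guard condition is elided here):

```
// Illustrative sketch only: have the scan node advertise the bucket spec's
// sort columns as its output ordering, so EnsureRequirements can skip the
// extra Sort before a sort-merge join over matching keys.
override def outputOrdering: Seq[SortOrder] = relation.bucketSpec match {
  case Some(spec) /* and the ordering is known to hold for each bucket */ =>
    spec.sortColumnNames
      .flatMap(name => output.find(_.name == name))
      .map(attr => SortOrder(attr, Ascending))
  case _ => Nil
}
```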

Query:

val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table8")
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table9")
context.sql("SELECT * FROM table8 a JOIN table9 b ON a.j=b.j AND a.k=b.k").explain(true)

Before:

== Physical Plan ==
*SortMergeJoin [j#120, k#121], [j#123, k#124], Inner
:- *Sort [j#120 ASC, k#121 ASC], false, 0
:  +- *Project [i#119, j#120, k#121]
:     +- *Filter (isnotnull(k#121) && isnotnull(j#120))
:        +- *FileScan orc default.table8[i#119,j#120,k#121] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Sort [j#123 ASC, k#124 ASC], false, 0
   +- *Project [i#122, j#123, k#124]
      +- *Filter (isnotnull(k#124) && isnotnull(j#123))
         +- *FileScan orc default.table9[i#122,j#123,k#124] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>

After: (note that the Sort step is no longer there)

== Physical Plan ==
*SortMergeJoin [j#49, k#50], [j#52, k#53], Inner
:- *Project [i#48, j#49, k#50]
:  +- *Filter (isnotnull(k#50) && isnotnull(j#49))
:     +- *FileScan orc default.table8[i#48,j#49,k#50] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Project [i#51, j#52, k#53]
   +- *Filter (isnotnull(j#52) && isnotnull(k#53))
      +- *FileScan orc default.table9[i#51,j#52,k#53] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct<i:int,j:int,k:string>

How was this patch tested?

Added a test case in JoinSuite. Ran all other tests in JoinSuite

…mize query plan

BEFORE

```
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
hc.sql("DROP TABLE table8").collect
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table8")
hc.sql("DROP TABLE table9").collect
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table9")

hc.sql("SELECT * FROM table8 a JOIN table9 b ON a.j=b.j AND a.k=b.k").explain(true)

== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, (('a.j = 'b.j) && ('a.k = 'b.k))
   :- 'UnresolvedRelation table8, a
   +- 'UnresolvedRelation table9, b

== Analyzed Logical Plan ==
i: int, j: int, k: string, i: int, j: int, k: string
Project [i#119, j#120, k#121, i#122, j#123, k#124]
+- Join Inner, ((j#120 = j#123) && (k#121 = k#124))
   :- SubqueryAlias a
   :  +- SubqueryAlias table8
   :     +- Relation[i#119,j#120,k#121] orc
   +- SubqueryAlias b
      +- SubqueryAlias table9
         +- Relation[i#122,j#123,k#124] orc

== Optimized Logical Plan ==
Join Inner, ((j#120 = j#123) && (k#121 = k#124))
:- Filter (isnotnull(k#121) && isnotnull(j#120))
:  +- Relation[i#119,j#120,k#121] orc
+- Filter (isnotnull(k#124) && isnotnull(j#123))
   +- Relation[i#122,j#123,k#124] orc

== Physical Plan ==
*SortMergeJoin [j#120, k#121], [j#123, k#124], Inner
:- *Sort [j#120 ASC, k#121 ASC], false, 0
:  +- *Project [i#119, j#120, k#121]
:     +- *Filter (isnotnull(k#121) && isnotnull(j#120))
:        +- *FileScan orc default.table8[i#119,j#120,k#121] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Sort [j#123 ASC, k#124 ASC], false, 0
   +- *Project [i#122, j#123, k#124]
      +- *Filter (isnotnull(k#124) && isnotnull(j#123))
         +- *FileScan orc default.table9[i#122,j#123,k#124] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
```

AFTER

```
== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, (('a.j = 'b.j) && ('a.k = 'b.k))
   :- 'UnresolvedRelation `table8`, a
   +- 'UnresolvedRelation `table9`, b

== Analyzed Logical Plan ==
i: int, j: int, k: string, i: int, j: int, k: string
Project [i#48, j#49, k#50, i#51, j#52, k#53]
+- Join Inner, ((j#49 = j#52) && (k#50 = k#53))
   :- SubqueryAlias a
   :  +- SubqueryAlias table8
   :     +- Relation[i#48,j#49,k#50] orc
   +- SubqueryAlias b
      +- SubqueryAlias table9
         +- Relation[i#51,j#52,k#53] orc

== Optimized Logical Plan ==
Join Inner, ((j#49 = j#52) && (k#50 = k#53))
:- Filter (isnotnull(k#50) && isnotnull(j#49))
:  +- Relation[i#48,j#49,k#50] orc
+- Filter (isnotnull(j#52) && isnotnull(k#53))
   +- Relation[i#51,j#52,k#53] orc

== Physical Plan ==
*SortMergeJoin [j#49, k#50], [j#52, k#53], Inner
:- *Project [i#48, j#49, k#50]
:  +- *Filter (isnotnull(k#50) && isnotnull(j#49))
:     +- *FileScan orc default.table8[i#48,j#49,k#50] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Project [i#51, j#52, k#53]
   +- *Filter (isnotnull(j#52) && isnotnull(k#53))
      +- *FileScan orc default.table9[i#51,j#52,k#53] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct<i:int,j:int,k:string>
```
@tejasapatil
Contributor Author

ok to test

@SparkQA

SparkQA commented Aug 29, 2016

Test build #64552 has finished for PR 14864 at commit 07196a8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

Jenkins, test this please. The last run had a JVM crash.

@SparkQA

SparkQA commented Aug 29, 2016

Test build #64592 has finished for PR 14864 at commit 07196a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

cc @rxin , @cloud-fan for review

@cloud-fan
Contributor

I'm not sure it's safe to do so. A bucket may have more than one file (this can happen if we append data to a bucketed table), so each file is sorted, but the whole bucket is NOT, because currently Spark SQL doesn't guarantee that records are sorted across all files of a bucket.

@tejasapatil
Contributor Author

@cloud-fan : I have taken care of that case in the PR (see L175 to L185). The sort ordering will only be used when every bucket has a single file. In subsequent PRs I plan to extend this so that it also benefits Hive tables, where there is a strict guarantee of a single file per bucket.

// sort ordering is only exposed when every bucket is backed by at most one file
val files = relation.location.listFiles(partitionFilters).flatMap(partition => partition.files)
val bucketToFilesGrouping =
  files.map(_.getPath.getName).groupBy(file => BucketingUtils.getBucketId(file))
val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1)
Contributor

Listing files and grouping by bucket id can be expensive if there are a lot of files. What's worse, we will do it again in createBucketedReadRDD.

Instead of doing this, I'd like to fix the sorting problem for bucketed tables first; then we don't need to scan file names to get the outputOrdering.

Contributor Author

For the sorting problem, one way to fix it would be to do what Hive does: create a single file per bucket. With any other approach there would be multiple files per bucket, so one would have to globally sort them while reading. That would be sub-optimal, because tables tend to be "write-once, read-many", and spending more CPU once on the write path to generate a single file would be better.

When I came across this, I wondered why it was designed this way. I even posted about it to the dev group earlier today: http://apache-spark-developers-list.1001551.n3.nabble.com/Questions-about-bucketing-in-Spark-td18814.html

To give you some context, I am trying to drive adoption of Spark within Facebook. We have a lot of tables which would benefit from having full bucketing support, so my high-level goal is to get Spark's bucketing on par with Hive's in terms of features and compatibility.

Contributor

Yea, that's a good question. A single file per bucket looks more reasonable; it's more important to read a bucketed table fast than to write it fast. But how about data insertion? Does Hive support inserting into a bucketed table?

Contributor Author

@cloud-fan : Open source Hive allows INSERTing data into a bucketed table, but doing so breaks the guarantee of one file per bucket. We could do better in two ways:

  1. Disallow operations which would break the bucketing guarantee, OR
  2. Always preserve bucketing across all operations, which would mean rewriting the entire table at times (e.g. INSERT INTO) and more complication in the code.

I think the latter is a better model for the longer term, but we could start with the first one and work over it iteratively.

@cloud-fan
Contributor

hi @tejasapatil , I think your second proposal (always preserve bucketing across all operations) makes sense, but it needs more discussion before we make a decision. For now, I think it's OK to use your current approach, i.e. only expose the outputOrdering if a bucket has a single file, but we should avoid listing files twice.

We can add a lazy val in FileSourceScanExec to store the files after partition pruning, and use it to calculate outputOrdering and inputRDD.

What do you think?

@tejasapatil
Contributor Author

@cloud-fan : Sounds good to me.

I tried doing that but got a Task not serializable: java.io.NotSerializableException: org.apache.hadoop.fs.LocatedFileStatus. This is because the new lazy val is a sequence of Partition, and Partition contains FileStatus, which is a Hadoop-internal class [0]. LocatedFileStatus is not serializable. Any suggestion on how to work around that?

[0] : https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L324

@cloud-fan
Contributor

mark it as @transient?
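For illustration, the resulting field could look roughly like the sketch below inside FileSourceScanExec (a hedged sketch; `selectedPartitions` is an assumed name, not necessarily what the patch uses):

```
// Illustrative sketch: cache the pruned file listing once and reuse it for both
// outputOrdering and createBucketedReadRDD, so files are listed only one time.
// @transient keeps the non-serializable Hadoop FileStatus objects out of the
// serialized plan when it is shipped to executors.
@transient private lazy val selectedPartitions = relation.location.listFiles(partitionFilters)
```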

@tejasapatil
Contributor Author

@cloud-fan : Thanks!! Did the change.

Jenkins, test this please

@SparkQA

SparkQA commented Sep 3, 2016

Test build #64898 has finished for PR 14864 at commit c60afd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

should we handle bucket columns the same way? i.e.

val bucketColumns = spec.bucketColumnNames.flatMap { n =>
  output.find(_.name == n)
}
if (bucketColumns.size == spec.bucketColumnNames.size) {

If the required output doesn't contain the sort columns, should we just ignore the sorting, or throw an exception?
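Completing the snippet above for illustration only (a hedged sketch; HashPartitioning/UnknownPartitioning come from org.apache.spark.sql.catalyst.plans.physical, and the exact fallback in the patch may differ):

```
// Illustrative sketch: report HashPartitioning only when every bucket column
// resolves against the scan's output; otherwise make no partitioning claim.
val bucketColumns = spec.bucketColumnNames.flatMap { n =>
  output.find(_.name == n)
}
if (bucketColumns.size == spec.bucketColumnNames.size) {
  HashPartitioning(bucketColumns, spec.numBuckets)
} else {
  UnknownPartitioning(0)
}
```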

Contributor Author

@cloud-fan : Sure. Did this change.

I am throwing an exception because the end user should know that there is something wrong with the table metadata and needs to look into it.

@tejasapatil tejasapatil force-pushed the SPARK-15453_smb_optimization branch from c60afd6 to 568b742 on September 3, 2016 16:10

def toAttribute(colName: String, columnType: String): Attribute =
  output.find(_.name == colName).getOrElse {
    throw new AnalysisException(s"Could not find $columnType column $colName for " +
Contributor

My concern is that, if a table has 3 columns (i, j, k), is bucketed by i and j, and is sorted by j and k, and we read only i and j from this table, then the generated RDD should still be bucketed, i.e. the number of partitions of this RDD should be equal to the number of buckets. But for each RDD partition, can we treat it as sorted by j?

Contributor Author

I see what you meant earlier. I have made changes to the PR to follow this:

For bucketed columns:
HashPartitioning would be used only when:

  • ALL the bucketing columns are being read from the table

For sorted columns:
Sort ordering should be used when ALL of these criteria are met:

  1. HashPartitioning is being used
  2. A prefix (or all) of the sort columns are being read from the table.

Sort ordering would be over the prefix subset of sort columns being read
from the table, e.g. assume (col0, col2, col3) are the columns read from the
table (see the sketch below).
@SparkQA

SparkQA commented Sep 3, 2016

Test build #64901 has finished for PR 14864 at commit 568b742.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65136 has finished for PR 14864 at commit 445549b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  .map(i => (i, i * 2, i.toString))
  .toDF("i", "j", "k")
  .coalesce(1)
df.write.bucketBy(4, "j", "k").sortBy("j", "k").saveAsTable("SPARK_15453_table_a")
Contributor

should we bucket the table by i, j and sort it by j, k, to reflect the test name, i.e. the join predicates being a subset of the sorted columns?

Contributor Author

For SMB to happen, bucketing columns == sort columns == join keys. My naming for the test case was wrong. I have deleted this test since I found a better place to add it.

@cloud-fan
Contributor

LGTM except some minor comments about tests, thanks for working on it!

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65155 has finished for PR 14864 at commit 7db6c10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65159 has finished for PR 14864 at commit 070c249.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 3354917 Sep 10, 2016
@tejasapatil tejasapatil deleted the SPARK-15453_smb_optimization branch September 10, 2016 03:39
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…ormation

## What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-15453

Extracting sort ordering information in `FileSourceScanExec` so that the planner can make use of it. My motivation for this change was to get Sort-Merge join on par with Hive's Sort-Merge-Bucket join when the source tables are bucketed + sorted.

Query:

```
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table8")
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table9")
context.sql("SELECT * FROM table8 a JOIN table9 b ON a.j=b.j AND a.k=b.k").explain(true)
```

Before:

```
== Physical Plan ==
*SortMergeJoin [j#120, k#121], [j#123, k#124], Inner
:- *Sort [j#120 ASC, k#121 ASC], false, 0
:  +- *Project [i#119, j#120, k#121]
:     +- *Filter (isnotnull(k#121) && isnotnull(j#120))
:        +- *FileScan orc default.table8[i#119,j#120,k#121] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Sort [j#123 ASC, k#124 ASC], false, 0
   +- *Project [i#122, j#123, k#124]
      +- *Filter (isnotnull(k#124) && isnotnull(j#123))
         +- *FileScan orc default.table9[i#122,j#123,k#124] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
```

After:  (note that the `Sort` step is no longer there)

```
== Physical Plan ==
*SortMergeJoin [j#49, k#50], [j#52, k#53], Inner
:- *Project [i#48, j#49, k#50]
:  +- *Filter (isnotnull(k#50) && isnotnull(j#49))
:     +- *FileScan orc default.table8[i#48,j#49,k#50] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Project [i#51, j#52, k#53]
   +- *Filter (isnotnull(j#52) && isnotnull(k#53))
      +- *FileScan orc default.table9[i#51,j#52,k#53] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct<i:int,j:int,k:string>
```

## How was this patch tested?

Added a test case in `JoinSuite`. Ran all other tests in `JoinSuite`

Author: Tejas Patil <[email protected]>

Closes apache#14864 from tejasapatil/SPARK-15453_smb_optimization.
@ankravch

Was this patch merged in the Spark 2.0.1 release? If so, how can it be enforced/turned on?

@cloud-fan
Contributor

This is an optimization and we usually don't backport optimizations.

asfgit pushed a commit that referenced this pull request Feb 20, 2017
…dering` information

### What changes were proposed in this pull request?
`outputOrdering` is also dependent on whether the bucket has more than one file. The test cases fail when we try to move them to sql/core. This PR is to fix the test cases introduced in #14864 and add a test case to verify [the related logic](https://github.com/tejasapatil/spark/blob/070c24994747c0479fb2520774ede27ff1cf8cac/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L197-L206).

### How was this patch tested?
N/A

Author: Xiao Li <[email protected]>

Closes #16994 from gatorsmile/bucketingTS.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
…dering` information

### What changes were proposed in this pull request?
`outputOrdering` is also dependent on whether the bucket has more than one file. The test cases fail when we try to move them to sql/core. This PR is to fix the test cases introduced in apache#14864 and add a test case to verify [the related logic](https://github.com/tejasapatil/spark/blob/070c24994747c0479fb2520774ede27ff1cf8cac/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L197-L206).

### How was this patch tested?
N/A

Author: Xiao Li <[email protected]>

Closes apache#16994 from gatorsmile/bucketingTS.