[SPARK-24528][SQL] Add support to read multiple sorted bucket files for data source v1#29625
c21 wants to merge 4 commits into apache:master
Conversation
cc @cloud-fan, @viirya, @imback82, and @sameeragarwal to take a look when you guys have time, thanks.
c21 left a comment:
Just highlighting the change from `FileScanRDD.compute()` to `BaseFileScanIterator`, to help review more easily.
    override def next(): Object

    private def readFile(file: PartitionedFile): Iterator[InternalRow] = {
This is changed compared to FileScanRDD.readCurrentFile: it takes the file as a parameter instead of relying on currentFile, because we need nextIterator() to call readFile() multiple times for different files in initializeHeapWithFirstRows().
Test build #128187 has finished for PR 29625 at commit
    context.killTaskIfInterrupted()
    (currentIterator != null && currentIterator.hasNext) || nextIterator()
This is changed compared to FileScanRDD.nextIterator: it does not call hasNext() here, because FileSortedBucketScanIterator needs to override hasNext().
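A minimal self-contained sketch of the pattern this comment describes (class and method names here are illustrative, not the PR's actual code): the base iterator checks `currentIterator.hasNext` directly, so a subclass can override `hasNext()` without that override being re-entered from `nextIterator()`.

```scala
// Illustrative sketch, not Spark's code: a base file-scan iterator whose
// nextIterator() deliberately avoids calling this.hasNext, so a subclass
// (like the PR's FileSortedBucketScanIterator) can override hasNext freely.
abstract class BaseScanIterator[T](files: Iterator[Iterator[T]]) extends Iterator[T] {
  protected var currentIterator: Iterator[T] = _

  override def hasNext: Boolean =
    (currentIterator != null && currentIterator.hasNext) || nextIterator()

  // Advance to the next non-empty file; returns true if a row is available.
  protected def nextIterator(): Boolean = {
    if (files.hasNext) {
      currentIterator = files.next()
      // Check the new iterator directly instead of calling hasNext(), so an
      // overridden hasNext() in a subclass is not invoked from here.
      currentIterator.hasNext || nextIterator()
    } else {
      false
    }
  }

  override def next(): T = currentIterator.next()
}
```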
Test build #128228 has finished for PR 29625 at commit
Test build #128229 has finished for PR 29625 at commit
retest this please
Test build #128264 has finished for PR 29625 at commit
After discussing with @cloud-fan, it would be better to have a rule that automatically decides whether to do a sorted bucket scan based on the query shape, and the same for bucket scan. So I will first do bucket scan in https://issues.apache.org/jira/browse/SPARK-24528, and redo this one after that PR is merged.
…cally

### What changes were proposed in this pull request?

This PR adds support for deciding bucketed table scan dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan, we use a bucketed table scan (all input files of a bucket are read by the same task). This has the drawback that if the bucketed table scan is not beneficial at all (no join/group-by/etc. in the query), we don't need to use it, as it restricts the number of tasks to the number of buckets and might hurt parallelism.

The feature adds a physical plan rule right after `EnsureRequirements`. The rule goes through the plan nodes. For all operators which have an "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), it checks whether the sub-plan for the operator has an `Exchange` and a bucketed table scan (and only allows certain operators in the plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc.`), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, it disables the bucketed table scan in the sub-plan. In addition, it disables the bucketed table scan if there is an operator with an interesting partition along the sub-plan.

Why the algorithm works: if there is a shuffle between the bucketed table scan and the operator with an interesting partition, the bucketed table scan's partitioning will be destroyed by the shuffle operator in the middle, so the bucketed table scan is certainly not needed.

The idea of "interesting partition" is inspired by "interesting order" in "Access Path Selection in a Relational Database Management System" (http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan.

### Why are the changes needed?

To avoid unnecessary bucketed scans in the query; this is also a prerequisite for #29625 (deciding bucketed sorted scan dynamically will be added later in that PR).

### Does this PR introduce _any_ user-facing change?

A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, which is set to false by default (the rule is disabled by default as it can regress cached bucketed table queries, see discussion in #29804 (comment)). Users can opt in/out by enabling/disabling the config. As we found in production, some users rely on the assumption that the number of tasks equals the number of buckets when reading a bucketed table, to precisely control the number of tasks. This is a bad assumption, but it does happen on our side, so we leave a config here to allow them to opt out of the feature.

### How was this patch tested?

Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`.

Closes #29804 from c21/bucket-rule.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
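As a hypothetical usage sketch of opting into that rule (the config name comes from the PR text above; the session setup, table name, and query below are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: enabling the rule from #29804. The config name is from
// the PR description; the table and query here are made up for illustration.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")

// A scan-only query has no operator requiring a clustered distribution
// (no "interesting partition"), so the rule may disable the bucketed scan
// and let Spark use normal file splits instead of one task per bucket.
spark.sql("SELECT * FROM bucketed_t WHERE x > 0").explain()
```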
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@c21 I wanted to ask if you were planning on continuing this PR now that https://github.com/apache/spark/pull/29804/files has been merged?
@rahij - yes I am. Will raise a PR soon, thanks.
@c21 Just wanted to check if you had any time to revive this PR. I'm happy to create a new PR off this branch, but I figured you know this code much better than I do in case we need to change/fix something.
@rahij - sure, I will take a look at palantir#730. Sorry, I was busy with something else; I will have a PR ready this week. Btw, I got feedback earlier for this PR that we want to enable/disable sorted bucketed scan automatically, similar to #29804, so I need more time to work on that, thanks.
Thank you, that makes sense. You don't have to take a look at palantir#730 - our current master branch is not up to date with apache/spark right now (but hopefully will be soon), so we can just cherry-pick whatever you get merged upstream once our branch is up to date.
@rahij - yes, sure. Sorry, I misread "revive" as "review" by mistake :)
What changes were proposed in this pull request?
This PR is to support reading sorted bucketed tables efficiently in the data source v1 read path. Previously, in `FileSourceScanExec` -> `FileScanRDD`, we read input files sequentially one by one. For a sorted bucketed table, which can potentially have multiple sorted files per bucket, the sort ordering for each bucket cannot be preserved if we read the sorted files one by one. This PR adds support for reading the sorted files together in a sort-merge way, to preserve the ordering for each bucket.

Specifically the code change is:

- Add a parameter `ScanMode` (which can be either `RowMode` (read row by row - non-vectorized), `BatchMode` (read batch by batch - vectorized), or `SortedBucketMode` (read rows in a sort-merge way - the sorted bucketed file case)) to `FileScanRDD`. `FileScanRDD.compute()` decides which iterator to use based on this `ScanMode`.
- Extract the existing iterator logic in `FileScanRDD` into `BaseFileScanIterator`, which holds the common logic for file scan. `BaseFileScanIterator` has 3 subclasses: `FileRowScanIterator` reads row by row for each file (`RowMode`), `FileBatchScanIterator` reads batch by batch for each file (`BatchMode`), and `FileSortedBucketScanIterator` reads row by row following the row ordering across multiple files (`SortedBucketMode`).
- `FileSortedBucketScanIterator`: a priority queue is used to output rows from multiple sorted files based on the table's sort columns.

The PR also separates the logic for the existing row scan and batch scan, so we no longer need the per-row `if (nextElement.isInstanceOf[ColumnarBatch]) {` check at line 100 of `FileScanRDD`, which was a long-standing TODO.

The whole feature is controlled by a new config `spark.sql.sources.bucketing.sortedScan.enabled`, which is disabled by default, as there can be a risk of reading more data per task and causing OOM (especially with vectorization, where we need to keep one batch of each file in memory at any time). In addition, we need to set `InputFileBlockHolder` for each row (as any row from any file can be output), so it can be costly.

Why are the changes needed?
Right now, for a sorted bucketed table, even though each individual file is sorted, the current file scan approach does not preserve the sort ordering across files in one bucket, so a sort is still needed before a sort-merge join or sort aggregate. If the table is big, an external sort can happen and cause extra CPU and IO overhead. This PR introduces a code path to read multiple sorted files in a sort-merge way, avoiding the later sort before the join or aggregate and saving CPU and IO resources.
In our internal fork, we added similar support (https://www.youtube.com/watch?v=brzInUisshY&feature=youtu.be&t=910), though the code path is different (we are reading/writing Hive tables).
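A minimal sketch of the sort-merge reading idea described above, using a priority queue over per-file sorted iterators (the generic types and the name `mergeSorted` are illustrative, not the PR's actual code):

```scala
import scala.collection.mutable

// Merge several individually sorted iterators (one per bucket file) into a
// single sorted iterator, the way FileSortedBucketScanIterator is described
// to work: a min-heap keyed on each iterator's current head element.
def mergeSorted[T](iters: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
  // Scala's PriorityQueue is a max-heap, so reverse the ordering on heads.
  val heap = mutable.PriorityQueue.empty[BufferedIterator[T]](
    Ordering.by((it: BufferedIterator[T]) => it.head).reverse)
  iters.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
  new Iterator[T] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): T = {
      val it = heap.dequeue()           // file with the smallest current row
      val row = it.next()
      if (it.hasNext) heap.enqueue(it)  // re-insert under its new head row
      row
    }
  }
}
```

For example, merging `Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9))` yields 1 through 9 in order. Note that one open iterator per file stays live for the whole scan, which mirrors the memory-cost caveat in the PR description.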
Does this PR introduce any user-facing change?
Yes. A new user-facing config `spark.sql.sources.bucketing.sortedScan.enabled` is introduced to allow users to read sorted bucketed tables efficiently (e.g. no sort for a sort-merge join / sort aggregate on bucketed columns).

Example query plan with no `Sort` before `SortMergeJoin`:

How was this patch tested?
Added unit tests in `BucketedReadSuite.scala` and `ExplainSuite.scala`.