
add support for reading multiple sorted files per bucket #730

Closed
rahij wants to merge 8 commits into master from rr/sorted-merge

Conversation


@rahij rahij commented Feb 10, 2021

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

This PR is a modified version of an as-yet-unmerged upstream PR (which the author plans to reopen soon): apache#29625. Since we are not fully caught up with the 3.0 branch and we need this feature internally, I have adapted it to work on our branch with the smallest set of changes required.

What changes were proposed in this pull request?

Quick background: When there are multiple files in a single bucket, Spark does not propagate the sort ordering to the FileSourceScanExec node. This means that even if a parent operator requires a child ordering equal to the file ordering within the buckets, we still end up sorting every partition. This PR propagates the sort ordering and creates an RDD that produces rows by merging the per-file sorted iterators.

The diff looks a bit large but the actual changes are minimal:

  • FileScanRDD used to contain all the logic to produce the next rows. In this PR, we instead pass it a ScanMode, which delegates to a different iterator when a sorted bucketed scan is needed.
  • FileScanIterators contains a BaseFileScanIterator and three subclasses: one each for row-based scans, columnar batch scans, and the sorted bucketed scan.
  • The methods in BaseFileScanIterator are copied verbatim from FileScanRDD, except for https://github.com/palantir/spark/pull/730/files#diff-c64b05200405088131067d856ed7d9d29290d47881018c7a7b0db4668ddda9d3R140-R143.
  • The next methods implemented by FileRowScanIterator and FileBatchScanIterator are likewise copied verbatim from FileScanRDD, except that we have removed the if-else check there and split it into two separate iterators, similar to the upstream PR. This is purely cleanup, and I can merge them back if you prefer.
  • The next method in FileSortedBucketScanIterator is the core logic of this change and is copied verbatim from the upstream PR. It maintains a min-heap of the next element from each backing iterator and returns the head. Note that this increases the memory footprint for the vectorized readers, since it holds the next batch from every backing iterator in memory.
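The min-heap merge in the last bullet can be sketched in isolation. This is a hypothetical, simplified standalone version (plain integers instead of InternalRow, a materialized result instead of a lazy iterator), not the actual PR code: each backing iterator yields values in sorted order, the heap holds one "head" per iterator, and polling always returns the globally smallest remaining element.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedBucketMerge {

    // Merge k sorted iterators into one sorted list (k-way merge).
    static List<Integer> mergeSorted(List<Iterator<Integer>> iters) {
        // A heap entry pairs the current head value of one backing
        // iterator with the iterator it came from.
        class Head {
            final int value;
            final Iterator<Integer> source;
            Head(int value, Iterator<Integer> source) {
                this.value = value;
                this.source = source;
            }
        }

        PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.comparingInt(h -> h.value));

        // Seed the heap with the first element of each non-empty iterator.
        for (Iterator<Integer> it : iters) {
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }

        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();          // globally smallest remaining head
            out.add(h.value);
            if (h.source.hasNext()) {      // refill from the same iterator
                heap.add(new Head(h.source.next(), h.source));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three "files" in one bucket, each individually sorted.
        List<Iterator<Integer>> files = List.of(
            List.of(1, 4, 7).iterator(),
            List.of(2, 5, 8).iterator(),
            List.of(3, 6, 9).iterator());
        System.out.println(mergeSorted(files)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Because each per-file iterator is already sorted, the merge produces a fully sorted stream per bucket with O(log k) work per row for k files, which is why the scan can satisfy the parent's required ordering without a separate sort step.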

Whatever conflicts this causes with our 3.0 branch, I can take responsibility for resolving them. Once the upstream PR has merged and we are up to date with 3.0, I will revert this PR and cherry-pick the upstream one.

How was this patch tested?

Unit tests. The feature is also hidden behind a flag, like the upstream PR, so we can enable it selectively at first before rolling it out more widely.

cc @mattsills
