[SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator #6772

JoshRosen · 2015-06-12T00:15:53Z

In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:

Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
By performing the shuffle ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.

This patch removes Exchange's canSortWithShuffle path and the associated code in SqlSerializer2. Shuffles that used to go through the canSortWithShuffle path would always wind up using Spark's ExternalSorter (inside of HashShuffleReader); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.

JoshRosen · 2015-06-12T01:23:59Z

/cc @yhuai @marmbrus @rxin, this is a blocker for my Tungsten sorting patch for Spark SQL; the changes here will allow my SQL-specific sorter to be used in more circumstances where we sort.

SparkQA · 2015-06-12T02:04:07Z

Test build #34733 has finished for PR 6772 at commit ebf9c0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2015-06-12T02:07:05Z

LGTM

marmbrus · 2015-06-12T05:14:59Z

Merging to master.

In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this: - Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side. - By performing the shuffle ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted. - We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying. This patch removes Exchange's `canSortWithShuffle` path and the associated code in `SqlSerializer2`. Shuffles that used to go through the `canSortWithShuffle` path would always wind up using Spark's `ExternalSorter` (inside of `HashShuffleReader`); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default. Author: Josh Rosen <[email protected]> Closes apache#6772 from JoshRosen/SPARK-8317 and squashes the following commits: ebf9c0f [Josh Rosen] Do not push sort into shuffle in Exchange operator bf3b4c8 [Josh Rosen] Enable external sort by default

JoshRosen added 2 commits June 11, 2015 17:07

Enable external sort by default

bf3b4c8

Do not push sort into shuffle in Exchange operator

ebf9c0f

JoshRosen changed the title ~~[SPARK-8317] Do not push sort into shuffle in Exchange operator~~ [SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator Jun 12, 2015

JoshRosen mentioned this pull request Jun 12, 2015

[SPARK-7078] [SPARK-7079] Binary processing sort for Spark SQL #6444

Closed

asfgit closed this in b9d177c Jun 12, 2015

JoshRosen mentioned this pull request Jul 6, 2015

[MINOR] [SQL] remove unused code in Exchange #7234

Closed

JoshRosen deleted the SPARK-8317 branch April 14, 2016 18:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator #6772

[SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator #6772

Uh oh!

JoshRosen commented Jun 12, 2015

Uh oh!

JoshRosen commented Jun 12, 2015

Uh oh!

SparkQA commented Jun 12, 2015

Uh oh!

marmbrus commented Jun 12, 2015

Uh oh!

marmbrus commented Jun 12, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator #6772

[SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator #6772

Uh oh!

Conversation

JoshRosen commented Jun 12, 2015

Uh oh!

JoshRosen commented Jun 12, 2015

Uh oh!

SparkQA commented Jun 12, 2015

Uh oh!

marmbrus commented Jun 12, 2015

Uh oh!

marmbrus commented Jun 12, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants