Skip to content

Conversation

@JoshRosen
Copy link
Contributor

In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:

  • Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
  • By performing the shuffle ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
  • We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.

This patch removes Exchange's canSortWithShuffle path and the associated code in SqlSerializer2. Shuffles that used to go through the canSortWithShuffle path would always wind up using Spark's ExternalSorter (inside of HashShuffleReader); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.

@JoshRosen
Copy link
Contributor Author

/cc @yhuai @marmbrus @rxin, this is a blocker for my Tungsten sorting patch for Spark SQL; the changes here will allow my SQL-specific sorter to be used in more circumstances where we sort.

@JoshRosen JoshRosen changed the title [SPARK-8317] Do not push sort into shuffle in Exchange operator [SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator Jun 12, 2015
@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34733 has finished for PR 6772 at commit ebf9c0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

LGTM

@marmbrus
Copy link
Contributor

Merging to master.

@asfgit asfgit closed this in b9d177c Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:

- Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
- By performing the shuffle ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
- We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.

This patch removes Exchange's `canSortWithShuffle` path and the associated code in `SqlSerializer2`.  Shuffles that used to go through the `canSortWithShuffle` path would always wind up using Spark's `ExternalSorter` (inside of `HashShuffleReader`); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.

Author: Josh Rosen <[email protected]>

Closes apache#6772 from JoshRosen/SPARK-8317 and squashes the following commits:

ebf9c0f [Josh Rosen] Do not push sort into shuffle in Exchange operator
bf3b4c8 [Josh Rosen] Enable external sort by default
@JoshRosen JoshRosen deleted the SPARK-8317 branch April 14, 2016 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants