[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() #1978

davies · 2014-08-16T00:36:03Z

Use external sort to support sorting large datasets in the reduce stage.
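For readers unfamiliar with the technique, here is a minimal sketch of an external sort in plain Python. It is not the code from this PR: the function names and the `chunk_size` parameter (a stand-in for a real memory limit) are hypothetical, and it assumes Python 3.5+ so that `heapq.merge` accepts `key` and `reverse`. Sorted runs are spilled to temporary files and then merged lazily.

```python
import heapq
import itertools
import pickle
import tempfile


def _dump_run(sorted_items):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for item in sorted_items:
        pickle.dump(item, f)
    f.seek(0)
    return f


def _load_run(f):
    """Lazily yield the items of a spilled run, closing the file at the end."""
    try:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
    finally:
        f.close()


def external_sorted(iterable, key=None, reverse=False, chunk_size=10000):
    """Sort an iterable that may not fit in memory.

    chunk_size is a hypothetical stand-in for a real memory limit.
    """
    iterator = iter(iterable)
    runs = []
    while True:
        # Pull at most chunk_size items into memory at a time.
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            break
        chunk.sort(key=key, reverse=reverse)
        # Spill the sorted run to disk and keep only a lazy reader for it.
        runs.append(_load_run(_dump_run(chunk)))
    # Merge all runs lazily; key= and reverse= need Python 3.5+.
    return heapq.merge(*runs, key=key, reverse=reverse)
```

Only one chunk is held in memory during the sort phase, and the merge phase holds one item per run.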

JoshRosen · 2014-08-16T06:25:56Z

Jenkins, test this please.

SparkQA · 2014-08-16T06:30:04Z

QA tests have started for PR 1978 at commit 55602ee.

This patch merges cleanly.

SparkQA · 2014-08-16T07:24:05Z

QA tests have finished for PR 1978 at commit 55602ee.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class Serializer
- abstract class SerializerInstance
- abstract class SerializationStream
- abstract class DeserializationStream
- class ExternalSorter(object):

JoshRosen · 2014-08-19T21:33:06Z

I think we also need to add a license statement to the LICENSE file (like we've done with CloudPickle and Py4J).

JoshRosen · 2014-08-19T21:36:20Z

python/pyspark/shuffle.py

As of my new PR, this will need to be changed to "SPARK_LOCAL_DIRS" (plural).

JoshRosen · 2014-08-19T21:38:51Z

Why do we need to use heapq3? Is there a way to support this feature using the standard Python 2.7 heapq?

davies · 2014-08-19T21:46:39Z

In Python 2.6/2.7, heapq.merge() does not support key and reverse.
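For context, `heapq.merge()` gained the `key` and `reverse` arguments in Python 3.5. The sketch below shows what those arguments do on a current Python 3 interpreter; the sample data and variable names are made up for illustration.

```python
import heapq

# Two runs, each already sorted by the second tuple field.
run_a = [("a", 1), ("c", 4), ("e", 9)]
run_b = [("b", 2), ("d", 3), ("f", 7)]


def by_count(pair):
    return pair[1]


# key= lets merge() compare on a derived value instead of the whole item.
ascending = list(heapq.merge(run_a, run_b, key=by_count))

# With reverse=True, each input must already be sorted from largest to smallest.
descending = list(
    heapq.merge(run_a[::-1], run_b[::-1], key=by_count, reverse=True)
)
```

Without these arguments, a caller has to decorate every item with its key before merging and strip the key afterwards.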

change SPARK_LOCAL_DIR into SPARK_LOCAL_DIRS

davies · 2014-08-19T22:56:03Z

cc @mateiz

SparkQA · 2014-08-19T23:00:18Z

QA tests have started for PR 1978 at commit 644abaf.

This patch does not merge cleanly!

SparkQA · 2014-08-19T23:00:24Z

QA tests have finished for PR 1978 at commit 644abaf.

This patch fails unit tests.
This patch does not merge cleanly!

davies · 2014-08-20T00:03:12Z

Jenkins, retest this please.

Conflicts: python/pyspark/tests.py

SparkQA · 2014-08-20T00:05:29Z

QA tests have started for PR 1978 at commit 644abaf.

This patch does not merge cleanly!

SparkQA · 2014-08-20T00:05:36Z

QA tests have finished for PR 1978 at commit 644abaf.

This patch fails unit tests.
This patch does not merge cleanly!

SparkQA · 2014-08-20T00:10:21Z

QA tests have started for PR 1978 at commit eb53ca6.

This patch merges cleanly.

SparkQA · 2014-08-20T00:10:27Z

QA tests have finished for PR 1978 at commit eb53ca6.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExternalSorter(object):

JoshRosen · 2014-08-20T00:16:29Z

Test failure is due to RAT complaining about some temporary files left behind by another test:

=========================================================================
Running Apache RAT checks
=========================================================================
Could not find Apache license headers in the following files:
 !????? /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/checkpoint/.temp.crc
 !????? /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/checkpoint/temp

There are a few ways to fix this:

- Add an exclude to .rat-excludes.
- Modify the pull request builder to use git clean to remove these untracked files from the working tree.
- Configure RAT to only check files tracked by git (by using git ls-files).

JoshRosen · 2014-08-20T00:20:50Z

@massie @jey @pwendell Is there a reason why it would be unsafe to run git clean in the pull request builder? Would this inadvertently delete any files that it needs, such as Spark configurations?

JoshRosen · 2014-08-20T00:23:39Z

I ssh'd into the worker and deleted that checkpoint directory, so maybe it will work now.

Jenkins, retest this please.

davies · 2014-08-20T00:51:12Z

Jenkins, retest this please

SparkQA · 2014-08-20T00:55:29Z

QA tests have started for PR 1978 at commit eb53ca6.

This patch merges cleanly.

SparkQA · 2014-08-20T01:53:43Z

QA tests have finished for PR 1978 at commit eb53ca6.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

Conflicts: python/pyspark/rdd.py python/pyspark/shuffle.py

davies · 2014-08-20T23:31:05Z

Jenkins, test this please.

SparkQA · 2014-08-20T23:35:37Z

QA tests have started for PR 1978 at commit 1f075ed.

This patch merges cleanly.

SparkQA · 2014-08-21T00:32:24Z

QA tests have finished for PR 1978 at commit 1f075ed.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)
- class ExternalSorter(object):

mateiz · 2014-08-25T23:27:37Z

python/pyspark/shuffle.py

Because there will be multiple Python worker processes running on the same node, if they all need to spill, it looks like they'll use the same directories in order here. Can you instead start each one at a random ID and then increment that to have it cycle through?

I'm not sure whether this can also affect the external hashing code, but if so, it would be good to fix that too (as a separate JIRA).

Basically I'm worried that everyone writes to disk1 first, then everyone writes to disk2, etc, and we only use one disk at a time.

Good catch. Maybe shuffling the directories randomly at the beginning would be better.

PS: Could there be a configurable policy for choosing local disks, such as using the first one as much as possible? It would be useful when one of the local disks is an SSD.

Yeah good question. We don't have that yet, but in the future we'll have support for multiple local storage levels.
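The suggestion in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code that was committed: `spill_dirs` and `get_local_dirs` are hypothetical helpers, and the comma-separated format and the `/tmp` fallback for `SPARK_LOCAL_DIRS` are assumptions. Each process starts at a random offset into the directory list and then cycles, so concurrent workers do not all hit the same disk first.

```python
import itertools
import os
import random


def get_local_dirs(env=os.environ):
    """Read the candidate spill directories.

    Assumes SPARK_LOCAL_DIRS is comma-separated; the /tmp fallback is
    an assumption made for this sketch.
    """
    return env.get("SPARK_LOCAL_DIRS", "/tmp").split(",")


def spill_dirs(local_dirs, rng=random):
    """Return an endless iterator over local_dirs, starting at a random offset."""
    dirs = list(local_dirs)
    start = rng.randrange(len(dirs))
    # Rotate so this process begins at its own random position, then cycle.
    rotated = dirs[start:] + dirs[:start]
    return itertools.cycle(rotated)
```

Shuffling the list once per process, as proposed in the reply, spreads the load in a similar way; rotation keeps the original relative order of the disks.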

davies · 2014-08-26T00:18:34Z

@mateiz I have addressed the comments above. It also fixes the same problem for the external merger. Please take another look, thanks.

SparkQA · 2014-08-26T00:20:47Z

QA tests have started for PR 1978 at commit b125d2f.

This patch merges cleanly.

SparkQA · 2014-08-26T01:15:45Z

QA tests have finished for PR 1978 at commit b125d2f.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExternalSorter(object):

mateiz · 2014-08-26T18:28:23Z

python/pyspark/tests.py

Have you tested that this actually spills any data? I guess it does because the bare Python interpreter already consumes more than 1 MB?

mateiz · 2014-08-26T18:28:54Z

Looks pretty good, just added one question about the test

davies · 2014-08-26T19:03:48Z

@mateiz Added a check for spilled bytes in the tests.
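The idea of asserting that a spill really happened can be illustrated with a toy sorter that counts the bytes it writes. Everything here is hypothetical (`CountingSorter`, `max_items`, `spilled_bytes`) and is not the PySpark test itself; spilled runs are read back eagerly to keep the sketch short.

```python
import heapq
import pickle
import tempfile


class CountingSorter(object):
    """Toy external sorter that records how many bytes it spilled.

    Hypothetical: max_items stands in for a real memory limit.
    """

    def __init__(self, max_items):
        self.max_items = max_items
        self.spilled_bytes = 0

    def _spill(self, chunk):
        chunk.sort()
        with tempfile.TemporaryFile() as f:
            pickle.dump(chunk, f)
            # Record how much was written so tests can assert on it.
            self.spilled_bytes += f.tell()
            f.seek(0)
            # Read the run back eagerly; a real sorter would stream it.
            return pickle.load(f)

    def sorted(self, iterable):
        runs, chunk = [], []
        for item in iterable:
            chunk.append(item)
            if len(chunk) >= self.max_items:
                runs.append(self._spill(chunk))
                chunk = []
        chunk.sort()
        runs.append(chunk)
        return list(heapq.merge(*runs))


# A test should check both the result and that the spill path was exercised.
data = [(i * 37) % 1000 for i in range(1000)]
sorter = CountingSorter(max_items=100)
assert sorter.sorted(data) == sorted(data)
assert sorter.spilled_bytes > 0
```

Checking the counter guards against a test that passes only because the data happened to fit in memory.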

SparkQA · 2014-08-26T19:06:20Z

QA tests have started for PR 1978 at commit bbcd9ba.

This patch merges cleanly.

SparkQA · 2014-08-26T20:02:12Z

QA tests have finished for PR 1978 at commit bbcd9ba.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
- $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
- case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)
- In multiclass classification, all$2^`
- public final class JavaDecisionTree
- class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable
- class BoundedFloat(float):
- class ExternalSorter(object):
- class JoinedRow2 extends Row
- class JoinedRow3 extends Row
- class JoinedRow4 extends Row
- class JoinedRow5 extends Row
- class GenericRow(protected[sql] val values: Array[Any]) extends Row
- abstract class MutableValue extends Serializable
- final class MutableInt extends MutableValue
- final class MutableFloat extends MutableValue
- final class MutableBoolean extends MutableValue
- final class MutableDouble extends MutableValue
- final class MutableShort extends MutableValue
- final class MutableLong extends MutableValue
- final class MutableByte extends MutableValue
- final class MutableAny extends MutableValue
- final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
- case class CollectHashSetFunction(
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountFunction(
- case class CountDistinctFunction(
- case class MaxOf(left: Expression, right: Expression) extends Expression
- case class NewSet(elementType: DataType) extends LeafExpression
- case class AddItemToSet(item: Expression, set: Expression) extends Expression
- case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
- case class CountSet(child: Expression) extends UnaryExpression
- case class ExplainCommand(plan: LogicalPlan, extended: Boolean = false) extends Command

mateiz · 2014-08-26T23:56:24Z

Cool, thanks! Going to merge this.

mateiz · 2014-08-26T23:57:26Z

BTW can you send a PR for the randomizing change to branch-1.1? I don't think we'll add sorting in branch-1.1 since it's a new feature, but we can add that randomizing patch as a bug fix. Or do you think it won't matter much?

davies · 2014-08-27T01:09:33Z

@mateiz PR #2152

Using external sort to support sort large datasets in reduce stage.

Author: Davies Liu <[email protected]>

Closes apache#1978 from davies/sort and squashes the following commits:

bbcd9ba [Davies Liu] check spilled bytes in tests
b125d2f [Davies Liu] add test for external sort in rdd
eae0176 [Davies Liu] choose different disks from different processes and instances
1f075ed [Davies Liu] Merge branch 'master' into sort
eb53ca6 [Davies Liu] Merge branch 'master' into sort
644abaf [Davies Liu] add license in LICENSE
19f7873 [Davies Liu] improve tests
55602ee [Davies Liu] use external sort in sortBy() and sortByKey()

use external sort in sortBy() and sortByKey()

55602ee

davies mentioned this pull request Aug 16, 2014

[SPARK-3074] [PySpark] support groupByKey() with single huge key #1977

Closed

davies added 2 commits August 19, 2014 15:09

improve tests

19f7873

add license in LICENSE

644abaf

change SPARK_LOCAL_DIR into SPARK_LOCAL_DIRS

Merge branch 'master' into sort

eb53ca6

Conflicts: python/pyspark/tests.py

JoshRosen mentioned this pull request Aug 20, 2014

[HOTFIX][Streaming][MLlib] use temp folder for checkpoint #2046

Closed

Merge branch 'master' into sort

1f075ed

Conflicts: python/pyspark/rdd.py python/pyspark/shuffle.py

davies added 2 commits August 25, 2014 17:05

choose different disks from different processes and instances

eae0176

add test for external sort in rdd

b125d2f

check spilled bytes in tests

bbcd9ba

asfgit closed this in f1e71d4 Aug 27, 2014

davies deleted the sort branch September 15, 2014 22:19
