[SPARK-23496][CORE] Locality of coalesced partitions can be severely skewed by the order of input partitions #20664
Conversation
mgaido91
left a comment
As I commented on the JIRA, I'd prefer another approach to tackle this.
Anyway, if that is not feasible and we go ahead with this approach, then as it stands this is a potential source of varying processing times in users' workflows: previously, a given flow was likely to always take the same time to run, whereas with this change the same job can show two very different timings. I am wondering if we can give the user some control over this, like a config property for setting the seed. What do you think?
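(For illustration only, a minimal sketch of what such a knob could look like; the config key and wiring below are hypothetical and do not exist in Spark.)

```scala
// Hypothetical sketch: "spark.rdd.coalesce.seed" is an invented key, not a real Spark config.
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Fall back to a hard-coded seed when the user does not set anything.
val seed = conf.getLong("spark.rdd.coalesce.seed", 7919L)
val rnd = new scala.util.Random(seed) // the same seed yields the same partition grouping
```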
```scala
  }.collect()
}

test("SPARK-23496: order of input partitions can result in severe skew in coalesce")
```
this test looks to me like a good candidate for flakiness, since we are picking random numbers. Can we set the seed in order to avoid this?
The test is in fact deterministic. The seed is already fixed here:
```scala
val rnd = new scala.util.Random(7919) // keep this class deterministic
```
I see, thanks, sorry, I missed it
Thanks for the comments. I don't think users should be impacted by changing execution times: if the parameters of the job are constant, then the partition allocation should also be deterministic, since the seed is fixed.

TBH, I'm just trying to merge upstream a fix we've implemented for the client. I agree much more could be done to improve coalesce, and if someone is interested in looking into it, I'm all for it.
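(A tiny illustration of the determinism argument: a fixed-seed `scala.util.Random` produces the same sequence on every run, so constant job parameters lead to the same partition grouping.)

```scala
// With an identical seed, two Random instances generate identical sequences,
// so repeated runs of the same job make the same "random" choices.
val a = new scala.util.Random(7919)
val b = new scala.util.Random(7919)
assert(Seq.fill(5)(a.nextInt(100)) == Seq.fill(5)(b.nextInt(100)))
```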
Test build #87632 has finished for PR 20664 at commit
```diff
 while (numCreated < targetLen) {
-  val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(tries)
-  tries += 1
+  val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(
```
Perhaps add a comment to explain the purpose of this change here?
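(For readers skimming the diff, here is a rough, self-contained sketch of the idea behind the change; the types and names are simplified stand-ins inferred from the snippet above, not the exact PR code.)

```scala
// Simplified stand-in for the duplicate-location loop in setupGroups.
// partsWithLocs pairs a preferred host with an input partition index; the first
// half of the input is clustered on hostA, the second half on hostB.
val partsWithLocs: Array[(String, Int)] =
  Array.tabulate(100)(i => (if (i < 50) "hostA" else "hostB", i))
val targetLen = 50
val rnd = new scala.util.Random(7919)

// Old behaviour: walk partsWithLocs in order, so the clustered hostA entries dominate.
val sequentialPicks = (0 until targetLen).map(tries => partsWithLocs(tries)._1)

// New behaviour: draw a random index each time, spreading picks across the whole input.
val randomPicks = (0 until targetLen).map(_ => partsWithLocs(rnd.nextInt(partsWithLocs.length))._1)

println(sequentialPicks.count(_ == "hostA")) // 50: every pick comes from the hostA cluster
println(randomPicks.count(_ == "hostA"))     // roughly 25: close to the input's 50/50 split
```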
```scala
// Without the fix these would be:
// numPartsPerLocation(locations(0)) == numCoalescedPartitions - 1
// numPartsPerLocation(locations(1)) == 1
assert(numPartsPerLocation(locations(0)) > 0.4 * numCoalescedPartitions)
```
How confident are we that the assert condition will hold? How was the fraction 0.4 chosen?
nvm, the result is deterministic.
Added comment about flakiness & fixed seed.
```scala
  .groupBy(identity)
  .mapValues(_.size)

// Without the fix these would be:
```
Normally we don't write comments this way; maybe just say that we want to ensure the location preferences are distributed evenly.
Done.
This LGTM overall, just some nits.

Test build #87672 has finished for PR 20664 at commit

retest this please

Test build #87697 has finished for PR 20664 at commit

retest this please

Test build #87715 has finished for PR 20664 at commit

retest this please

Test build #87771 has finished for PR 20664 at commit

retest this please

Test build #87827 has finished for PR 20664 at commit

retest this please

1 similar comment

retest this please

Test build #87954 has finished for PR 20664 at commit

Merging to master. Thanks! @mgaido91 if you feel this should be different, feel free to open a follow-up.
Author: Ala Luszczak <[email protected]>

Closes apache#20664 from ala/SPARK-23496.
What changes were proposed in this pull request?
The algorithm in `DefaultPartitionCoalescer.setupGroups` is responsible for picking preferred locations for coalesced partitions. It analyzes the preferred locations of input partitions. It starts by trying to create one partition for each unique location in the input. However, if the requested number of coalesced partitions is higher than the number of unique locations, it has to pick duplicate locations.

Previously, the duplicate locations would be picked by iterating over the input partitions in order and copying their preferred locations to coalesced partitions. If the input partitions were clustered by location, this could result in severe skew.
With the fix, instead of iterating over the list of input partitions in order, we pick them at random. It's not perfectly balanced, but it's much better.
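The effect can be illustrated with a small standalone sketch (invented host names and simplified data structures, not the actual `DefaultPartitionCoalescer` internals):

```scala
import scala.util.Random

// 100 input partitions clustered by location: 90 prefer hostA, 10 prefer hostB.
val inputLocs = (0 until 100).map(i => if (i < 90) "hostA" else "hostB")
val numCoalesced = 20

// One coalesced partition per unique location first; the rest are duplicates.
val unique = inputLocs.distinct
val duplicatesNeeded = numCoalesced - unique.size

// Before the fix: duplicates taken from the input in order, so hostA dominates.
val before = unique ++ inputLocs.take(duplicatesNeeded)

// After the fix: duplicates picked at random indices, spreading them across the input.
val rnd = new Random(7919) // fixed seed keeps the grouping deterministic across runs
val after = unique ++ Seq.fill(duplicatesNeeded)(inputLocs(rnd.nextInt(inputLocs.length)))

println(before.groupBy(identity).mapValues(_.size).toMap) // Map(hostA -> 19, hostB -> 1)
println(after.groupBy(identity).mapValues(_.size).toMap)  // roughly proportional to the input
```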
How was this patch tested?
A unit test reproducing the behavior was added.
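A hedged sketch of what such a regression check can look like when written against the public RDD API (host names, sizes, and the 0.4 threshold mirror the discussion above, but this is not the exact test added by the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a skew regression check: input partitions are clustered by preferred
// location (first half hostA, second half hostB); after coalescing, neither host
// should dominate the coalesced partitions' preferred locations.
object CoalesceSkewCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("coalesce-skew"))
    try {
      val numInput = 100
      val numCoalesced = 50
      // makeRDD accepts (value, preferredLocations) pairs and creates one partition per element.
      val data = (0 until numInput).map { i =>
        (i, Seq(if (i < numInput / 2) "hostA" else "hostB"))
      }
      val coalesced = sc.makeRDD(data).coalesce(numCoalesced)
      val numPartsPerLocation = coalesced.partitions
        .map(p => coalesced.preferredLocations(p).headOption.getOrElse("none"))
        .groupBy(identity)
        .mapValues(_.size)
      // Without the fix, one location would get numCoalesced - 1 of the groups.
      assert(numPartsPerLocation("hostA") > 0.4 * numCoalesced)
      assert(numPartsPerLocation("hostB") > 0.4 * numCoalesced)
    } finally {
      sc.stop()
    }
  }
}
```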