
Conversation

@ala (Contributor) commented Feb 23, 2018

What changes were proposed in this pull request?

The algorithm in DefaultPartitionCoalescer.setupGroups is responsible for picking preferred locations for coalesced partitions. It analyzes the preferred locations of the input partitions. It starts by trying to create one partition for each unique location in the input. However, if the requested number of coalesced partitions is higher than the number of unique locations, it has to pick duplicate locations.

Previously, the duplicate locations would be picked by iterating over the input partitions in order, and copying their preferred locations to coalesced partitions. If the input partitions were clustered by location, this could result in severe skew.

With the fix, instead of iterating over the list of input partitions in order, we pick them at random. It's not perfectly balanced, but it's much better.
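
For illustration, here is a minimal, simplified sketch of the difference. The helper name and signature are invented for this sketch; `partsWithLocs` mirrors the field referenced in the diff below, and the real setupGroups code has more bookkeeping:

```scala
import scala.util.Random

// Simplified sketch: choose preferred locations for the remaining coalesced
// partitions once every unique location has already been used once.
def pickExtraLocations(
    partsWithLocs: IndexedSeq[(String, Int)],  // (preferred location, partition index)
    numToPick: Int,
    rnd: Random): Seq[(String, Int)] = {
  // Old behavior: walk the input in order, so if the input is clustered by
  // location, the first location dominates the extra picks.
  //   partsWithLocs.take(numToPick)

  // New behavior: sample the input at random, which spreads the extra picks
  // across locations much more evenly (though not perfectly).
  Seq.fill(numToPick)(partsWithLocs(rnd.nextInt(partsWithLocs.length)))
}
```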

How was this patch tested?

A unit test reproducing the behavior was added.

@ala (Contributor, Author) commented Feb 23, 2018

@hvanhovell

@mgaido91 (Contributor) left a comment

As I commented on the JIRA I'd prefer another approach to tackle this.

Anyway, if that is not feasible and we are going ahead with this approach, then as it stands this is a potential source of variable processing times in users' workflows: i.e., previously a given job was likely to always take about the same time to run, whereas with this change the same job can run with two very different timings. I am wondering if we can give the user some control over this, like a config property for setting the seed. What do you think?
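
For concreteness, the kind of knob being suggested might look roughly like this. The config key `spark.rdd.coalesce.seed` is purely hypothetical and does not exist in Spark; this is only a sketch of the idea:

```scala
import org.apache.spark.SparkConf
import scala.util.Random

// Hypothetical: let users pin the coalescer's RNG seed via configuration,
// falling back to the currently hard-coded constant. The key name is made up.
val conf = new SparkConf()
val coalesceSeed = conf.getLong("spark.rdd.coalesce.seed", 7919L)
val rnd = new Random(coalesceSeed)
```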

}.collect()
}

test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
Contributor

This test looks to me like a good candidate for flakiness, since we are picking random numbers. Can we set the seed in order to avoid this?

Contributor Author

The test is in fact deterministic. The seed is already fixed here:

val rnd = new scala.util.Random(7919) // keep this class deterministic

Contributor

I see, thanks, sorry, I missed it

@ala (Contributor, Author) commented Feb 23, 2018

Thanks for the comments.

I don't think users should be impacted by varying execution times. If the parameters of the job are constant, then the partition allocation should also be deterministic, since the seed is fixed in CoalescedRDD.scala. There was already a degree of randomization in DefaultPartitionCoalescer.pickBin(), which could lead to some fluctuation, so this is not a big difference.

TBH, I'm just trying to merge upstream a fix we've implemented for the client. I agree much more could be done to improve coalesce, and if someone would be interested in looking into it, I'm all for it.
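
To illustrate the determinism point: scala.util.Random yields the same sequence for a given seed, so with the seed hard-coded the "random" picks repeat across runs (a minimal sketch, not Spark code):

```scala
import scala.util.Random

// Two generators with the same seed yield identical sequences, so repeated
// runs of the same job make the same partition picks.
val run1 = { val r = new Random(7919); Seq.fill(5)(r.nextInt(100)) }
val run2 = { val r = new Random(7919); Seq.fill(5)(r.nextInt(100)) }
assert(run1 == run2)
```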

@SparkQA commented Feb 23, 2018

Test build #87632 has finished for PR 20664 at commit 6d67dfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      while (numCreated < targetLen) {
-       val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(tries)
-       tries += 1
+       val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(
Contributor

Perhaps add a comment to explain the purpose of this change here?

// Without the fix these would be:
// numPartsPerLocation(locations(0)) == numCoalescedPartitions - 1
// numPartsPerLocation(locations(1)) == 1
assert(numPartsPerLocation(locations(0)) > 0.4 * numCoalescedPartitions)
Contributor

How confident are we that the assert condition will hold? How was the fraction 0.4 chosen?

Contributor

nvm, the result is deterministic.

Contributor Author

Added comment about flakiness & fixed seed.

.groupBy(identity)
.mapValues(_.size)

// Without the fix these would be:
Contributor

Normally we don't write comments this way; maybe just say that we want to ensure the location preferences are distributed evenly.

Contributor Author

Done.

@jiangxb1987 (Contributor)

This LGTM overall, just some nits.

@SparkQA commented Feb 26, 2018

Test build #87672 has finished for PR 20664 at commit 0512736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor)

retest this please

@SparkQA commented Feb 27, 2018

Test build #87697 has finished for PR 20664 at commit 0512736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ala (Contributor, Author) commented Feb 27, 2018

retest this please

@SparkQA commented Feb 27, 2018

Test build #87715 has finished for PR 20664 at commit 0512736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ala (Contributor, Author) commented Feb 28, 2018

retest this please

@SparkQA commented Feb 28, 2018

Test build #87771 has finished for PR 20664 at commit 0512736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ala (Contributor, Author) commented Mar 1, 2018

retest this please

@SparkQA commented Mar 1, 2018

Test build #87827 has finished for PR 20664 at commit 0512736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ala (Contributor, Author) commented Mar 2, 2018

retest this please

@ala (Contributor, Author) commented Mar 5, 2018

retest this please

@SparkQA commented Mar 5, 2018

Test build #87954 has finished for PR 20664 at commit 0512736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor)

Merging to master. Thanks!

@mgaido91 if you feel this should be different, feel free to open a follow-up.

@asfgit closed this in 42cf48e on Mar 5, 2018
mgaido91 pushed a commit to mgaido91/spark that referenced this pull request Mar 5, 2018
…skewed by the order of input partitions

Author: Ala Luszczak <[email protected]>

Closes apache#20664 from ala/SPARK-23496.