[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs #8178
JoshRosen wants to merge 7 commits into apache:master
Conversation
The problematic call is `getCacheLocs(rdd)(partition)` in `getPreferredLocsInternal`: `getCacheLocs` returns a `Seq` that is, in practice, a `List`, so indexing into it costs O(partitions).
I noticed this while running a very simple scheduling throughput benchmark under the YourKit Java profiler with CPU tracing enabled. Here's a comparison of two trace results for scheduling a job with 10000 no-op tasks, clearly illustrating the slowdown.
Also, note that the actual max scheduling throughput is much higher with tracing disabled; I can schedule over 5000 tasks / second on my laptop.
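A benchmark along these lines can be as simple as timing a job made of many no-op tasks. This is a minimal sketch under assumed settings (a `local[4]` master and 10000 single-element partitions), not necessarily the exact benchmark used here:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SchedulingThroughputBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("scheduling-throughput"))
    val numTasks = 10000
    // One element per partition, so the job consists of numTasks no-op tasks
    // and the wall-clock time is dominated by scheduler overhead, not task work.
    val start = System.nanoTime()
    sc.parallelize(1 to numTasks, numTasks).count()
    val secs = (System.nanoTime() - start) / 1e9
    println(f"$numTasks tasks in $secs%.2f s (${numTasks / secs}%.0f tasks/sec)")
    sc.stop()
  }
}
```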
Test build #40813 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #1595 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #40888 has finished for PR 8178 at commit
nice find. We could also use an … anyway, just some random thoughts. lgtm pending tests.
Test build #1615 has finished for PR 8178 at commit

Test build #1624 has finished for PR 8178 at commit
One potential gotcha of using arrays: we might run into problems with array equality checks returning `false` for arrays with the same contents.
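For context, the gotcha is that Scala's `Array` inherits Java's reference equality rather than `Seq`'s structural equality, so `==` comparisons can silently change behavior after a `Seq` -> `Array` swap. A quick REPL sketch:

```scala
val a1 = Array(1, 2, 3)
val a2 = Array(1, 2, 3)
a1 == a2                      // false: Array uses reference equality
a1.sameElements(a2)           // true: element-wise comparison must be explicit
Seq(1, 2, 3) == Seq(1, 2, 3)  // true: Seq compares contents
```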
This reverts commit fe918a9.
Test build #40972 has finished for PR 8178 at commit

Test build #40974 has finished for PR 8178 at commit
I've gone ahead and minimized this to just the …
Test build #40998 timed out for PR 8178 at commit
super nit, but any reason for all the `size` -> `length` changes? just seems like a bit of noise if we ever look in git history for these lines.
I'm not sure how true it is with more recent versions of Scala, but there at least was a time when `Array#size` didn't perform nearly as well as `Array#length`.
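A short illustration of the distinction (to the best of my understanding; the gap may be gone in newer Scala versions): `length` reads the JVM array's built-in field directly, while `size` is only available through the implicit conversion to the `ArrayOps` wrapper:

```scala
val arr = Array.fill(1000000)(0)

arr.length  // direct read of the JVM array's length field
arr.size    // routed through the implicit ArrayOps wrapper; on older
            // Scala versions this conversion added measurable overhead
```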
thanks for updating josh. still lgtm pending tests from me. (left one minor comment, your discretion to update).
Jenkins, retest this please

Test build #41003 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #41022 has finished for PR 8178 at commit

retest this please
LGTM, will merge once we pass tests.
Test build #41035 has finished for PR 8178 at commit

Test build #1633 has finished for PR 8178 at commit

Test build #1634 has finished for PR 8178 at commit

Test build #1635 has finished for PR 8178 at commit

Test build #1638 has finished for PR 8178 at commit

Test build #1637 timed out for PR 8178 at commit

Test build #1639 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #41095 has finished for PR 8178 at commit

Jenkins, retest this please.

Test build #41138 timed out for PR 8178 at commit
I'm going to merge this: the unit tests just took longer to run, but the relevant tests actually passed.
[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs

In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):

```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```

It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.

This patch fixes this by replacing `Seq` with `Array`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8178 from JoshRosen/dagscheduler-perf.

(cherry picked from commit 010b03e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
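To make the asymptotics in that description concrete, the following sketch contrasts the two indexing behaviors; the `time` helper is ad hoc and the absolute numbers will vary by machine:

```scala
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

val numItems = 100000
val asSeq: Seq[Int]     = Seq.fill(numItems)(1)   // actually a List: apply is O(i)
val asArray: Array[Int] = Array.fill(numItems)(1) // apply is O(1)

time("List indexing, O(N^2) overall") { for (i <- 0 until numItems) asSeq(i) }
time("Array indexing, O(N) overall")  { for (i <- 0 until numItems) asArray(i) }
```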
Test build #1663 has finished for PR 8178 at commit

In Scala,
Seq.fillalways seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In
getPreferredLocsInternal, there's a call togetCacheLocs(rdd)(partition). ThegetCacheLocscall returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.This patch fixes this by replacing
SeqwithArray.