[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs #8178
JoshRosen wants to merge 7 commits into apache:master
Conversation
The problematic call is `getCacheLocs(rdd)(partition)` in `getPreferredLocsInternal`: `getCacheLocs` returns a `Seq` that is, in practice, a `List`, so indexing into it costs O(partitions).
I noticed this while running a very simple scheduling throughput benchmark under the YourKit Java profiler with CPU tracing enabled. Here's a comparison of two trace results for scheduling a job with 10000 no-op tasks, clearly illustrating the slowdown.
Also, note that the actual max scheduling throughput is much higher with tracing disabled; I can schedule over 5000 tasks / second on my laptop.
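A benchmark along these lines can be as simple as timing a job made of many no-op tasks. This is a minimal sketch under assumed settings (a `local[4]` master and 10000 single-element partitions), not necessarily the exact benchmark used here:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SchedulingThroughputBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("scheduling-throughput"))
    val numTasks = 10000
    // One element per partition, so the job consists of numTasks no-op tasks
    // and the wall-clock time is dominated by scheduler overhead, not task work.
    val start = System.nanoTime()
    sc.parallelize(1 to numTasks, numTasks).count()
    val secs = (System.nanoTime() - start) / 1e9
    println(f"$numTasks tasks in $secs%.2f s (${numTasks / secs}%.0f tasks/sec)")
    sc.stop()
  }
}
```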
Test build #40813 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #1595 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #40888 has finished for PR 8178 at commit
nice find. We could also use an … anyway, just some random thoughts. lgtm pending tests.
Test build #1615 has finished for PR 8178 at commit

Test build #1624 has finished for PR 8178 at commit
One potential gotcha of using arrays: we might run into problems with array equality checks returning `false` for arrays with the same contents.
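For context, the gotcha is that Scala's `Array` inherits Java's reference equality rather than `Seq`'s structural equality, so `==` comparisons can silently change behavior after a `Seq` -> `Array` swap. A quick REPL sketch:

```scala
val a1 = Array(1, 2, 3)
val a2 = Array(1, 2, 3)
a1 == a2                      // false: Array uses reference equality
a1.sameElements(a2)           // true: element-wise comparison must be explicit
Seq(1, 2, 3) == Seq(1, 2, 3)  // true: Seq compares contents
```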
This reverts commit fe918a9.
Test build #40972 has finished for PR 8178 at commit

Test build #40974 has finished for PR 8178 at commit
I've gone ahead and minimized this to just the …
Test build #40998 timed out for PR 8178 at commit
super nit, but any reason for all the `size` -> `length` changes? just seems like a bit of noise if we ever look in git history for these lines.
I'm not sure how true it is with more recent versions of Scala, but there at least was a time when `Array#size` didn't perform nearly as well as `Array#length`.
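A short illustration of the distinction (to the best of my understanding; the gap may be gone in newer Scala versions): `length` reads the JVM array's built-in field directly, while `size` is only available through the implicit conversion to the `ArrayOps` wrapper:

```scala
val arr = Array.fill(1000000)(0)

arr.length  // direct read of the JVM array's length field
arr.size    // routed through the implicit ArrayOps wrapper; on older
            // Scala versions this conversion added measurable overhead
```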
thanks for updating josh. still lgtm pending tests from me. (left one minor comment, your discretion to update).
Jenkins, retest this please

Test build #41003 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #41022 has finished for PR 8178 at commit

retest this please
LGTM, will merge once we pass tests.
Test build #41035 has finished for PR 8178 at commit

Test build #1633 has finished for PR 8178 at commit

Test build #1634 has finished for PR 8178 at commit

Test build #1635 has finished for PR 8178 at commit

Test build #1638 has finished for PR 8178 at commit

Test build #1637 timed out for PR 8178 at commit

Test build #1639 timed out for PR 8178 at commit

Jenkins, retest this please.

Test build #41095 has finished for PR 8178 at commit

Jenkins, retest this please.

Test build #41138 timed out for PR 8178 at commit
I'm going to merge this: the unit tests just took longer to run, but the relevant tests actually passed.
[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs

In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):

```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```

It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.

This patch fixes this by replacing `Seq` with `Array`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8178 from JoshRosen/dagscheduler-perf.

(cherry picked from commit 010b03e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
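To make the asymptotics in that description concrete, the following sketch contrasts the two indexing behaviors; the `time` helper is ad hoc and the absolute numbers will vary by machine:

```scala
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

val numItems = 100000
val asSeq: Seq[Int]     = Seq.fill(numItems)(1)   // actually a List: apply is O(i)
val asArray: Array[Int] = Array.fill(numItems)(1) // apply is O(1)

time("List indexing, O(N^2) overall") { for (i <- 0 until numItems) asSeq(i) }
time("Array indexing, O(N) overall")  { for (i <- 0 until numItems) asArray(i) }
```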
Test build #1663 has finished for PR 8178 at commit

In Scala,
Seq.fillalways seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In
getPreferredLocsInternal, there's a call togetCacheLocs(rdd)(partition). ThegetCacheLocscall returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.This patch fixes this by replacing
SeqwithArray.