[SPARK-7826][CORE] Suppress extra calling getCacheLocs. #6352

Closed

ueshin wants to merge 9 commits into apache:master from ueshin:issues/SPARK-7826


Conversation

@ueshin
Member

@ueshin ueshin commented May 22, 2015

There are too many extra calls to the getCacheLocs method in DAGScheduler, each of which involves Akka communication.
To improve DAGScheduler performance, suppress these extra calls.

In my application with over 1200 stages, the execution time dropped from 8.5 min to 3.8 min with my patch.

@srowen
Member

srowen commented May 22, 2015

Can you explain why it's valid to proceed without the call when there is only one dependency?
Also, it looks like you're actually adding calls to getCacheLocs. I don't see an explanation, and the description isn't consistent with the change.

@SparkQA

SparkQA commented May 22, 2015

Test build #33339 has finished for PR 6352 at commit 9a80fad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented May 22, 2015

@srowen, Thank you for checking.

  1. To check whether the parent stages are missing, we only need to check the locations of RDDs that have a ShuffleDependency, because narrowly-dependent RDDs would be in the same locations, so I moved the getCacheLocs call into the case shufDep.
  2. I was afraid that the dependency graph could become large (if there are a lot of union or zipPartitions calls, etc.), so I also added a location check at the point where the RDD has more than one dependency.
    • We might not need line 389 if we don't have to consider that case.

@srowen
Member

srowen commented May 22, 2015

Is part 2 really just to be safe? It seems essential. Are you saying that only shuffle dependencies have more than one dependency? Also, this adds a new call for all dependencies; doesn't that mostly defeat the purpose? I am not an expert on this code, but I am not sure the logic is clear here.

@ueshin
Member Author

ueshin commented May 22, 2015

The call will occur not for all RDDs in the stage but only when:

  1. the RDD has a ShuffleDependency, or
  2. the RDD has more than one dependency, regardless of the dependency type.

I should also have mentioned that calling it for the same RDD is not a problem, because the location is already cached.

@JoshRosen
Contributor

(Warning: drive-by comment; I'll look at this patch in more detail later)

One high-level comment:

For any patch which modifies scheduler internals, we should err on the side of extremely liberal commenting of code, even if this means paragraph-long comments. If it's tricky enough to merit a question in a GitHub code review, then it deserves a comment. For instance, the rdd.dependencies.size < 2 check could benefit from a nearby comment that explains why this is safe.

@ueshin
Member Author

ueshin commented May 23, 2015

Oops, I found that I misunderstood what the getCacheLocs method is doing here.
I'll change the way the Akka communications are suppressed in the next push, so please check this PR again after that.

@JoshRosen, Thank you for your comment.
I'll add comments in the next push.

@ueshin changed the title from "[SPARK-7826][CORE] Suppress extra calling getCacheLocs." to "[WIP][SPARK-7826][CORE] Suppress extra calling getCacheLocs." on May 23, 2015
Contributor

As a general aside, I find getCacheLocs(rdd).contains(Nil) to be hard to understand to begin with. I think that this condition is meant to be read as "if at least one partition of this RDD is not cached anywhere...". Maybe this code would be easier to review / parse if we extracted this condition into a variable, perhaps a lazy val if we want to short-circuit, named rddHasUncachedPartitions, or !rddIsCached if we don't mind negation.
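
A minimal, self-contained sketch of that extraction (a toy model, not the real DAGScheduler code: cache locations are represented here as one Seq of host names per partition, where Nil means the partition is not cached anywhere):

```scala
object CacheLocsReadability extends App {
  // Toy stand-in for getCacheLocs(rdd): one Seq of locations per partition.
  val cacheLocsForRdd: Seq[Seq[String]] = Seq(Seq("hostA"), Nil, Seq("hostB"))

  // The suggested named condition, reading as
  // "at least one partition of this RDD is not cached anywhere".
  val rddHasUncachedPartitions = cacheLocsForRdd.contains(Nil)

  assert(rddHasUncachedPartitions) // partition 1 above has no cached location
}
```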

@JoshRosen
Contributor

Oh, one other thought: maybe a good exercise would be to attempt to write the Scaladoc comment for getMissingParentStages which describes, in prose, the basic high-level algorithm for finding missing parent stages. I can help with this tomorrow. Even if you don't end up modifying getMissingParentStages, I'd love to submit a new PR that just comments / explains the existing code in order to make this easier to understand in the future.

To help me build some intuition for understanding your optimization here:

It looks like this only saves us from performing getCacheLocs lookups in cases where we're traversing backwards through a long chain of narrow dependencies. I don't think that this is necessarily safe. Imagine that we have a lineage graph which looks something like this:

┌───┐ shuffle ┌───┐    ┌───┐          
│ A │◀ ─ ─ ─ ─│ B │◀───│ C │◀─┐       
└───┘         └───┘    └───┘  │  ┌───┐
                              ├──│ E │
                       ┌───┐  │  └───┘
                       │ D │◀─┘       
                       └───┘              

Here, E has one-to-one dependencies on C and D. C is derived from A by performing a shuffle and then a map. If we're trying to determine which ancestor stages need to be computed in order to compute E, we need to figure out whether the shuffle A -> B should be performed. If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't need to compute A, even if it has some unavailable output partitions. The same goes for B: if B is 100% cached, then we can avoid the shuffle on A. Based on this, I don't think that we can make a local decision to skip the caching check based on the structure of the RDD graph. However, we might be able to skip / optimize this check based on RDDs' storage levels: in long chains of narrow dependencies, most RDDs probably aren't cached, so adding a simple if StorageLevel = None return Seq.fill(numPartitions)(Nil) check to getCacheLocs might be safe / sufficient.

Someone more familiar with StorageLevel / caching semantics should double-check this reasoning to make sure that I'm not overlooking any corner-cases when RDDs' storage levels change due to unpersist / cache / persist calls.
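
A minimal sketch of the short-circuit suggested above (a toy model under stated assumptions, not the real DAGScheduler internals: the Boolean markedForCaching stands in for rdd.getStorageLevel != StorageLevel.NONE, and the block manager round trip is stubbed out):

```scala
import scala.collection.mutable

object GetCacheLocsShortCircuitSketch extends App {
  // Memo of previously fetched cache locations, keyed by RDD id.
  val cacheLocs = mutable.HashMap.empty[Int, Seq[Seq[String]]]

  // Stand-in for the Akka round trip to the block manager master.
  def fetchFromBlockManager(rddId: Int, numPartitions: Int): Seq[Seq[String]] =
    Seq.fill(numPartitions)(Nil)

  def getCacheLocs(rddId: Int, numPartitions: Int, markedForCaching: Boolean): Seq[Seq[String]] = {
    if (!markedForCaching) {
      // The "StorageLevel = None" case: no partition can possibly be cached,
      // so return early and skip the block manager lookup entirely.
      return Seq.fill(numPartitions)(Nil)
    }
    cacheLocs.getOrElseUpdate(rddId, fetchFromBlockManager(rddId, numPartitions))
  }

  // An RDD that was never persisted never touches the block manager
  // (and, in this early-return form, never populates the memo either).
  getCacheLocs(rddId = 1, numPartitions = 4, markedForCaching = false)
  assert(!cacheLocs.contains(1))
}
```

Note that in this early-return form the check happens before the memo map is consulted or updated; that ordering is exactly what is discussed further down in the thread.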

@JoshRosen
Contributor

Also: if my above reasoning is right and this optimization is incorrect, then it's concerning that it didn't cause a test failure. My hunch is that we don't have unit tests for the particular combinations of RDD dependency graphs, caching states, and map output availability that would expose this issue. It would be nice to write a failing regression test which would have caught the problems in the current version of this patch, since that will help us to gain confidence that the new optimizations are safe.

@ueshin
Member Author

ueshin commented May 23, 2015

@JoshRosen Thank you for the details.
That is exactly what I noticed yesterday.
I'm modifying DAGScheduler and adding tests.
I'll push the next version as soon as possible.

@SparkQA

SparkQA commented May 23, 2015

Test build #33402 has finished for PR 6352 at commit b9c835c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 23, 2015

Test build #33403 has finished for PR 6352 at commit 6f3125c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin changed the title from "[WIP][SPARK-7826][CORE] Suppress extra calling getCacheLocs." to "[SPARK-7826][CORE] Suppress extra calling getCacheLocs." on May 23, 2015
@ueshin
Member Author

ueshin commented May 23, 2015

I pushed the new version and the tests passed.
@JoshRosen, @srowen, Could you please take a look at this PR again?
Thanks.

Contributor

As a general style note, I'd try to avoid using return in Scala code, since there are some corner-cases where using it can lead to exception-handling issues (plus it results in slightly inefficient code which uses exceptions for control flow).

@JoshRosen
Contributor

Thanks for adding that test. This patch looks like it's in pretty good shape, but before we consider merging there are one or two other minor corner cases that I'd like to explore.

In the current implementation of getCacheLocs, we first check to see whether the RDD's cache locations have been previously fetched; if so, we return the "cached" set of cache locations, and otherwise we fetch the set of locations from the block manager and store it in the cache. This patch's optimization takes place prior to checking / updating the cacheLocs map, meaning that it might slightly change behavior. Specifically, I'm wondering what would happen in the old code if we called getCacheLocs() on an RDD that wasn't cached, then cached the RDD, forced it to be computed, then called getCacheLocs() again as part of a different job. In the old code, an empty set of cache locations would have been stored in the map on the first call, so I don't think the second call would see an updated set of cache locations unless we cleared them. In the new code, non-cached RDDs won't ever cause entries to be stored in cacheLocs, so it's possible that the effects of caching might become visible sooner after this patch than they would have in the old code. This might be safe, but if we make this change then it should be deliberate / knowing. If we want to be really conservative, maybe we should move the storageLevel check inside the if (!cacheLocs.contains(rdd.id)) block in order to better preserve the old behavior.

@JoshRosen
Contributor

Actually, it looks like we end up calling clearCacheLocs() when submitting a new job, so the change described above probably doesn't make a difference. To be safe, though, and to eliminate the return, let's go ahead and move it into the cacheLocs-updating block. Once we do that, I think this will be good to go, but I'll probably pull in a scheduler maintainer for a final spot-check / review.
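
Continuing the toy sketch from above (same simplified stand-ins, not the real code), moving the storage-level check inside the memoizing "not yet looked up" branch preserves the old behavior of recording an entry in cacheLocs on the first call and also eliminates the early return:

```scala
import scala.collection.mutable

object GetCacheLocsMemoizedSketch extends App {
  val cacheLocs = mutable.HashMap.empty[Int, Seq[Seq[String]]]

  def fetchFromBlockManager(rddId: Int, numPartitions: Int): Seq[Seq[String]] =
    Seq.fill(numPartitions)(Nil)

  def getCacheLocs(rddId: Int, numPartitions: Int, markedForCaching: Boolean): Seq[Seq[String]] = {
    if (!cacheLocs.contains(rddId)) {
      // The storage-level check now lives inside the "first lookup" branch:
      // an uncached RDD still gets an (empty) entry memoized, as in the old
      // code, but without the block manager round trip.
      val locs: Seq[Seq[String]] =
        if (!markedForCaching) Seq.fill(numPartitions)(Nil)
        else fetchFromBlockManager(rddId, numPartitions)
      cacheLocs(rddId) = locs
    }
    cacheLocs(rddId)
  }

  getCacheLocs(rddId = 1, numPartitions = 4, markedForCaching = false)
  assert(cacheLocs.contains(1)) // an entry is memoized even for an uncached RDD
}
```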

Contributor

To clarify for other reviewers, I think that we need these cache() calls so these other tests don't fail due to the skipping of the cached locations lookups.

@ueshin
Member Author

ueshin commented May 24, 2015

@JoshRosen, Thank you for your comment.
I agree with you and moved the storageLevel check into the if block.

@SparkQA

SparkQA commented May 24, 2015

Test build #33422 has finished for PR 6352 at commit d858b59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

LGTM. /ping @markhamstra or @kayousterhout for final sign-off on scheduler-related changes.

@markhamstra
Contributor

LGTM

Contributor

Why isn't D a missing parent stage here?

Contributor

It looks like what happens is that the call to submit() causes the first set of missing parent stages to be submitted, so at that point, stage D is submitted. Can you add a comment describing this?

Contributor

Since there's a one-to-one dependency from D to E, won't D and E be computed in the same stage?

Contributor

Ah I see. What if we changed this test to, instead of directly calling getMissingParentStages, just directly inspect DAGScheduler.runningStages (since that's already private[scheduler]) to make sure it contains the one stage we expect? I'd find that more intuitive, since that more directly tests the underlying issue we're trying to verify.

Contributor

That's a good idea; let's do this.

Member Author

@kayousterhout, Thank you for checking this PR.
I see. Should I revert getMissingParentStages to private?

Contributor

Yes, if we're not going to use it in the test suite, then it should go back to private.

Member Author

Ah, I found that only checking whether DAGScheduler.runningStages contains one stage is not enough, because it would also contain one stage (the one including A) if C were not cached yet.
I think we should also check the size of the final stage's missing parents.

Contributor

I was thinking you could inspect the contents of the stages in runningStages to make sure the ID is correct.

Sent from my iPhone

On May 26, 2015, at 7:53 PM, Takuya UESHIN notifications@github.com wrote:

In core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:

```scala
   * If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't
   * need to compute A, even if it has some unavailable output partitions. The same goes for B:
   * if B is 100% cached, then we can avoid the shuffle on A.
   */
  test("SPARK-7826: regression test for getMissingParentStages") {
    val rddA = new MyRDD(sc, 1, Nil)
    val rddB = new MyRDD(sc, 1, List(new ShuffleDependency(rddA, null)))
    val rddC = new MyRDD(sc, 1, List(new OneToOneDependency(rddB))).cache()
    val rddD = new MyRDD(sc, 1, Nil)
    val rddE = new MyRDD(sc, 1,
      List(new OneToOneDependency(rddC), new OneToOneDependency(rddD)))

    cacheLocations(rddC.id -> 0) =
      Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB"))

    val jobId = submit(rddE, Array(0))
    val finalStage = scheduler.jobIdToActiveJob(jobId).finalStage
    assert(scheduler.getMissingParentStages(finalStage).size === 0)
```

> Ah, I found that only checking if the DAGScheduler.runningStages contains one stage is not enough because it also contains one stage including A if the C is not cached yet.
> I think we should also check the size of the final stage's missing parents.



Member Author

Ah, runningStages contains one stage and its ID is 1, right?

@ueshin
Member Author

ueshin commented May 27, 2015

I modified the unit test.
Thank you all for your guidance.

@SparkQA

SparkQA commented May 27, 2015

Test build #33559 has finished for PR 6352 at commit 10b1b22.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented May 27, 2015

Retest this please.

@kayousterhout
Contributor

Jenkins, retest this please

@SparkQA

SparkQA commented May 27, 2015

Test build #33575 has finished for PR 6352 at commit 10b1b22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Can you actually change this to:
assert(scheduler.runningStages.head.isInstanceOf[ResultStage])?

And then add a comment saying something like "Make sure that the scheduler is running the final result stage. Because C is cached, the shuffle map stage to compute A does not need to be run."

Contributor

(I think this is more intuitive; otherwise, it's hard for someone looking at this to understand why the ID should be 1. This also makes the test more agnostic to unrelated scheduler internals, like if we change the way we assign IDs to stages)
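
As a sketch, the suggested assertion and comment might look roughly like this inside the test quoted earlier (assuming the suite's scheduler handle and the ResultStage class are in scope, as in DAGSchedulerSuite):

```scala
    // Make sure that the scheduler is running the final result stage. Because C
    // is cached, the shuffle map stage to compute A does not need to be run.
    assert(scheduler.runningStages.size === 1)
    assert(scheduler.runningStages.head.isInstanceOf[ResultStage])
```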

@kayousterhout
Contributor

Just a few more comments on improving the documentation and understandability of the test. @JoshRosen has recently pointed out that the scheduler code is extremely difficult to understand and check for correctness, and I think having easily understandable and well-documented tests is a step towards making the scheduler code more friendly.

@ueshin
Member Author

ueshin commented May 28, 2015

I made the changes you mentioned.
I agree about the difficulties, so please let me know if there are other things I can do here.
Thanks.

@SparkQA

SparkQA commented May 28, 2015

Test build #33629 has finished for PR 6352 at commit 3d4d036.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented May 28, 2015

Retest this please.

@SparkQA

SparkQA commented May 28, 2015

Test build #33631 has finished for PR 6352 at commit 3d4d036.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented May 28, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented May 28, 2015

Test build #33637 has finished for PR 6352 at commit 3d4d036.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

LGTM

@asfgit asfgit closed this in 9b692bf May 29, 2015
@kayousterhout
Contributor

Thanks @ueshin! I merged this since @JoshRosen and @markhamstra LGTM'ed a while ago.

@ueshin
Member Author

ueshin commented May 29, 2015

@kayousterhout, Thank you for merging this!

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
There are too many extra calls to the `getCacheLocs` method in `DAGScheduler`, each of which involves Akka communication.
To improve `DAGScheduler` performance, suppress these extra calls.

In my application with over 1200 stages, the execution time dropped from 8.5 min to 3.8 min with my patch.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes apache#6352 from ueshin/issues/SPARK-7826 and squashes the following commits:

3d4d036 [Takuya UESHIN] Modify a test and the documentation.
10b1b22 [Takuya UESHIN] Simplify the unit test.
d858b59 [Takuya UESHIN] Move the storageLevel check inside the if (!cacheLocs.contains(rdd.id)) block.
6f3125c [Takuya UESHIN] Fix scalastyle.
b9c835c [Takuya UESHIN] Put the condition that checks if the RDD has uncached partition or not into variable for readability.
f87f2ec [Takuya UESHIN] Get cached locations from block manager only if the storage level of the RDD is not StorageLevel.NONE.
8248386 [Takuya UESHIN] Revert "Suppress extra calling getCacheLocs."
a4d944a [Takuya UESHIN] Add an unit test.
9a80fad [Takuya UESHIN] Suppress extra calling getCacheLocs.
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015