[SPARK-7826][CORE] Suppress extra calling getCacheLocs. #6352
ueshin wants to merge 9 commits into apache:master
|
Can you explain why it's valid to proceed without the call when there is 1 dependency? |
|
Test build #33339 has finished for PR 6352 at commit
|
|
@srowen, thank you for checking.
|
|
Is part 2 really just to be safe? It seems essential. Are you saying that only shuffle dependencies have more than 1 dependency? Also this adds a new call to all dependencies. Doesn't this mostly defeat the purpose? I am not an expert on this code but I am not sure the logic is clear here |
|
The call will not happen for every RDD in the stage, but only when:
I should also have mentioned that repeating the call for the same RDD is not a problem, because the locations are already cached. |
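As an illustration of why a repeated lookup for the same RDD is cheap, here is a minimal sketch of a memoized lookup. It is a toy model, not the actual DAGScheduler code: `askBlockManager`, `remoteLookups`, and the `Seq[Seq[String]]` host lists are assumptions made up for this example.

```scala
import scala.collection.mutable

object CacheLocsMemo {
  // Hypothetical stand-in for the expensive remote lookup; counts how often it runs.
  var remoteLookups = 0

  private def askBlockManager(rddId: Int, numPartitions: Int): Seq[Seq[String]] = {
    remoteLookups += 1
    // Pretend every partition is cached on a single host.
    Seq.fill(numPartitions)(Seq("hostA"))
  }

  // Locations already looked up, keyed by RDD id.
  private val cacheLocs = mutable.HashMap.empty[Int, Seq[Seq[String]]]

  // Only the first call for a given RDD id reaches askBlockManager;
  // later calls are answered from the local map.
  def getCacheLocs(rddId: Int, numPartitions: Int): Seq[Seq[String]] =
    cacheLocs.getOrElseUpdate(rddId, askBlockManager(rddId, numPartitions))
}
```

Calling `CacheLocsMemo.getCacheLocs(7, 2)` twice triggers only one remote lookup in this model.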
|
(Warning: drive-by comment; I'll look at this patch in more detail later) One high-level comment: For any patch which modifies scheduler internals, we should err on the side of extremely liberal commenting of code, even if this means paragraph-long comments. If it's tricky enough to merit a question in a GitHub code review, then it deserves a comment. For instance, the |
|
Oops, I found that I misunderstood what the method @JoshRosen, Thank you for your comment. |
|
As a general aside, I find getCacheLocs(rdd).contains(Nil) to be hard to understand to begin with. I think that this condition is meant to be read as "if at least one partition of this RDD is not cached anywhere...". Maybe this code would be easier to review / parse if we extracted this condition into a variable, perhaps a lazy val if we want to short-circuit, named rddHasUncachedPartitions, or !rddIsCached if we don't mind negation.
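The suggested refactoring could look roughly like the sketch below. It is self-contained and does not use Spark: the per-partition host lists and the `decide` helper are assumptions for illustration, and only the name `rddHasUncachedPartitions` comes from the comment above.

```scala
object UncachedPartitions {
  // One entry per partition; each entry lists the hosts holding a cached copy.
  // An empty entry (Nil) means that partition is not cached anywhere.
  type Locations = Seq[Seq[String]]

  def decide(locs: Locations, stageIsUnvisited: Boolean): String = {
    // Named condition: "at least one partition of this RDD is not cached anywhere".
    // As a lazy val it is evaluated only if the first operand of && is true.
    lazy val rddHasUncachedPartitions = locs.contains(Nil)
    if (stageIsUnvisited && rddHasUncachedPartitions) "visit parents" else "stop here"
  }
}
```

With a named condition, the `if` reads as a sentence, and the `lazy val` keeps the short-circuit behavior of the inline expression.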
|
Oh, one other thought: maybe a good exercise would be to attempt to write the Scaladoc comment for To help me build some intuition for understanding your optimization here: It looks like this only save us from performing Here, Someone more familiar with StorageLevel / caching semantics should double-check this reasoning to make sure that I'm not overlooking any corner-cases when RDDs' storage levels change due to unpersist / cache / persist calls. |
|
Also: if my above reasoning is right and this optimization is incorrect, then it's concerning that it didn't cause a test failure. My hunch is that we don't have unit tests for the particular combinations of RDD dependency graphs, caching states, and map output availability that would expose this issue. It would be nice to write a failing regression test which would have caught the problems in the current version of this patch, since that will help us to gain confidence that the new optimizations are safe. |
|
@JoshRosen, thank you for the details. |
|
Test build #33402 has finished for PR 6352 at commit
|
|
Test build #33403 has finished for PR 6352 at commit
|
|
I pushed and the test passed. |
|
As a general style note, I'd try to avoid using return in Scala code, since there are some corner-cases where using it can lead to exception-handling issues (plus it results in slightly inefficient code which uses exceptions for control flow).
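To make the style note concrete, here is a small sketch contrasting the two styles. The object and method names are invented for this example and are unrelated to the patch.

```scala
object AvoidReturn {
  // `return` inside the closure passed to foreach is a non-local return:
  // the compiler implements it by throwing and catching a control-flow exception.
  def containsNegativeWithReturn(xs: Seq[Int]): Boolean = {
    xs.foreach { x =>
      if (x < 0) return true
    }
    false
  }

  // Same result without `return`: exists stops at the first matching element.
  def containsNegative(xs: Seq[Int]): Boolean = xs.exists(_ < 0)
}
```

Both methods give the same answer, but the second cannot be broken by a surrounding handler that catches the control-flow exception.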
|
Thanks for adding that test. This patch looks like it's in pretty good shape, but before we consider merging there's one or two other minor corner-cases that I'd like to explore. In the current implementation of |
|
Actually, it looks like we end up calling |
|
To clarify for other reviewers, I think that we need these cache() calls so these other tests don't fail due to the skipping of the cached locations lookups.
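The skipping referred to here matches the commit note "Get cached locations from block manager only if the storage level of the RDD is not StorageLevel.NONE". The sketch below models that rule with toy types. `ToyRdd`, `Level`, `queryBlockManager`, and the counter are assumptions for illustration, not Spark's real classes or the patch's actual code.

```scala
import scala.collection.mutable

object StorageLevelGate {
  // Toy replacement for a storage level: either not persisted, or persisted in memory.
  sealed trait Level
  case object NoneLevel extends Level
  case object MemoryOnly extends Level

  final case class ToyRdd(id: Int, numPartitions: Int, level: Level)

  // Hypothetical stand-in for the remote block manager query; counts invocations.
  var blockManagerQueries = 0
  private def queryBlockManager(rdd: ToyRdd): IndexedSeq[Seq[String]] = {
    blockManagerQueries += 1
    IndexedSeq.fill(rdd.numPartitions)(Seq("hostA"))
  }

  private val cacheLocs = mutable.HashMap.empty[Int, IndexedSeq[Seq[String]]]

  def getCacheLocs(rdd: ToyRdd): IndexedSeq[Seq[String]] =
    cacheLocs.getOrElseUpdate(rdd.id, {
      // An RDD that was never marked for caching cannot have cached blocks,
      // so answer locally with "no locations" for every partition.
      if (rdd.level == NoneLevel) IndexedSeq.fill(rdd.numPartitions)(Nil)
      else queryBlockManager(rdd)
    })
}
```

In this model, a test that registers cache locations for an RDD without marking it as cached would never see them, which is consistent with the need for the `cache()` calls mentioned above.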
|
@JoshRosen, Thank you for your comment. |
|
Test build #33422 has finished for PR 6352 at commit
|
|
LGTM. /ping @markhamstra or @kayousterhout for final sign-off on scheduler-related changes. |
|
LGTM |
|
Why isn't D a missing parent stage here?
|
It looks like what happens is that the call to submit() causes the first set of missing parent stages to be submitted, so at that point, stage D is submitted. Can you add a comment describing this?
|
Since there's a one-to-one dependency from D to E, won't D and E be computed in the same stage?
|
Ah I see. What if we changed this test to, instead of directly calling getMissingParentStages, just directly inspect DAGScheduler.runningStages (since that's already private[scheduler]) to make sure it contains the one stage we expect? I'd find that more intuitive, since that more directly tests the underlying issue we're trying to verify.
|
That's a good idea; let's do this.
|
@kayousterhout, thank you for checking this PR.
I see. Should I revert getMissingParentStages to private?
|
Yes, if we're not going to use it in the test suite, then it should go back to private.
|
Ah, I found that checking only that DAGScheduler.runningStages contains one stage is not enough: if C is not cached yet, it also contains exactly one stage, namely the one that includes A.
I think we should also check the size of the final stage's missing parents.
|
I was thinking you could inspect the contents of the stages in runningStages to make sure the ID is correct.
On May 26, 2015, at 7:53 PM, Takuya UESHIN notifications@github.com wrote:
In core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
- * If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't
- * need to compute A, even if it has some unavailable output partitions. The same goes for B:
- * if B is 100% cached, then we can avoid the shuffle on A.
- */
- test("SPARK-7826: regression test for getMissingParentStages") {
- val rddA = new MyRDD(sc, 1, Nil)
- val rddB = new MyRDD(sc, 1, List(new ShuffleDependency(rddA, null)))
- val rddC = new MyRDD(sc, 1, List(new OneToOneDependency(rddB))).cache()
- val rddD = new MyRDD(sc, 1, Nil)
- val rddE = new MyRDD(sc, 1,
-   List(new OneToOneDependency(rddC), new OneToOneDependency(rddD)))
- cacheLocations(rddC.id -> 0) =
-   Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB"))
- val jobId = submit(rddE, Array(0))
- val finalStage = scheduler.jobIdToActiveJob(jobId).finalStage
- assert(scheduler.getMissingParentStages(finalStage).size === 0)
|
Ah, so runningStages contains one stage and its ID is 1, right?
|
I modified the unit test. |
|
Test build #33559 has finished for PR 6352 at commit
|
|
Retest this please. |
|
Jenkins, retest this please |
|
Test build #33575 has finished for PR 6352 at commit
|
|
Can you actually change this to:
assert(scheduler.runningStages.head.isInstanceOf[ResultStage])?
And then add a comment saying something like "Make sure that the scheduler is running the final result stage. Because C is cached, the shuffle map stage to compute A does not need to be run."
|
(I think this is more intuitive; otherwise, it's hard for someone looking at this to understand why the ID should be 1. This also makes the test more agnostic to unrelated scheduler internals, like if we change the way we assign IDs to stages)
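A toy sketch of the difference between the two styles of assertion follows. The `Stage` hierarchy here is a simplified stand-in defined only for this example (the real scheduler classes carry much more state), and `onlyResultStageRunning` is a hypothetical helper.

```scala
object StageTypeCheck {
  // Simplified stand-ins for the scheduler's stage classes.
  sealed trait Stage { def id: Int }
  final case class ShuffleMapStage(id: Int) extends Stage
  final case class ResultStage(id: Int) extends Stage

  // Type-based check: true when exactly one stage is running and it is the result stage.
  // It does not depend on how stage IDs happen to be assigned.
  def onlyResultStageRunning(running: Set[Stage]): Boolean =
    running.size == 1 && running.head.isInstanceOf[ResultStage]
}
```

A check on `id == 1` would silently depend on the ID assignment order, while the type-based check states the intent directly: only the final result stage is running.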
|
Just a few more comments on improving the documentation and understandability of the test. @JoshRosen has recently pointed out that the scheduler code is extremely difficult to understand and check for correctness, and I think having easily understandable and well-documented tests is a step towards making the scheduler code more friendly. |
|
I made the changes you mentioned. |
|
Test build #33629 has finished for PR 6352 at commit
|
|
Retest this please. |
|
Test build #33631 has finished for PR 6352 at commit
|
|
Jenkins, retest this please. |
|
Test build #33637 has finished for PR 6352 at commit
|
|
LGTM |
|
Thanks @ueshin ! I merged this since @JoshRosen and @markhamstra LGTM'ed a while ago. |
|
@kayousterhout, Thank you for merging this! |
There are too many redundant calls to the `getCacheLocs` method in `DAGScheduler`, which involves Akka communication. To improve `DAGScheduler` performance, suppress the redundant calls.
In my application with over 1200 stages, the execution time dropped from 8.5 min to 3.8 min with my patch.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes apache#6352 from ueshin/issues/SPARK-7826 and squashes the following commits:
3d4d036 [Takuya UESHIN] Modify a test and the documentation.
10b1b22 [Takuya UESHIN] Simplify the unit test.
d858b59 [Takuya UESHIN] Move the storageLevel check inside the if (!cacheLocs.contains(rdd.id)) block.
6f3125c [Takuya UESHIN] Fix scalastyle.
b9c835c [Takuya UESHIN] Put the condition that checks if the RDD has uncached partition or not into variable for readability.
f87f2ec [Takuya UESHIN] Get cached locations from block manager only if the storage level of the RDD is not StorageLevel.NONE.
8248386 [Takuya UESHIN] Revert "Suppress extra calling getCacheLocs."
a4d944a [Takuya UESHIN] Add an unit test.
9a80fad [Takuya UESHIN] Suppress extra calling getCacheLocs.