
Conversation

@jinxing64 jinxing64 commented Apr 10, 2018

What changes were proposed in this pull request?

When SparkContext submits a map stage via submitMapStage to the DAGScheduler,
markMapStageJobAsFinished is called in only two places (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).

But consider the following scenario:

  1. stage0 and stage1 are both ShuffleMapStages, and stage1 depends on stage0;
  2. We submit stage1 via submitMapStage;
  3. While stage1 is running, a FetchFailed occurs, and stage0 and stage1 are resubmitted as stage0_1 and stage1_1;
  4. While stage0_1 is running, speculative tasks from the old stage1 come back as succeeded, but stage1 is no longer in runningStages. So even though all partitions of stage1 (including those finished by the speculative tasks) have succeeded, stage1's job listener is never called;
  5. stage0_1 finishes and stage1_1 starts running. submitMissingTasks finds no missing tasks, but in the current code the job listener is still not triggered.

We should call the job listener for the map stage in step 5, as in the sketch below.
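A sketch of where the fix lands (abbreviated from DAGScheduler.submitMissingTasks; only the no-missing-tasks branch is shown and surrounding details are elided, so treat it as illustrative rather than the exact merged diff):

// In DAGScheduler.submitMissingTasks (sketch; details elided):
if (tasks.nonEmpty) {
  // ... hand the TaskSet to the TaskScheduler, as before ...
} else {
  // No missing tasks: the stage's output is already complete, so mark it finished.
  markStageAsFinished(stage, None)
  stage match {
    case stage: ShuffleMapStage =>
      // The fix: also notify any map-stage jobs (submitted via submitMapStage)
      // that were waiting on this already-available shuffle stage.
      markMapStageJobsAsFinished(stage)
    case _: ResultStage => // nothing extra is needed for result stages
  }
  submitWaitingChildStages(stage)
}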

How was this patch tested?

Not added yet.

SparkQA commented Apr 10, 2018

Test build #89085 has finished for PR 21019 at commit 685124a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

SparkQA commented Apr 10, 2018

Test build #89097 has finished for PR 21019 at commit 685124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

SparkQA commented Apr 10, 2018

Test build #89117 has finished for PR 21019 at commit 685124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

SparkQA commented Apr 11, 2018

Test build #89165 has finished for PR 21019 at commit 685124a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jinxing64 commented Apr 11, 2018

@squito @vanzin @cloud-fan
What do you think of this change?

@cloud-fan (Contributor)

cc @jiangxb1987

squito commented Apr 11, 2018

I need to look more closely at the change, but your description of the problem makes sense. Can you also add a test case?

@jinxing64 (Author)

@squito
Thanks a lot. I will add a test.

SparkQA commented Apr 12, 2018

Test build #89264 has finished for PR 21019 at commit 9d369f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 jiangxb1987 left a comment

This seems to be a reasonable change, just some nits.


private[scheduler] def markMapStageJobsAsFinished(shuffleStage: ShuffleMapStage): Unit = {
  // Mark any map-stage jobs waiting on this stage as finished
  if (shuffleStage.isAvailable && shuffleStage.mapStageJobs.nonEmpty) {

Why do we need to double check that shuffleStage.isAvailable here?

This doesn't seem necessary, as it's already handled at the call sites ... but IMO it seems safer to include it, in case this gets invoked elsewhere in the future.
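For reference, the full helper under discussion looks roughly like this (the body is reconstructed from the patch; treat it as a sketch rather than the exact merged code):

private[scheduler] def markMapStageJobsAsFinished(shuffleStage: ShuffleMapStage): Unit = {
  // Mark any map-stage jobs waiting on this stage as finished,
  // but only once all of the stage's shuffle outputs are available.
  if (shuffleStage.isAvailable && shuffleStage.mapStageJobs.nonEmpty) {
    val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
    for (job <- shuffleStage.mapStageJobs) {
      markMapStageJobAsFinished(job, stats)
    }
  }
}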

Success,
makeMapStatus("hostD", rdd2.partitions.length)))
// stage1 listener still should not have a result, though there's no missing partitions
// in it. Because stage1 is not inside runningStages at this moment.

nit: Because stage1 has been failed and is not inside `runningStages` at this moment.

@squito squito left a comment

lgtm, just some very minor comments.



complete(taskSets(2), Seq(
  (Success, makeMapStatus("hostC", rdd2.partitions.length))))
assert(mapOutputTracker.getMapSizesByExecutorId(dep1.shuffleId, 0).map(_._1).toSet ===
  HashSet(makeBlockManagerId("hostC"), makeBlockManagerId("hostB")))

can just use Set here
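Applied, the assertion would read (same test code, with HashSet swapped for the default immutable Set):

assert(mapOutputTracker.getMapSizesByExecutorId(dep1.shuffleId, 0).map(_._1).toSet ===
  Set(makeBlockManagerId("hostC"), makeBlockManagerId("hostB")))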

// After stage0 is finished, stage1 will be submitted and found to have no missing
// partitions. Then the listener gets triggered.
assert(listener2.results.size === 1)
}

can you also add assertDataStructuresEmpty() please? I know it's not really related to your change, but it's nice to include this in all the tests.
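With that applied, the tail of the test would look roughly like this (a sketch; assertDataStructuresEmpty and listener2 are the names already used in this test suite):

// After stage0 is finished, stage1 is resubmitted and found to have no missing
// partitions, so the map-stage job listener is triggered.
assert(listener2.results.size === 1)
// Verify the scheduler cleaned up all bookkeeping for the finished job.
assertDataStructuresEmpty()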

@jinxing64 (Author)

Thanks for the comments, Imran and Xingbo.
I made some changes; please take another look when you have time.

SparkQA commented Apr 17, 2018

Test build #89426 has finished for PR 21019 at commit 42a9b2e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor)

LGTM

@jiangxb1987 (Contributor)

retest this please

SparkQA commented Apr 17, 2018

Test build #89437 has finished for PR 21019 at commit 42a9b2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

squito commented Apr 17, 2018

merged to master, thanks!

@asfgit asfgit closed this in 3990daa Apr 17, 2018
squito pushed a commit to squito/spark that referenced this pull request Apr 17, 2018

Author: jinxing <[email protected]>

Closes apache#21019 from jinxing64/SPARK-23948.

(cherry picked from commit 3990daa)
squito commented Apr 17, 2018

A few minutes after merging this I realized I should have also merged it to branch-2.3. I don't see a way to do that without another PR, oops. I opened one; it's a clean cherry-pick: #21085

vanzin commented Apr 17, 2018

I don't see a way to do that without another PR.

git cherry-pick -x -s && git push
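Spelled out, the manual backport flow would be roughly (a sketch; the branch and remote names here are assumptions, not from this thread):

git checkout branch-2.3
git cherry-pick -x -s 3990daa   # -x records "(cherry picked from commit ...)", -s adds a Signed-off-by line
git push apache branch-2.3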

squito commented Apr 17, 2018

I guess I rely entirely on the merge script, but in these simple cases I should just do the push directly ...

@jinxing64 (Author)

@squito @jiangxb1987
Thanks for merging.
