
Conversation

@jinxing64 jinxing64 commented Apr 10, 2018

What changes were proposed in this pull request?

When SparkContext submits a map stage via submitMapStage to the DAGScheduler,
markMapStageJobAsFinished is called in only two places (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).

But consider the following scenario:

  1. stage0 and stage1 are both ShuffleMapStages, and stage1 depends on stage0;
  2. We submit stage1 via submitMapStage;
  3. While stage1 is running, a FetchFailed occurs, and stage0 and stage1 are resubmitted as stage0_1 and stage1_1;
  4. While stage0_1 is running, speculative tasks from the old stage1 come back as succeeded, but stage1 is no longer in runningStages. So even though all partitions of stage1 (including those finished by the speculative tasks) have succeeded, stage1's job listener is never called;
  5. stage0_1 finishes and stage1_1 starts running. submitMissingTasks finds no missing tasks, but in the current code the job listener is still not triggered.

We should call the job listener for the map stage in step 5, as in the sketch below.
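A sketch of where the fix lands (abbreviated from DAGScheduler.submitMissingTasks; only the no-missing-tasks branch is shown and surrounding details are elided, so treat it as illustrative rather than the exact merged diff):

// In DAGScheduler.submitMissingTasks (sketch; details elided):
if (tasks.nonEmpty) {
  // ... hand the TaskSet to the TaskScheduler, as before ...
} else {
  // No missing tasks: the stage's output is already complete, so mark it finished.
  markStageAsFinished(stage, None)
  stage match {
    case stage: ShuffleMapStage =>
      // The fix: also notify any map-stage jobs (submitted via submitMapStage)
      // that were waiting on this already-available shuffle stage.
      markMapStageJobsAsFinished(stage)
    case _: ResultStage => // nothing extra is needed for result stages
  }
  submitWaitingChildStages(stage)
}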

How was this patch tested?

Not added yet.

SparkQA commented Apr 10, 2018

Test build #89085 has finished for PR 21019 at commit 685124a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

SparkQA commented Apr 10, 2018

Test build #89097 has finished for PR 21019 at commit 685124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

SparkQA commented Apr 10, 2018

Test build #89117 has finished for PR 21019 at commit 685124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

SparkQA commented Apr 11, 2018

Test build #89165 has finished for PR 21019 at commit 685124a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jinxing64 commented Apr 11, 2018

@squito @vanzin @cloud-fan
What do you think of this change?

@cloud-fan (Contributor)

cc @jiangxb1987

squito commented Apr 11, 2018

I need to look more closely at the change, but your description of the problem makes sense. Can you also add a test case?

@jinxing64 (Author)

@squito
Thanks a lot. I will add a test.

SparkQA commented Apr 12, 2018

Test build #89264 has finished for PR 21019 at commit 9d369f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 jiangxb1987 left a comment

This seems to be a reasonable change, just some nits.


private[scheduler] def markMapStageJobsAsFinished(shuffleStage: ShuffleMapStage): Unit = {
  // Mark any map-stage jobs waiting on this stage as finished
  if (shuffleStage.isAvailable && shuffleStage.mapStageJobs.nonEmpty) {

Why do we need to double check that shuffleStage.isAvailable here?

This doesn't seem necessary, as it's already handled at the call sites ... but IMO it seems safer to include it, in case this gets invoked elsewhere in the future.
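For reference, the full helper under discussion looks roughly like this (the body is reconstructed from the patch; treat it as a sketch rather than the exact merged code):

private[scheduler] def markMapStageJobsAsFinished(shuffleStage: ShuffleMapStage): Unit = {
  // Mark any map-stage jobs waiting on this stage as finished,
  // but only once all of the stage's shuffle outputs are available.
  if (shuffleStage.isAvailable && shuffleStage.mapStageJobs.nonEmpty) {
    val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
    for (job <- shuffleStage.mapStageJobs) {
      markMapStageJobAsFinished(job, stats)
    }
  }
}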

Success,
makeMapStatus("hostD", rdd2.partitions.length)))
// stage1 listener still should not have a result, though there's no missing partitions
// in it. Because stage1 is not inside runningStages at this moment.

nit: Because stage1 has been failed and is not inside `runningStages` at this moment.

@squito squito left a comment

lgtm, just some very minor comments.



complete(taskSets(2), Seq(
  (Success, makeMapStatus("hostC", rdd2.partitions.length))))
assert(mapOutputTracker.getMapSizesByExecutorId(dep1.shuffleId, 0).map(_._1).toSet ===
  HashSet(makeBlockManagerId("hostC"), makeBlockManagerId("hostB")))

can just use Set here
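Applied, the assertion would read (same test code, with HashSet swapped for the default immutable Set):

assert(mapOutputTracker.getMapSizesByExecutorId(dep1.shuffleId, 0).map(_._1).toSet ===
  Set(makeBlockManagerId("hostC"), makeBlockManagerId("hostB")))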

// After stage0 is finished, stage1 will be submitted and found to have no missing
// partitions. Then the listener gets triggered.
assert(listener2.results.size === 1)
}

can you also add assertDataStructuresEmpty() please? I know it's not really related to your change, but it's nice to include this in all the tests.
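With that applied, the tail of the test would look roughly like this (a sketch; assertDataStructuresEmpty and listener2 are the names already used in this test suite):

// After stage0 is finished, stage1 is resubmitted and found to have no missing
// partitions, so the map-stage job listener is triggered.
assert(listener2.results.size === 1)
// Verify the scheduler cleaned up all bookkeeping for the finished job.
assertDataStructuresEmpty()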

@jinxing64 (Author)

Thanks for the comments, Imran and Xingbo.
I made some changes; please take another look when you have time.

SparkQA commented Apr 17, 2018

Test build #89426 has finished for PR 21019 at commit 42a9b2e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor)

LGTM

@jiangxb1987 (Contributor)

retest this please

SparkQA commented Apr 17, 2018

Test build #89437 has finished for PR 21019 at commit 42a9b2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

squito commented Apr 17, 2018

merged to master, thanks!

@asfgit asfgit closed this in 3990daa Apr 17, 2018
squito pushed a commit to squito/spark that referenced this pull request Apr 17, 2018

Author: jinxing <[email protected]>

Closes apache#21019 from jinxing64/SPARK-23948.

(cherry picked from commit 3990daa)
squito commented Apr 17, 2018

A few minutes after merging this I realized I should have also merged it to branch-2.3. I don't see a way to do that without another PR, oops. I opened one; it's a clean cherry-pick: #21085

vanzin commented Apr 17, 2018

I don't see a way to do that without another PR.

git cherry-pick -x -s && git push
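Spelled out, the manual backport flow would be roughly (a sketch; the branch and remote names here are assumptions, not from this thread):

git checkout branch-2.3
git cherry-pick -x -s 3990daa   # -x records "(cherry picked from commit ...)", -s adds a Signed-off-by line
git push apache branch-2.3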

squito commented Apr 17, 2018

I guess I rely entirely on the merge script, but in these simple cases I should just do the push directly ...

@jinxing64 (Author)

@squito @jiangxb1987
Thanks for merging.
