[SPARK-10796][CORE] Resubmit stage while lost task in Zombie and removed TaskSets Attempt #8927
Conversation
I will run a test job on the latest code to confirm whether the problem still exists...
Reproduced it, so re-opening this.
Test build #43057 has finished for PR 8927 at commit
jenkins retest this please
Test build #43059 has finished for PR 8927 at commit
Test build #43060 has finished for PR 8927 at commit
Force-pushed from ce83c9b to ff9ae61
jenkins retest this please
Test build #56344 has finished for PR 8927 at commit
Test build #56345 has finished for PR 8927 at commit
Force-pushed from fb478bb to 70af484
outputCommitCoordinator.stageEnd(stage.id)
listenerBus.post(SparkListenerStageCompleted(stage.latestInfo))
taskScheduler.zombieTasks(stage.id)
Once the stage has finished, it should make the previous TaskSets zombie.
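As a rough, self-contained illustration of this point (toy names, not the PR's actual code or Spark's classes): when a stage finishes, every task set still registered for it is flipped to zombie so it stops launching tasks.

```scala
import scala.collection.mutable

// Simplified stand-in for TaskSetManager: a zombie task set launches no new tasks.
final case class TaskSet(stageId: Int, attempt: Int, var isZombie: Boolean = false)

object ZombieOnStageFinishDemo extends App {
  // Every task set ever created for a stage, keyed by stage id.
  val taskSetsByStage = mutable.Map.empty[Int, List[TaskSet]].withDefaultValue(Nil)

  def register(ts: TaskSet): Unit = taskSetsByStage(ts.stageId) :+= ts

  // The behaviour discussed above: once the stage finishes, every remaining
  // task set for that stage becomes zombie and is never scheduled again.
  def stageFinished(stageId: Int): Unit = taskSetsByStage(stageId).foreach(_.isZombie = true)

  register(TaskSet(stageId = 1, attempt = 0))
  register(TaskSet(stageId = 1, attempt = 1))
  stageFinished(1)
  println(taskSetsByStage(1).map(_.isZombie)) // List(true, true)
}
```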
Test build #56691 has finished for PR 8927 at commit
Success,
makeMapStatus("hostA", reduceRdd.partitions.size)))
assert(shuffleStage.numAvailableOutputs === 2)
assert(mapOutputTracker.getMapSizesByExecutorId(shuffleId, 0).map(_._1).toSet ===
For a running stage, an executor loss no longer unregisters its outputLocs in this PR.
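To make the behavioural difference concrete (again a self-contained toy model with made-up names, not Spark's DAGScheduler code): when an executor is lost, only stages that are no longer running drop the outputs hosted on it, so a running stage keeps its registered outputs, which is what the unchanged assertion above checks.

```scala
import scala.collection.mutable

// Toy model of a stage's registered map outputs: partition id -> executor id.
final class StageOutputs(val stageId: Int, var running: Boolean) {
  val outputLocs: mutable.Map[Int, String] = mutable.Map.empty
}

object ExecutorLostDemo extends App {
  val stage = new StageOutputs(stageId = 0, running = true)
  stage.outputLocs ++= Seq(0 -> "exec-hostA", 1 -> "exec-hostB")

  // Behaviour sketched in the comment: outputs of running stages are kept;
  // only stages that are no longer running drop outputs hosted on the lost executor.
  def onExecutorLost(stages: Seq[StageOutputs], execId: String): Unit =
    stages.filterNot(_.running).foreach { s =>
      val lost = s.outputLocs.collect { case (p, e) if e == execId => p }
      s.outputLocs --= lost
    }

  onExecutorLost(Seq(stage), "exec-hostA")
  println(stage.outputLocs.size) // 2 -> the running stage kept both outputs
}
```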
Test build #56692 has finished for PR 8927 at commit
Test build #58235 has finished for PR 8927 at commit
Test build #58236 has finished for PR 8927 at commit
@suyanNone, I think the conflicts should at least be resolved so this is in a mergeable state. Would you be able to resolve them?
We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017. I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
We met this problem in Spark 1.3.0, and I have also reproduced it on the latest version.

desc:

A running ShuffleMapStage can have multiple TaskSets: one active TaskSet, several zombie TaskSets, and several already-removed TaskSets.

A running ShuffleMapStage is considered successful only when all of its partitions have been processed successfully, i.e. every task's MapStatus has been added to outputLocs.

The MapStatuses of a running ShuffleMapStage may have been produced by any of RemovedTaskSet1 / ZombieTaskSet1 / ZombieTaskSet2 / ... / ActiveTaskSetN, so it can happen that some outputs are held only by a removed or zombie TaskSet.

When an executor is lost, some of the MapStatuses that lived on it may therefore have been produced by a zombie TaskSet.

In the current logic, a lost MapStatus is recovered by having each TaskSet re-run the tasks that had succeeded on the lost executor: those tasks are re-added to the TaskSet's pendingTasks and their partitions are re-added to the Stage's pendingPartitions. This does not help when the lost MapStatus belongs only to a zombie or removed TaskSet, because a zombie TaskSet will never have its pendingTasks scheduled again.

The only condition that triggers a stage resubmit is a task throwing FetchFailedException. But the lost executor may hold no MapStatus of any parent stage of the currently running stages, while a running Stage happens to lose a MapStatus that belongs only to a zombie task or a removed TaskSet.

So once all zombie TaskSets have finished their running tasks and the active TaskSet has processed all of its pending tasks, they are removed by TaskSchedulerImpl, yet the running stage's pendingPartitions is still non-empty: the job hangs.

Test case to show the problem:
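The PR's actual test case is not included in this excerpt. As a stand-in, here is a minimal, self-contained toy model (all names made up) of the hang just described: an output that was held only by a zombie task set is lost with its executor, the zombie set is never scheduled again, no FetchFailedException is thrown, and the stage's pending partitions never drain.

```scala
import scala.collection.mutable

// Toy model of the hang: partition 1's map output was produced by a zombie
// task set and lived on the lost executor. Only the active task set can
// re-run its own tasks, so partition 1 is re-added to pendingPartitions but
// nothing will ever compute it again.
object HangScenarioDemo extends App {
  val partitionOwner = Map(0 -> "activeTaskSet", 1 -> "zombieTaskSet")
  val outputLocs     = mutable.Map(0 -> "execB", 1 -> "execA")

  val lostExec       = "execA"
  val lostPartitions = outputLocs.collect { case (p, e) if e == lostExec => p }.toSet
  outputLocs --= lostPartitions                         // the lost executor's outputs are gone

  val pendingPartitions = lostPartitions                // re-added as pending on the stage
  val recomputable      = lostPartitions.filter(p => partitionOwner(p) == "activeTaskSet")

  val stuck = pendingPartitions -- recomputable
  println(s"partitions that will never drain: $stuck")  // Set(1) -> the stage hangs
}
```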
main changes:

DAGScheduler now only receives task resubmit events from active TaskSets, so it can compare pendingPartitions with the ShuffleMapStage's missing outputs to detect partitions that cannot be computed by any of the current TaskSets, and decide whether the ShuffleMapStage needs to be resubmitted (see the sketch below).

other changes:
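As a hedged sketch of the main change described above (names and shape are mine, not lifted from the PR's diff): on a task resubmit event from an active task set, the scheduler can compare the stage's missing outputs with what the active task sets are still going to compute, and resubmit the whole ShuffleMapStage if some missing partition is covered by neither.

```scala
// Hedged sketch of the decision only; identifiers are illustrative.
object ResubmitDecision {
  /** Missing partitions that no active task set will ever compute again.
   *  If the set is non-empty, re-running tasks cannot fix the stage and the
   *  ShuffleMapStage itself has to be resubmitted. */
  def shouldResubmit(missingOutputs: Set[Int], activePendingPartitions: Set[Int]): Boolean =
    (missingOutputs -- activePendingPartitions).nonEmpty

  def main(args: Array[String]): Unit = {
    // Partition 1 lost its MapStatus, but only partition 0 is still pending
    // in an active task set -> the stage must be resubmitted.
    println(shouldResubmit(missingOutputs = Set(0, 1), activePendingPartitions = Set(0))) // true
  }
}
```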