[SPARK-10796][CORE] Resubmit stage while lost task in Zombie and removed TaskSets Attempt #8927
Conversation
I will run a test job on the latest code to confirm whether the problem still exists...
Reproduced it, so re-opening this.
Test build #43057 has finished for PR 8927 at commit
jenkins retest this please
Test build #43059 has finished for PR 8927 at commit
Test build #43060 has finished for PR 8927 at commit
Force-pushed from ce83c9b to ff9ae61
jenkins retest this please
Test build #56344 has finished for PR 8927 at commit
Test build #56345 has finished for PR 8927 at commit
Force-pushed from fb478bb to 70af484
outputCommitCoordinator.stageEnd(stage.id)
listenerBus.post(SparkListenerStageCompleted(stage.latestInfo))
taskScheduler.zombieTasks(stage.id)
Once the stage has finished, it should make the previous TaskSets zombie.
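As a rough, self-contained illustration of this point (toy names, not the PR's actual code or Spark's classes): when a stage finishes, every task set still registered for it is flipped to zombie so it stops launching tasks.

```scala
import scala.collection.mutable

// Simplified stand-in for TaskSetManager: a zombie task set launches no new tasks.
final case class TaskSet(stageId: Int, attempt: Int, var isZombie: Boolean = false)

object ZombieOnStageFinishDemo extends App {
  // Every task set ever created for a stage, keyed by stage id.
  val taskSetsByStage = mutable.Map.empty[Int, List[TaskSet]].withDefaultValue(Nil)

  def register(ts: TaskSet): Unit = taskSetsByStage(ts.stageId) :+= ts

  // The behaviour discussed above: once the stage finishes, every remaining
  // task set for that stage becomes zombie and is never scheduled again.
  def stageFinished(stageId: Int): Unit = taskSetsByStage(stageId).foreach(_.isZombie = true)

  register(TaskSet(stageId = 1, attempt = 0))
  register(TaskSet(stageId = 1, attempt = 1))
  stageFinished(1)
  println(taskSetsByStage(1).map(_.isZombie)) // List(true, true)
}
```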
Test build #56691 has finished for PR 8927 at commit
Success,
makeMapStatus("hostA", reduceRdd.partitions.size)))
assert(shuffleStage.numAvailableOutputs === 2)
assert(mapOutputTracker.getMapSizesByExecutorId(shuffleId, 0).map(_._1).toSet ===
For a running stage, an executor loss no longer unregisters its outputLocs in this PR.
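To make the behavioural difference concrete (again a self-contained toy model with made-up names, not Spark's DAGScheduler code): when an executor is lost, only stages that are no longer running drop the outputs hosted on it, so a running stage keeps its registered outputs, which is what the unchanged assertion above checks.

```scala
import scala.collection.mutable

// Toy model of a stage's registered map outputs: partition id -> executor id.
final class StageOutputs(val stageId: Int, var running: Boolean) {
  val outputLocs: mutable.Map[Int, String] = mutable.Map.empty
}

object ExecutorLostDemo extends App {
  val stage = new StageOutputs(stageId = 0, running = true)
  stage.outputLocs ++= Seq(0 -> "exec-hostA", 1 -> "exec-hostB")

  // Behaviour sketched in the comment: outputs of running stages are kept;
  // only stages that are no longer running drop outputs hosted on the lost executor.
  def onExecutorLost(stages: Seq[StageOutputs], execId: String): Unit =
    stages.filterNot(_.running).foreach { s =>
      val lost = s.outputLocs.collect { case (p, e) if e == execId => p }
      s.outputLocs --= lost
    }

  onExecutorLost(Seq(stage), "exec-hostA")
  println(stage.outputLocs.size) // 2 -> the running stage kept both outputs
}
```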
Test build #56692 has finished for PR 8927 at commit
Test build #58235 has finished for PR 8927 at commit
Test build #58236 has finished for PR 8927 at commit
@suyanNone, I think the conflicts should at least be resolved so this is in a mergeable state. Would you be able to resolve them?
We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017. I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
We met this problem in Spark 1.3.0, and I have also reproduced it on the latest version.

desc:

A running ShuffleMapStage can have multiple TaskSets: one active TaskSet, several zombie TaskSets, and several already-removed TaskSets.

A running ShuffleMapStage is considered successful only when all of its partitions have been processed successfully, i.e. every task's MapStatus has been added to outputLocs.

The MapStatuses of a running ShuffleMapStage may have been produced by any of RemovedTaskSet1 / ZombieTaskSet1 / ZombieTaskSet2 / ... / ActiveTaskSetN, so it can happen that some outputs are held only by a removed or zombie TaskSet.

When an executor is lost, some of the MapStatuses that lived on it may therefore have been produced by a zombie TaskSet.

In the current logic, a lost MapStatus is recovered by having each TaskSet re-run the tasks that had succeeded on the lost executor: those tasks are re-added to the TaskSet's pendingTasks and their partitions are re-added to the Stage's pendingPartitions. This does not help when the lost MapStatus belongs only to a zombie or removed TaskSet, because a zombie TaskSet will never have its pendingTasks scheduled again.

The only condition that triggers a stage resubmit is a task throwing FetchFailedException. But the lost executor may hold no MapStatus of any parent stage of the currently running stages, while a running Stage happens to lose a MapStatus that belongs only to a zombie task or a removed TaskSet.

So once all zombie TaskSets have finished their running tasks and the active TaskSet has processed all of its pending tasks, they are removed by TaskSchedulerImpl, yet the running stage's pendingPartitions is still non-empty: the job hangs.

Test case to show the problem:
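The PR's actual test case is not included in this excerpt. As a stand-in, here is a minimal, self-contained toy model (all names made up) of the hang just described: an output that was held only by a zombie task set is lost with its executor, the zombie set is never scheduled again, no FetchFailedException is thrown, and the stage's pending partitions never drain.

```scala
import scala.collection.mutable

// Toy model of the hang: partition 1's map output was produced by a zombie
// task set and lived on the lost executor. Only the active task set can
// re-run its own tasks, so partition 1 is re-added to pendingPartitions but
// nothing will ever compute it again.
object HangScenarioDemo extends App {
  val partitionOwner = Map(0 -> "activeTaskSet", 1 -> "zombieTaskSet")
  val outputLocs     = mutable.Map(0 -> "execB", 1 -> "execA")

  val lostExec       = "execA"
  val lostPartitions = outputLocs.collect { case (p, e) if e == lostExec => p }.toSet
  outputLocs --= lostPartitions                         // the lost executor's outputs are gone

  val pendingPartitions = lostPartitions                // re-added as pending on the stage
  val recomputable      = lostPartitions.filter(p => partitionOwner(p) == "activeTaskSet")

  val stuck = pendingPartitions -- recomputable
  println(s"partitions that will never drain: $stuck")  // Set(1) -> the stage hangs
}
```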
main changes:

DAGScheduler now only receives task resubmit events from active TaskSets, so it can compare pendingPartitions with the ShuffleMapStage's missing outputs to detect partitions that cannot be computed by any of the current TaskSets, and decide whether the ShuffleMapStage needs to be resubmitted (see the sketch below).

other changes:
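As a hedged sketch of the main change described above (names and shape are mine, not lifted from the PR's diff): on a task resubmit event from an active task set, the scheduler can compare the stage's missing outputs with what the active task sets are still going to compute, and resubmit the whole ShuffleMapStage if some missing partition is covered by neither.

```scala
// Hedged sketch of the decision only; identifiers are illustrative.
object ResubmitDecision {
  /** Missing partitions that no active task set will ever compute again.
   *  If the set is non-empty, re-running tasks cannot fix the stage and the
   *  ShuffleMapStage itself has to be resubmitted. */
  def shouldResubmit(missingOutputs: Set[Int], activePendingPartitions: Set[Int]): Boolean =
    (missingOutputs -- activePendingPartitions).nonEmpty

  def main(args: Array[String]): Unit = {
    // Partition 1 lost its MapStatus, but only partition 0 is still pending
    // in an active task set -> the stage must be resubmitted.
    println(shouldResubmit(missingOutputs = Set(0, 1), activePendingPartitions = Set(0))) // true
  }
}
```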