[SPARK-23811][Core] FetchFailed comes before Success of same task will cause child stage never succeed #20930
```diff
@@ -82,11 +82,13 @@ private[spark] class TaskSetManager(
   val successful = new Array[Boolean](numTasks)
   private val numFailures = new Array[Int](numTasks)

-  // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
+  // Set the corresponding index of Boolean var when the task killed by other attempt tasks,
   // this happened while we set the `spark.speculation` to true. The task killed by others
   // should not resubmit while executor lost.
   private val killedByOtherAttempt: Array[Boolean] = new Array[Boolean](numTasks)

+  private val fetchFailedTaskIndexSet = new HashSet[Int]
+
   val taskAttempts = Array.fill[List[TaskInfo]](numTasks)(Nil)
   private[scheduler] var tasksSuccessful = 0
```

```diff
@@ -750,6 +752,10 @@ private[spark] class TaskSetManager(
       if (tasksSuccessful == numTasks) {
         isZombie = true
       }
+    } else if (fetchFailedTaskIndexSet.contains(index)) {
+      logInfo("Ignoring task-finished event for " + info.id + " in stage " + taskSet.id +
+        " because task " + index + " has already failed by FetchFailed")
+      return
     } else {
       logInfo("Ignoring task-finished event for " + info.id + " in stage " + taskSet.id +
         " because task " + index + " has already completed successfully")
```

```diff
@@ -793,6 +799,7 @@ private[spark] class TaskSetManager(
       blacklistTracker.foreach(_.updateBlacklistForFetchFailure(
         fetchFailed.bmAddress.host, fetchFailed.bmAddress.executorId))
     }
+    fetchFailedTaskIndexSet.add(index)
     None
```
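The guard introduced above can be sketched in isolation. The following is a minimal, self-contained Scala sketch of the idea, not the actual Spark code: `TaskTracker`, `handleFetchFailed`, and `handleSuccess` are simplified stand-ins for the `TaskSetManager` bookkeeping, kept only to show how a set of fetch-failed task indices suppresses a late success event.

```scala
import scala.collection.mutable.HashSet

// Hypothetical stand-in for the TaskSetManager bookkeeping in this PR.
// Names are simplified for illustration and are not Spark's real API.
class TaskTracker(numTasks: Int) {
  private val successful = new Array[Boolean](numTasks)
  private val fetchFailedTaskIndexSet = new HashSet[Int]
  var tasksSuccessful = 0

  // Record that a FetchFailed was reported for this task index.
  def handleFetchFailed(index: Int): Unit =
    fetchFailedTaskIndexSet.add(index)

  // Handle a task-finished event; returns true if the success was counted.
  def handleSuccess(index: Int): Boolean = {
    if (fetchFailedTaskIndexSet.contains(index)) {
      // A FetchFailed for this task already arrived: ignore the late success,
      // otherwise the parent stage could be considered complete while the
      // resubmitted child stage still lacks the regenerated map output.
      false
    } else if (!successful(index)) {
      successful(index) = true
      tasksSuccessful += 1
      true
    } else {
      false // duplicate success, e.g. from a speculative attempt
    }
  }
}

object Demo extends App {
  val t = new TaskTracker(2)
  t.handleFetchFailed(0)      // FetchFailed arrives first for task 0...
  println(t.handleSuccess(0)) // ...so its late success is ignored: false
  println(t.handleSuccess(1)) // a normal success is still counted: true
}
```

The set only grows for indices that actually hit a FetchFailed, so the common success path pays one `HashSet` lookup per finished task.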
Why are you making this change? I don't quite get it.
The addition of `fetchFailedTaskIndexSet` is to ignore the task-success event when the stage has already been marked as failed, following Wenchen's suggestion in an earlier review comment.
We should handle this case in DAGScheduler: there we can look up the stage by task id and see whether the stage has failed or not. Then we don't need `fetchFailedTaskIndexSet`.
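The alternative suggested above can be sketched as follows. This is a hypothetical illustration of the DAGScheduler-side approach, not actual Spark code: `StageStateTracker` and its methods are invented names showing how scheduler-level stage state makes a per-TaskSetManager set unnecessary.

```scala
import scala.collection.mutable

// Hypothetical sketch: the scheduler tracks which stages have failed and
// consults that state when a task-finished event arrives, instead of each
// TaskSetManager keeping its own set of fetch-failed task indices.
class StageStateTracker {
  private val failedStages = mutable.Set[Int]()

  // Called when a FetchFailed marks the stage as failed.
  def markStageFailed(stageId: Int): Unit = failedStages.add(stageId)

  // Called when the stage is resubmitted and becomes runnable again.
  def markStageResubmitted(stageId: Int): Unit = failedStages.remove(stageId)

  // A task-success event should only be processed if its stage (looked up
  // by the task's stage id) has not failed in the meantime.
  def shouldAcceptSuccess(stageId: Int): Boolean =
    !failedStages.contains(stageId)
}

object StageDemo extends App {
  val tracker = new StageStateTracker
  tracker.markStageFailed(3)
  println(tracker.shouldAcceptSuccess(3)) // false: stage 3 already failed
  println(tracker.shouldAcceptSuccess(4)) // true: stage 4 is unaffected
}
```

The design difference is where the state lives: centralizing it in the scheduler means one lookup by stage id answers the question for every task in that stage.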
Many thanks to you both for the guidance; that approach is clearer, and the UT added to reproduce this problem can also be used to verify the fix!