[SPARK-23811][Core] FetchFailed comes before Success of same task will cause child stage never succeed #20930
Changes from 5 commits
@@ -2399,6 +2399,50 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi

```scala
  }
}

/**
 * This tests the case where the original task succeeds after a speculative
 * attempt of the same task got a FetchFailed earlier.
 */
test("[SPARK-23811] FetchFailed comes before Success of same task will cause child stage" +
  " never succeed") {
  // Create 3 RDDs with shuffle dependencies on each other: rddA <--- rddB <--- rddC
  val rddA = new MyRDD(sc, 2, Nil)
  val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
  val shuffleIdA = shuffleDepA.shuffleId

  val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
  val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))

  val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)

  submit(rddC, Array(0, 1))

  // Complete both tasks in rddA.
  assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostA", 2)),
    (Success, makeMapStatus("hostB", 2))))

  // The first task succeeds.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(0), Success, makeMapStatus("hostB", 2)))

  // The second task's speculative attempt fails first, while the task itself
  // is still running. This can be caused by ExecutorLost.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1),
```
Contributor:
Sorry, I'm not very familiar with this test suite; how can you tell it's a speculative task?

Member (Author):
Here we only need to mock the speculative task's failed event arriving before the success event.

Member:
Maybe, you can
```scala
    FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0, "ignored"),
    null))
  // Check the currently missing partition.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 1)
  // The second task itself succeeds soon after.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1), Success, makeMapStatus("hostB", 2)))
  // The number of missing partitions should not change; otherwise the child
  // stage will never succeed.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 1)
}

/**
 * Assert that the supplied TaskSet has exactly the given hosts as its preferred locations.
 * Note that this checks only the host and not the executor ID.
```
Why do we only have a problem with shuffle map tasks, not result tasks?
This also confused me before. As far as I'm concerned, a result task in such a scenario (speculative attempt fails but the original task succeeds) is fine, because it has no child stage: we can use the successful task's result and markStageAsFinished. But for a shuffle map task it causes an inconsistency between the mapOutputTracker and the stage's pendingPartitions, which must be fixed. I'm not sure of ResultTask's behavior; can you give some advice?
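The inconsistency discussed above can be illustrated with a small, self-contained sketch (this is NOT Spark's actual DAGScheduler or MapOutputTracker; `Tracker`, `Completion`, and all method names here are illustrative assumptions): completion events carry the epoch of the attempt that produced them, and a Success from an epoch older than the last fetch failure is ignored, so a missing map output is never masked by a stale success.

```scala
// Minimal sketch of epoch-guarded completion bookkeeping (illustrative
// names only, not Spark's real classes).
object EpochGuardedTrackerSketch {
  // A task-completion event, tagged with the epoch of its attempt.
  final case class Completion(partition: Int, epoch: Long)

  final class Tracker(numPartitions: Int) {
    private var currentEpoch: Long = 0L
    private val available = scala.collection.mutable.Set.empty[Int]

    def epoch: Long = currentEpoch

    // A FetchFailed invalidates the lost output and bumps the epoch, so
    // stale completion events can be told apart from fresh ones.
    def onFetchFailed(lostPartition: Int): Unit = {
      available -= lostPartition
      currentEpoch += 1
    }

    // Only count a Success produced at the current epoch; a late success
    // from a pre-failure attempt would otherwise hide a missing partition,
    // and a child stage would see inconsistent map output.
    def onSuccess(c: Completion): Unit = {
      if (c.epoch == currentEpoch) available += c.partition
    }

    def missingPartitions: Set[Int] =
      (0 until numPartitions).toSet.diff(available.toSet)
  }
}
```

This mirrors the test's assertions: after the FetchFailed, one partition is missing; a late Success from the pre-failure attempt leaves the missing count unchanged, and only a fresh retry can clear it.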
Sorry, I may be nitpicking here. Can you simulate what happens to a result task if FetchFailed comes before the task's success?
It seems we may mistakenly mark a job as finished?
No, that's necessary. I'll have to make sure about this, thanks for your advice! :)
Sure, but it may be hard to reproduce this in a real environment; I'll try to fake it in a UT first, ASAP.
Added the UT simulating this scenario happening to a result task.