[SPARK-25211][Core] speculation and fetch failed result in hang of job #22202
Conversation
Diff context for the review comment below (the test being removed):

```scala
    assertDataStructuresEmpty()
  }

  test("Trigger mapstage's job listener in submitMissingTasks") {
```
Could you explain why this test was deleted?
Because that PR conflicts with this PR.
In that PR, the shuffleMapStage waits for the completion of its parent stages' rerun.
In this PR, the shuffleMapStage completes immediately once all of its partitions are ready.
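A tiny sketch of the difference being discussed may help. This is not DAGScheduler code; `CompletionRuleSketch`, `MapStageState`, and both predicates are invented purely to illustrate the two completion rules.

```scala
// Hypothetical model, not Spark code: it only illustrates the two completion rules.
object CompletionRuleSketch {
  case class MapStageState(
      numPartitions: Int,          // total partitions of the map stage
      partitionsWithOutput: Int,   // partitions that already have map output
      parentRerunPending: Boolean) // a parent stage is being rerun after a FetchFailed

  // Rule assumed by the deleted test: the map stage only finishes once the
  // parents' rerun has also completed.
  def finishedBefore(s: MapStageState): Boolean =
    s.partitionsWithOutput == s.numPartitions && !s.parentRerunPending

  // Rule after this PR: the map stage finishes as soon as every partition has output.
  def finishedAfter(s: MapStageState): Boolean =
    s.partitionsWithOutput == s.numPartitions
}
```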
@Ngone51 Because some shuffleMapStage has mapStageJobs (JobWaiter) by …
@jinxing64 Do you have any idea?
Thanks for the ping~
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
In the current `DAGScheduler.handleTaskCompletion` code, when a shuffleMapStage with an active job is not in runningStages and its `pendingPartitions` set is empty, the job of this shuffleMapStage will never complete. Consider the scenario below:
1. Stage 0 runs and generates shuffle output data.
2. Stage 1 reads the output of stage 0 and generates more shuffle data. It has two tasks for the same partition: ShuffleMapTask0 and ShuffleMapTask0.1 (a speculative copy).
3. ShuffleMapTask0 fails to fetch blocks and sends a FetchFailed to the driver. The driver resubmits stage 0 and stage 1, placing stage 0 in runningStages and stage 1 in waitingStages.
4. ShuffleMapTask0.1 finishes successfully and sends Success back to the driver. The driver adds its map status to the output locations of stage 1, but because stage 1 is not in runningStages, the job is not completed.
5. Stage 0 completes and the driver submits stage 1. But because the output set of stage 1 is already complete, the driver submits no tasks and marks stage 1 as finished right away. Since job completion relies on a `CompletionEvent` and no `CompletionEvent` will ever arrive, the job hangs.
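To make the sequence above concrete, here is a minimal, self-contained model of the race. None of the names below (`HangSketch`, `onFetchFailed`, `onTaskSuccess`, `onResubmit`) exist in Spark; the sketch only shows why no `CompletionEvent` ever arrives for stage 1.

```scala
// Minimal, self-contained sketch of the hang described above (not Spark code).
// It models only a running-stage set, pending partitions, and a job that is
// signalled exclusively from a task CompletionEvent.
import scala.collection.mutable

object HangSketch {
  val runningStages = mutable.Set[Int]()
  val pendingParts  = mutable.Map[Int, mutable.Set[Int]]()
  var jobCompleted  = false

  // Step 3: FetchFailed -- stage 1 leaves runningStages and waits for stage 0.
  def onFetchFailed(stage: Int): Unit = runningStages -= stage

  // Step 4: the speculative ShuffleMapTask0.1 succeeds while stage 1 is NOT running,
  // so its map output is registered but the job is not signalled.
  def onTaskSuccess(stage: Int, partition: Int): Unit = {
    pendingParts(stage) -= partition
    if (runningStages.contains(stage) && pendingParts(stage).isEmpty) {
      jobCompleted = true // only reachable for a running stage
    }
  }

  // Step 5: when stage 1 is resubmitted, all partitions already have output,
  // so no tasks run and no further CompletionEvent ever arrives => the job hangs.
  def onResubmit(stage: Int): Unit = {
    runningStages += stage
    if (pendingParts(stage).isEmpty) {
      // stage is marked finished, but jobCompleted is never set here (the bug)
    }
  }

  def main(args: Array[String]): Unit = {
    pendingParts(1) = mutable.Set(0)
    runningStages += 1
    onFetchFailed(1)   // ShuffleMapTask0 fails to fetch
    onTaskSuccess(1, 0) // speculative copy succeeds while stage 1 is not running
    onResubmit(1)       // stage 0 re-ran; stage 1 resubmitted with no work left
    println(s"jobCompleted = $jobCompleted") // prints false: nothing will ever flip it
  }
}
```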
How was this patch tested?

UT