
Conversation


@liutang123 liutang123 commented Aug 23, 2018

What changes were proposed in this pull request?

In the current DAGScheduler.handleTaskCompletion code, when a ShuffleMapStage is not in runningStages and its pendingPartitions is empty, the job of this ShuffleMapStage will never complete.

Consider the following scenario:

  1. Stage 0 runs and generates shuffle output data.

  2. Stage 1 reads the output from stage 0 and generates more shuffle data. It has two tasks for the same partition: ShuffleMapTask0 and ShuffleMapTask0.1 (a speculative copy).

  3. ShuffleMapTask0 fails to fetch blocks and sends a FetchFailed to the driver. The driver resubmits stage 0 and stage 1, placing stage 0 in runningStages and stage 1 in waitingStages.

  4. ShuffleMapTask0.1 finishes successfully and sends Success back to the driver. The driver adds the map status to the set of output locations of stage 1. But because stage 1 is not in runningStages, the job is not marked complete.

  5. Stage 0 completes and the driver resubmits stage 1. But because the output locations of stage 1 are already complete, the driver submits no tasks and marks stage 1 complete right away. Since job completion relies on a CompletionEvent, and no CompletionEvent will ever arrive, the job hangs.
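The race above can be sketched with a toy model (illustrative Python only, not Spark's actual DAGScheduler; `Stage`, `running_stages`, and friends are simplified stand-ins for the real Scala data structures):

```python
class Stage:
    """Toy stand-in for a ShuffleMapStage (illustrative only)."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.num_partitions = num_partitions
        self.pending_partitions = set(range(num_partitions))
        self.output_locs = {}  # partition index -> map status

running_stages, waiting_stages = set(), set()
job_completed = False

stage1 = Stage("stage 1", num_partitions=1)

# Step 3: ShuffleMapTask0 hits FetchFailed; the driver resubmits
# stage 0 (running) and moves stage 1 to waiting.
waiting_stages.add(stage1)

# Step 4: the speculative ShuffleMapTask0.1 succeeds; the driver
# records its map output, but it only notifies job listeners when
# the stage is in running_stages -- so nothing happens here.
stage1.output_locs[0] = "mapstatus-0"
stage1.pending_partitions.discard(0)
if stage1 in running_stages and not stage1.pending_partitions:
    job_completed = True  # never reached in this scenario

# Step 5: stage 0 finishes and stage 1 is resubmitted, but every
# partition already has output, so no tasks are launched and no
# further CompletionEvent arrives: the job hangs.
missing = [p for p in range(stage1.num_partitions)
           if p not in stage1.output_locs]
print(missing, job_completed)  # [] False
```

The key point the model shows: the completion check is gated on membership in `running_stages`, so a success that arrives while the stage is in `waiting_stages` is recorded but never triggers job completion.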

How was this patch tested?

UT


Ngone51 commented Aug 23, 2018

Since stage 1 is only a ShuffleMapStage, why are there no other child stages to be submitted?

assertDataStructuresEmpty()
}

test("Trigger mapstage's job listener in submitMissingTasks") {

Could you give some explanation for deleting this test?


@liutang123 liutang123 Aug 24, 2018


Because that PR conflicts with this PR.
In that PR, a ShuffleMapStage waits for the completion of its parent stages' rerun.
In this PR, a ShuffleMapStage completes immediately once all of its partitions are ready.
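The difference between the two policies can be contrasted in a small sketch (hedged, illustrative Python; `should_complete`, the policy names, and both flags are invented stand-ins, not anything from Spark's code):

```python
def should_complete(outputs_ready, parents_rerun_done, policy):
    """Toy decision rule: may a ShuffleMapStage be marked finished?

    policy "wait-for-parents" models the other PR's behavior;
    policy "immediate" models this PR's behavior.
    """
    if policy == "wait-for-parents":
        return outputs_ready and parents_rerun_done
    if policy == "immediate":
        return outputs_ready
    raise ValueError(f"unknown policy: {policy}")

# All partitions have output, but the parents' rerun is still running:
wait_result = should_complete(True, False, "wait-for-parents")
imm_result = should_complete(True, False, "immediate")
print(wait_result, imm_result)  # False True
```

This is why the deleted test, which asserted the waiting behavior, can no longer pass under this PR's immediate-completion rule.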

@liutang123

@Ngone51 Because some ShuffleMapStages have map-stage jobs (JobWaiter) attached via SparkContext.submitMapStage, they can be the final stage of a job even without child stages.
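A toy model of that point (illustrative Python only; `ShuffleMapStageModel`, `map_stage_jobs`, and `on_all_outputs_ready` are invented stand-ins for the JobWaiter bookkeeping, not Spark's API):

```python
class ShuffleMapStageModel:
    """Toy map stage that tracks attached map-stage jobs."""
    def __init__(self):
        self.map_stage_jobs = []  # JobWaiter-like listeners
        self.child_stages = []    # may legitimately be empty

    def on_all_outputs_ready(self):
        # Even with no child stages, any attached map-stage jobs
        # must be marked finished, or their waiters block forever.
        return list(self.map_stage_jobs)

stage = ShuffleMapStageModel()
stage.map_stage_jobs.append("job-waiter-0")
print(stage.on_all_outputs_ready())  # ['job-waiter-0']
```

So the absence of child stages does not mean nobody is waiting on the stage: the map-stage job's waiter is.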

@liutang123

@jinxing64 Do you have any idea?

@jinxing64

Thanks for the ping~
It seems that ShuffleMapTask0.1 is a speculative task; please update the description.
The change seems fine to me. But given #21019, the issue in the description is already solved, so I think this change is a refinement of #21019. Fine with me, but we should always be careful when touching such core logic.

@AmplabJenkins

Can one of the admins verify this patch?


github-actions bot commented Jan 8, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 8, 2020
@github-actions github-actions bot closed this Jan 9, 2020