@juanrh juanrh commented Oct 27, 2017

What changes were proposed in this pull request?

I've been working on this issue and would like your feedback on the following approach. Instead of failing in TaskSetManager.abortIfCompletelyBlacklisted when a task cannot be scheduled on any executor, if dynamic allocation is enabled we register the task with ExecutorAllocationManager. ExecutorAllocationManager then requests additional executors for these "unschedulable tasks" by increasing the value returned by ExecutorAllocationManager.maxNumExecutorsNeeded. This counts these tasks twice, which is intentional: the current executors have no usable slot for them, so we want new executors that are able to run them.

To avoid a deadlock when tasks stay unschedulable forever, we store the timestamp at which each task was registered as unschedulable, and in ExecutorAllocationManager.schedule we abort the application if any task has been unschedulable for longer than a configurable age threshold. This gives dynamic allocation a chance to acquire executors that can run the tasks without making the application wait indefinitely.
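To make the proposal easier to discuss, here is a minimal, self-contained sketch of the bookkeeping it describes. It is not the code in this patch: `UnschedulableTaskTracker`, `register`, `expired`, `maxAgeMs`, and the standalone `maxNumExecutorsNeeded` helper are hypothetical names, and the ceiling division over task slots is an assumption about how the executor target is derived.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the proposal; names are illustrative
// and are not taken from the actual patch or from Spark's API.
final class UnschedulableTaskTracker(maxAgeMs: Long) {
  // task id -> time (ms) at which the task was first registered as unschedulable
  private val registeredAt = mutable.HashMap.empty[Long, Long]

  // Record a task that no current executor can run.
  // The earliest timestamp is kept if the task is registered again.
  def register(taskId: Long, nowMs: Long): Unit = {
    registeredAt.getOrElseUpdate(taskId, nowMs)
    ()
  }

  // Forget a task once it has been scheduled or its task set has finished.
  def unregister(taskId: Long): Unit = {
    registeredAt.remove(taskId)
    ()
  }

  def count: Int = registeredAt.size

  // True if some task has been unschedulable for at least maxAgeMs,
  // which is the point at which the application would be aborted.
  def expired(nowMs: Long): Boolean =
    registeredAt.values.exists(t => nowMs - t >= maxAgeMs)
}

object UnschedulableTaskTracker {
  // Assumed shape of the executor target: ceiling of tasks over slots per
  // executor, with unschedulable tasks deliberately counted a second time.
  def maxNumExecutorsNeeded(
      pendingAndRunningTasks: Int,
      unschedulableTasks: Int,
      tasksPerExecutor: Int): Int = {
    val total = pendingAndRunningTasks + unschedulableTasks
    (total + tasksPerExecutor - 1) / tasksPerExecutor
  }
}
```

With 8 pending or running tasks and 4 slots per executor, the target is 2 executors. Registering one unschedulable task raises it to 3, which is what lets dynamic allocation ask for an executor outside the blacklisted set. A periodic check such as `expired(now)` then bounds how long the application waits for that executor.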

How was this patch tested?

This is a WIP for discussion; unit tests will be provided later.

@AmplabJenkins

Can one of the admins verify this patch?

@wangyum wangyum mentioned this pull request Nov 3, 2018
@asfgit asfgit closed this in 463a676 Nov 4, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#22859
Closes apache#22849
Closes apache#22591
Closes apache#22322
Closes apache#22312
Closes apache#19590

Closes apache#22934 from wangyum/CloseStalePRs.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>