[WIP][SPARK-22148][CORE] TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled #19590
What changes were proposed in this pull request?
I've been working on this issue, and I would like to get your feedback on the following approach. The idea is that instead of failing in TaskSetManager.abortIfCompletelyBlacklisted when a task cannot be scheduled on any executor but dynamic allocation is enabled, we register the task with ExecutorAllocationManager. ExecutorAllocationManager then requests additional executors for these "unschedulable tasks" by increasing the value returned by ExecutorAllocationManager.maxNumExecutorsNeeded. This counts these tasks twice, but that makes sense: the current executors have no slot for these tasks, so we actually want new executors that are able to run them.

To avoid a deadlock due to tasks being unschedulable forever, we store the timestamp when a task was registered as unschedulable, and in ExecutorAllocationManager.schedule we abort the application if some task has been unschedulable for longer than a configurable age threshold. This gives dynamic allocation an opportunity to get more executors that are able to run the tasks, without making the application wait forever.
How was this patch tested?
This is WIP for discussion; unit tests will be provided later.
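
To make the proposal easier to discuss, here is a minimal, self-contained sketch of the bookkeeping it describes. It is not the patch itself: the class UnschedulableTaskTracker, its method names, and the parameters maxAgeMs and tasksPerExecutor are all hypothetical, and the executor count is assumed to be a simple ceiling division of task demand by slots per executor.

```scala
import scala.collection.mutable

// Hypothetical helper illustrating the proposal; not Spark's actual code.
// maxAgeMs stands in for the configurable age threshold, and
// tasksPerExecutor for the number of task slots on one executor.
final class UnschedulableTaskTracker(maxAgeMs: Long, tasksPerExecutor: Int) {
  require(tasksPerExecutor > 0, "tasksPerExecutor must be positive")

  // task id -> time (ms) at which the task was first registered as unschedulable
  private val registeredAt = mutable.Map.empty[Long, Long]

  // Record a task that no current executor can run; keeps the first timestamp.
  def register(taskId: Long, nowMs: Long): Unit = {
    registeredAt.getOrElseUpdate(taskId, nowMs)
    ()
  }

  // Forget a task once it has been scheduled or its task set has finished.
  def unregister(taskId: Long): Unit = {
    registeredAt.remove(taskId)
    ()
  }

  // Unschedulable tasks are added on top of the pending count, so they are
  // deliberately counted twice to force a request for extra executors.
  def maxNumExecutorsNeeded(pendingTasks: Int, runningTasks: Int): Int = {
    val demand = pendingTasks + runningTasks + registeredAt.size
    (demand + tasksPerExecutor - 1) / tasksPerExecutor // ceiling division
  }

  // True once any task has been unschedulable for at least maxAgeMs.
  def shouldAbort(nowMs: Long): Boolean =
    registeredAt.values.exists(t => nowMs - t >= maxAgeMs)
}
```

In this sketch, a periodic scheduling loop would call maxNumExecutorsNeeded to size its executor request and shouldAbort to decide whether to give up. Keeping the first registration timestamp means that re-registering the same task does not reset its age.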