[WIP] [SPARK-8426] Enhance Blacklist mechanism for fault-tolerance #13234
Conversation
1. Create a new BlacklistTracker and BlacklistStrategy interface to support more complex use cases for the blacklist mechanism.
2. Make the YARN allocator aware of node blacklist information.
3. Implement three strategies for convenience (users can also define their own):
   - SingleTaskStrategy: keeps the default behavior from before this change.
   - AdvanceSingleTaskStrategy: extends SingleTaskStrategy with stage-level node blacklisting.
   - ExecutorAndNodeStrategy: lets different task sets share blacklist information.
1. Fix compile errors after rebasing to the latest codebase. 2. Simplify configuration. 3. Fix typos. 4. Improve comments and unit tests. 5. Remove unused imports. 6. Remove the ExecutorAndNode strategy.
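The commits above describe a pluggable BlacklistStrategy with SingleTaskStrategy and AdvanceSingleTaskStrategy implementations, but do not show the interface itself. Below is a minimal Scala sketch of what such an interface could look like; the method names, the FailureInfo record, and the threshold parameter are assumptions for illustration, not the actual signatures from this patch.

```scala
// Hypothetical sketch only -- names and signatures are assumed, not taken from the patch.
case class FailureInfo(executorId: String, host: String, failureTimeMs: Long)

trait BlacklistStrategy {
  // Executors that should not be offered tasks for this stage.
  def executorBlacklist(failures: Seq[FailureInfo]): Set[String]
  // Hosts (nodes) that should not be offered tasks for this stage.
  def nodeBlacklist(failures: Seq[FailureInfo]): Set[String]
}

// Mirrors the pre-existing behavior: only the individual failed task avoids the
// executor (tracked elsewhere), so there is no stage-level executor or node blacklist.
class SingleTaskStrategy extends BlacklistStrategy {
  def executorBlacklist(failures: Seq[FailureInfo]): Set[String] = Set.empty
  def nodeBlacklist(failures: Seq[FailureInfo]): Set[String] = Set.empty
}

// Adds stage-level node blacklisting: once enough distinct executors on a host
// have seen failures in the stage, the whole host is excluded.
class AdvanceSingleTaskStrategy(maxFailedExecutorsPerNode: Int) extends BlacklistStrategy {
  def executorBlacklist(failures: Seq[FailureInfo]): Set[String] =
    failures.map(_.executorId).toSet
  def nodeBlacklist(failures: Seq[FailureInfo]): Set[String] =
    failures.groupBy(_.host)
      .filter { case (_, fs) => fs.map(_.executorId).toSet.size >= maxFailedExecutorsPerNode }
      .keySet
}
```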
…n in SingleCoreMockBackend when killTask is unsupported
For the performance tests, I've collected data here: squito#5 (for lack of a better place). A brief summary:
Test build #59031 has finished for PR 13234 at commit
// Rather than wasting time checking the offer against each task, and then realizing the
// executor is blacklisted, just filter out the bad executor immediately.
val nodeBlacklist = taskSet.blacklistTracker.map{_.nodeBlacklistForStage(taskSet.stageId)}
  .getOrElse(Set())
Before this change, there is an O(n^2) (where n is the number of pending tasks) cost when you've got one bad executor. The tasks assigned to the bad executor fail, but then we get another resource offer for the bad executor again. So we find another task for the bad executor, it fails, and we continue the process, going through all of the pending tasks. Each time we respond to the resource offer, we need to (a) iterate through the list of tasks to find one that is not blacklisted and (b) then remove it from the task list. Those are both O(1) operations when there isn't any blacklisting -- we just pop the last task off the stack. But as our bad executor makes its way through the tasks, it has to go deeper into the list each time, and both searching the list and then removing an element from it become expensive.
After we've gone through all of the tasks for the bad executor once, we wait for resource offers from good executors. However, even though we then start scheduling on the good executors, scheduling as a whole is still much slower, because we still have an O(n) cost at each call to resourceOffer. The offer still includes the (now idle) bad executor, and we have to iterate through the entire list of pending tasks to decide that nope, there aren't any tasks we can schedule on that node.
In my performance tests with a 3k task job, this leads to about a 10x slowdown, but obviously this depends a lot on the number of tasks. That is the really scary thing -- it's not a function of how many bad nodes you have, but of how many tasks you are trying to run. So on a large cluster, where a bad node is more likely and lots of tasks are more likely, the slowdown will be much worse.
Note that as implemented in this version of the patch, this slowdown is only avoided when we blacklist the entire node. But we should add blacklisting for an executor as well, to avoid the slowdown in that case also.
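To make the cost concrete, here is a toy model of a scheduling round -- not the PR's actual scheduler code; WorkerOffer, Task, and the data structures are simplified assumptions. Filtering blacklisted hosts out of the offers up front keeps each round cheap, instead of paying an O(number of pending tasks) scan for every offer from a bad host:

```scala
import scala.collection.mutable

// Toy stand-ins for the scheduler's offer and task types -- not Spark's real classes.
case class WorkerOffer(executorId: String, host: String)
case class Task(id: Int)

def scheduleRound(
    offers: Seq[WorkerOffer],
    pendingTasks: mutable.ArrayBuffer[Task],
    nodeBlacklist: Set[String]): Seq[(WorkerOffer, Task)] = {
  // Dropping blacklisted hosts costs O(#offers). Without this step, every offer
  // from a bad host triggers a full scan of pendingTasks that finds nothing to
  // schedule, which is where the O(n^2) behavior described above comes from.
  val usableOffers = offers.filterNot(o => nodeBlacklist.contains(o.host))
  usableOffers.flatMap { offer =>
    // With bad hosts already filtered out, we can just pop the last pending task.
    if (pendingTasks.nonEmpty) Some(offer -> pendingTasks.remove(pendingTasks.size - 1))
    else None
  }
}
```

This mirrors the intent of the quoted snippet above: compute the node blacklist once per offer round and discard bad executors before matching tasks, rather than rediscovering the blacklist task by task.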
Test build #59344 has finished for PR 13234 at commit
(closing until this is in a better state, to avoid triggering tests)
What changes were proposed in this pull request?
Update of #8760 by @mwws. The current blacklist mechanism only considers one task at a time -- this expands that by considering:
How was this patch tested?
Unit tests via jenkins.
I also ran the additional tests proposed here, which include blacklist tests.
TODO: