
Conversation

@squito
Contributor

@squito squito commented May 20, 2016

What changes were proposed in this pull request?

Update of #8760 by @mwws. The current blacklist mechanism only considers one task at a time; this change expands it so that:

  1. When we determine an executor is bad, we blacklist all tasks from that executor, both within the current task set and in subsequent task sets.
  2. When many executors on a node appear to be bad, we blacklist the entire node (a rough sketch of this escalation follows below).
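
For illustration only, here is a minimal sketch of that escalation logic in Scala. The class, method, and threshold names (SimpleBlacklist, taskFailed, maxFailuresPerExec, maxBadExecutorsPerNode) are hypothetical and are not the API introduced by this patch:

import scala.collection.mutable

// Hypothetical sketch: count task failures per executor; blacklist an executor once it
// hits maxFailuresPerExec failures, and blacklist the whole node once
// maxBadExecutorsPerNode of its executors have been blacklisted.
class SimpleBlacklist(maxFailuresPerExec: Int, maxBadExecutorsPerNode: Int) {
  private val failuresByExec = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val execToNode = mutable.Map.empty[String, String]
  private val blacklistedExecs = mutable.Set.empty[String]
  private val blacklistedNodes = mutable.Set.empty[String]

  def taskFailed(execId: String, node: String): Unit = {
    execToNode(execId) = node
    failuresByExec(execId) += 1
    if (failuresByExec(execId) >= maxFailuresPerExec) {
      blacklistedExecs += execId
      // Escalate to the node level when enough of its executors look bad.
      val badExecsOnNode = blacklistedExecs.count(e => execToNode.get(e).contains(node))
      if (badExecsOnNode >= maxBadExecutorsPerNode) {
        blacklistedNodes += node
      }
    }
  }

  def isExecutorBlacklisted(execId: String): Boolean = blacklistedExecs.contains(execId)
  def isNodeBlacklisted(node: String): Boolean = blacklistedNodes.contains(node)
}

The actual change tracks more state than this (for example, per-stage node blacklists, as seen in the diff excerpt later in this conversation); the sketch only illustrates the executor-to-node escalation.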

How was this patch tested?

Unit tests via jenkins.
Also, I ran the additional tests proposed here, which include blacklist tests.

TODO:

  • performance tests
  • clear memory as task sets complete
  • a few more unit tests to write (see TODOs in code)
  • simplify some usage of the cache
  • add executor-level blacklisting
  • more internal comments (in particular on concurrency)
  • manual testing on a cluster

wei-mao-intel and others added 30 commits May 10, 2016 12:18
1. Create a new BlacklistTracker and BlacklistStrategy interface to support more
complex use cases for the blacklist mechanism.
2. Make the YARN allocator aware of node blacklist information.
3. Implement three strategies for convenience; users can also define their own strategy.
SingleTaskStrategy: retains the default behavior before this change.
AdvanceSingleTaskStrategy: enhances SingleTaskStrategy by supporting stage-level
node blacklisting.
ExecutorAndNodeStrategy: different task sets can share blacklist information.
1. Fix compile error after rebasing to the latest codebase.
2. Simplify configuration.
3. Fix typo.
4. Enhance comments and unit tests.
5. Remove unused import.
6. Remove ExecutorAndNode strategy.
…n in SingleCoreMockBackend when killTask is unsupported
…n in SingleCoreMockBackend when killTask is unsupported
@squito
Contributor Author

squito commented May 20, 2016

For the performance tests, I've collected data here: squito#5 (for lack of a better place). The brief summary: the advanced strategy is indeed much slower, but I don't know why yet. Actually, that turned out to be just an issue with the tests; it's as fast in some cases, and significantly faster when there is a bad node.

@SparkQA

SparkQA commented May 20, 2016

Test build #59031 has finished for PR 13234 at commit f6bb6de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Rather than wasting time checking the offer against each task, and then realizing the
// executor is blacklisted, just filter out the bad executor immediately.
val nodeBlacklist = taskSet.blacklistTracker.map{_.nodeBlacklistForStage(taskSet.stageId)}
.getOrElse(Set())
Contributor Author


Before this change, there is an O(n^2) cost (where n is the number of pending tasks) when you've got one bad executor. The tasks assigned to the bad executor fail, but then we get another resource offer for the bad executor. So we find another task for the bad executor, it fails, and we continue the process, going through all of the pending tasks. Each time we respond to the resource offer, we need to (a) iterate through the list of tasks to find one that is not blacklisted and (b) then remove it from the task list. Those are both O(1) operations when there isn't any blacklisting -- we just pop the last task off the stack. But as our bad executor makes its way through the tasks, it has to go deeper into the list each time, and both searching the list and then removing an element from it become expensive.

After we've gone through all of the tasks for bad executor once, then we will wait for there to be resource offers from good executors. However, even though we then start scheduling on the good executor, scheduling as a whole is still much slower, because we still have an O(n) cost at each call to resourceOffer. The offer still includes the (now idle) bad executor, and we have to iterate through the entire list of pending tasks to decide that nope, there aren't any tasks we can schedule on that node.

In my performance tests with a 3k-task job, this leads to about a 10x slowdown, though obviously this depends a lot on the number of tasks. And that is the really scary thing: it's not a function of how many bad nodes you have, but of how many tasks you are trying to run. So on a large cluster, where a bad node is more likely and lots of tasks are more likely, the slowdown will be much worse.

Note that as implemented in this version of the patch, this slowdown is only avoided when we blacklist the entire node. But we should add blacklisting for an executor as well, to avoid the slowdown in that case also.
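
To make the cost argument concrete, here is a rough sketch of the filtering idea, with hypothetical types and names (WorkerOffer, Task, scheduleOffers) rather than the actual TaskSetManager / TaskSchedulerImpl code. It drops offers from blacklisted executors and nodes before the per-task search, so an offer from a bad executor is rejected up front instead of forcing a scan of every pending task:

import scala.collection.mutable

// Hypothetical stand-ins for the scheduler's structures.
case class WorkerOffer(executorId: String, host: String)
case class Task(id: Int)

def scheduleOffers(
    offers: Seq[WorkerOffer],
    pendingTasks: mutable.ArrayBuffer[Task],
    nodeBlacklist: Set[String],
    execBlacklist: Set[String]): Seq[(String, Task)] = {
  // Drop offers from blacklisted nodes/executors up front: O(1) per offer, instead of
  // scanning all pending tasks only to find that none of them may run there.
  val usableOffers = offers.filterNot { o =>
    nodeBlacklist.contains(o.host) || execBlacklist.contains(o.executorId)
  }
  // For the remaining offers, just pop a task off the end of the pending list: O(1) per
  // offer when no per-task blacklist check is needed.
  usableOffers.flatMap { o =>
    if (pendingTasks.nonEmpty) Some(o.executorId -> pendingTasks.remove(pendingTasks.size - 1))
    else None
  }
}

With this kind of filter, an offer from a blacklisted executor or node costs O(1) instead of a full scan of the pending task list; the patch as written applies only the node-level part, which is why executor-level blacklisting is listed as a follow-up above.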

@SparkQA

SparkQA commented May 26, 2016

Test build #59344 has finished for PR 13234 at commit 8f2534b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented Jun 22, 2016

(closing till this is in a better state to avoid triggering tests)

@squito squito closed this Jun 22, 2016