
Conversation

@squito
Contributor

@squito squito commented May 20, 2016

What changes were proposed in this pull request?

Update of #8760 by @mwws. The current blacklist mechanism only considers one task at a time; this change expands it so that:

  1. When we determine an executor is bad, we blacklist all tasks from that executor, both within the current task set and in subsequent task sets.
  2. When many executors on a node appear to be bad, we blacklist the entire node (a rough sketch of this escalation follows below).
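
For illustration only, here is a minimal sketch of that escalation logic in Scala. The class, method, and threshold names (SimpleBlacklist, taskFailed, maxFailuresPerExec, maxBadExecutorsPerNode) are hypothetical and are not the API introduced by this patch:

import scala.collection.mutable

// Hypothetical sketch: count task failures per executor; blacklist an executor once it
// hits maxFailuresPerExec failures, and blacklist the whole node once
// maxBadExecutorsPerNode of its executors have been blacklisted.
class SimpleBlacklist(maxFailuresPerExec: Int, maxBadExecutorsPerNode: Int) {
  private val failuresByExec = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val execToNode = mutable.Map.empty[String, String]
  private val blacklistedExecs = mutable.Set.empty[String]
  private val blacklistedNodes = mutable.Set.empty[String]

  def taskFailed(execId: String, node: String): Unit = {
    execToNode(execId) = node
    failuresByExec(execId) += 1
    if (failuresByExec(execId) >= maxFailuresPerExec) {
      blacklistedExecs += execId
      // Escalate to the node level when enough of its executors look bad.
      val badExecsOnNode = blacklistedExecs.count(e => execToNode.get(e).contains(node))
      if (badExecsOnNode >= maxBadExecutorsPerNode) {
        blacklistedNodes += node
      }
    }
  }

  def isExecutorBlacklisted(execId: String): Boolean = blacklistedExecs.contains(execId)
  def isNodeBlacklisted(node: String): Boolean = blacklistedNodes.contains(node)
}

The actual change tracks more state than this (for example, per-stage node blacklists, as seen in the diff excerpt later in this conversation); the sketch only illustrates the executor-to-node escalation.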

How was this patch tested?

Unit tests via jenkins.
Also, I ran the additional tests proposed here, which include blacklist tests.

TODO:

  • performance tests
  • clear memory as task sets complete
  • a few more unit tests to write (see TODOs in code)
  • simplify some usage of the cache
  • add executor-level blacklisting
  • more internal comments (in particular on concurrency)
  • manual testing on a cluster

wei-mao-intel and others added 30 commits May 10, 2016 12:18
1. Create a new BlacklistTracker and BlacklistStrategy interface to support more
complex use cases for the blacklist mechanism.
2. Make the YARN allocator aware of node blacklist information.
3. Implement three strategies for convenience; users can also define their own strategy.
SingleTaskStrategy: retains the default behavior before this change.
AdvanceSingleTaskStrategy: enhances SingleTaskStrategy by supporting stage-level
node blacklisting.
ExecutorAndNodeStrategy: different task sets can share blacklist information.
1. Fix compile error after rebasing to the latest codebase.
2. Simplify configuration.
3. Fix typo.
4. Enhance comments and unit tests.
5. Remove unused import.
6. Remove ExecutorAndNode strategy.
…n in SingleCoreMockBackend when killTask is unsupported
…n in SingleCoreMockBackend when killTask is unsupported
@squito
Contributor Author

squito commented May 20, 2016

For the performance tests, I've collected data here: squito#5 (for lack of a better place). The brief summary: the advanced strategy is indeed much slower, but I don't know why yet. Actually, that turned out to be just an issue with the tests; it's as fast in some cases, and significantly faster when there is a bad node.

@SparkQA

SparkQA commented May 20, 2016

Test build #59031 has finished for PR 13234 at commit f6bb6de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Rather than wasting time checking the offer against each task, and then realizing the
// executor is blacklisted, just filter out the bad executor immediately.
val nodeBlacklist = taskSet.blacklistTracker.map{_.nodeBlacklistForStage(taskSet.stageId)}
.getOrElse(Set())
Contributor Author


Before this change, there is an O(n^2) cost (where n is the number of pending tasks) when you've got one bad executor. The tasks assigned to the bad executor fail, but then we get another resource offer for the bad executor. So we find another task for the bad executor, it fails, and we continue the process, going through all of the pending tasks. Each time we respond to the resource offer, we need to (a) iterate through the list of tasks to find one that is not blacklisted and (b) then remove it from the task list. Those are both O(1) operations when there isn't any blacklisting -- we just pop the last task off the stack. But as our bad executor makes its way through the tasks, it has to go deeper into the list each time, and both searching the list and then removing an element from it become expensive.

After we've gone through all of the tasks for bad executor once, then we will wait for there to be resource offers from good executors. However, even though we then start scheduling on the good executor, scheduling as a whole is still much slower, because we still have an O(n) cost at each call to resourceOffer. The offer still includes the (now idle) bad executor, and we have to iterate through the entire list of pending tasks to decide that nope, there aren't any tasks we can schedule on that node.

In my performance tests with a 3k-task job, this leads to about a 10x slowdown, though obviously this depends a lot on the number of tasks. And that is the really scary thing: it's not a function of how many bad nodes you have, but of how many tasks you are trying to run. So on a large cluster, where a bad node is more likely and lots of tasks are more likely, the slowdown will be much worse.

Note that as implemented in this version of the patch, this slowdown is only avoided when we blacklist the entire node. But we should add blacklisting for an executor as well, to avoid the slowdown in that case also.
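
To make the cost argument concrete, here is a rough sketch of the filtering idea, with hypothetical types and names (WorkerOffer, Task, scheduleOffers) rather than the actual TaskSetManager / TaskSchedulerImpl code. It drops offers from blacklisted executors and nodes before the per-task search, so an offer from a bad executor is rejected up front instead of forcing a scan of every pending task:

import scala.collection.mutable

// Hypothetical stand-ins for the scheduler's structures.
case class WorkerOffer(executorId: String, host: String)
case class Task(id: Int)

def scheduleOffers(
    offers: Seq[WorkerOffer],
    pendingTasks: mutable.ArrayBuffer[Task],
    nodeBlacklist: Set[String],
    execBlacklist: Set[String]): Seq[(String, Task)] = {
  // Drop offers from blacklisted nodes/executors up front: O(1) per offer, instead of
  // scanning all pending tasks only to find that none of them may run there.
  val usableOffers = offers.filterNot { o =>
    nodeBlacklist.contains(o.host) || execBlacklist.contains(o.executorId)
  }
  // For the remaining offers, just pop a task off the end of the pending list: O(1) per
  // offer when no per-task blacklist check is needed.
  usableOffers.flatMap { o =>
    if (pendingTasks.nonEmpty) Some(o.executorId -> pendingTasks.remove(pendingTasks.size - 1))
    else None
  }
}

With this kind of filter, an offer from a blacklisted executor or node costs O(1) instead of a full scan of the pending task list; the patch as written applies only the node-level part, which is why executor-level blacklisting is listed as a follow-up above.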

@SparkQA

SparkQA commented May 26, 2016

Test build #59344 has finished for PR 13234 at commit 8f2534b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented Jun 22, 2016

(closing till this is in a better state to avoid triggering tests)

@squito squito closed this Jun 22, 2016