[SPARK-8167] Tasks that fail from YARN preemption should not fail job #13
Conversation
The architecture is as follows: in YARN mode, when the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster reports that the executor died because of preemption, the tasks associated with that executor are not marked as failed. The executor is still removed from the driver's list of available executors, however.
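That flow can be pictured with a minimal, self-contained Scala sketch. The names below (AMClient, DriverSketch, the loss-reason cases) are illustrative stand-ins for this example, not the actual Spark classes touched by this PR.

```scala
// Illustrative sketch: the driver asks the AM why an executor died before deciding
// whether its tasks should count as failures.
sealed trait ExecutorLossReason
case object Preempted extends ExecutorLossReason              // container taken back by YARN
case class ExecutorExited(code: Int) extends ExecutorLossReason

trait AMClient {
  // In YARN mode the driver can ask the ApplicationMaster why a container finished.
  def lossReasonFor(executorId: String): ExecutorLossReason
}

class DriverSketch(am: AMClient) {
  private val executors = scala.collection.mutable.Set("exec-1", "exec-2")

  def onDisconnected(executorId: String): Unit = {
    val reason = am.lossReasonFor(executorId)   // ask the AM before deciding
    executors -= executorId                     // the executor is removed either way
    reason match {
      case Preempted =>
        println(s"$executorId was preempted: do NOT count its tasks as failures")
      case ExecutorExited(code) =>
        println(s"$executorId exited with code $code: tasks count toward failure limits")
    }
  }
}
```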
…happen immediately. Also begin unit tests.
Not sure how I feel about creating the subclass of DriverEndpoint here.
This change only affects behavior in YARN mode. One could conceivably want to do something similar in standalone mode, e.g. ask the Spark master why an executor terminated, but to be safe and to minimize the places this change impacts, I tried to localize everything to YARN mode.
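One way to see how the change stays localized: the generic scheduler backend exposes a factory method for its driver endpoint, and only the YARN backend overrides it. The class and method names in this sketch are illustrative, not the exact Spark internals.

```scala
// Sketch of keeping the behavior change YARN-only via a DriverEndpoint subclass.
class GenericDriverEndpoint {
  def onDisconnected(executorId: String): Unit =
    println(s"$executorId lost for an unknown reason")      // default behavior everywhere
}

class GenericSchedulerBackend {
  protected def createDriverEndpoint(): GenericDriverEndpoint = new GenericDriverEndpoint
}

class YarnSchedulerBackendSketch extends GenericSchedulerBackend {
  // Only YARN mode swaps in an endpoint that first asks the AM for the loss reason.
  override protected def createDriverEndpoint(): GenericDriverEndpoint =
    new GenericDriverEndpoint {
      override def onDisconnected(executorId: String): Unit =
        println(s"$executorId lost; querying the ApplicationMaster for the reason")
    }
}
```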
…" into true or false directly SQL ``` select key from src where 3 in (4, 5); ``` Before ``` == Optimized Logical Plan == Project [key#12] Filter 3 INSET (5,4) MetastoreRelation default, src, None ``` After ``` == Optimized Logical Plan == LocalRelation [key#228], [] ``` Author: Zhongshuai Pei <[email protected]> Author: DoingDone9 <[email protected]> Closes apache#5972 from DoingDone9/InToFalse and squashes the following commits: 4c722a2 [Zhongshuai Pei] Update predicates.scala abe2bbb [Zhongshuai Pei] Update Optimizer.scala fa461a5 [Zhongshuai Pei] Update Optimizer.scala e34c28a [Zhongshuai Pei] Update predicates.scala 24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala 35ceb7a [Zhongshuai Pei] Update Optimizer.scala 36c194e [Zhongshuai Pei] Update Optimizer.scala 2e8f6ca [Zhongshuai Pei] Update Optimizer.scala 14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master 34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master 802261c [DoingDone9] Merge pull request #7 from apache/master d00303b [DoingDone9] Merge pull request #6 from apache/master 98b134f [DoingDone9] Merge pull request #5 from apache/master 161cae3 [DoingDone9] Merge pull request #4 from apache/master c87e8b6 [DoingDone9] Merge pull request #3 from apache/master cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master (cherry picked from commit 4b5e1fe) Signed-off-by: Michael Armbrust <[email protected]>
Ping?
Do you need to keep this hashset? Is it possible for the same executor to appear twice in the onDisconnected() callback?
Happened to me in local testing, but it's still not clear why. Might be a weird Akka thing.
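For illustration, a guard like the hashset discussed here could look like the following; the class and names are hypothetical, not the code in this PR.

```scala
// Hypothetical dedup guard: process each executor's disconnect at most once.
class DisconnectTracker {
  private val alreadyHandled = scala.collection.mutable.HashSet[String]()

  def onDisconnected(executorId: String): Unit = {
    // add() returns false if the id was already present, so duplicates are dropped.
    if (alreadyHandled.add(executorId)) {
      println(s"handling loss of $executorId")
    } else {
      println(s"ignoring duplicate disconnect for $executorId")
    }
  }
}
```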
ok
OK, I'll just submit this PR to upstream.
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
	yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala
I also changed the semantics of YarnAllocator.getExecutorLossReason() to better clean up the completed-executor map, and processCompletedContainers() now always adds an entry to the completed-executor exit-reason map regardless of exit status, since the client is expected to always call getExecutorLossReason().
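A rough sketch of those semantics, with illustrative names rather than the real YarnAllocator code: every completed container records a reason, and reading the reason removes the map entry, since the caller is expected to ask exactly once.

```scala
// Illustrative allocator-side bookkeeping for executor loss reasons.
sealed trait LossReason
case object PreemptedByScheduler extends LossReason
case class NonZeroExit(status: Int) extends LossReason

class AllocatorSketch {
  private val completedReasons = scala.collection.mutable.HashMap[String, LossReason]()

  def processCompletedContainer(executorId: String, exitStatus: Int, preempted: Boolean): Unit = {
    // A reason is recorded regardless of exit status.
    val reason = if (preempted) PreemptedByScheduler else NonZeroExit(exitStatus)
    completedReasons(executorId) = reason
  }

  def getExecutorLossReason(executorId: String): Option[LossReason] =
    completedReasons.remove(executorId)   // returning the reason also cleans it up
}
```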
I lied. Master's updates are making me rethink this implementation and what needs to change.
PR is made against upstream instead.
Previously, when tasks failed on executors that were preempted, those tasks counted as failed tasks. If the number of failures reached spark.task.maxFailures, the entire stage would abort, likely causing the active job to crash. However, preemption is a normal occurrence, and tasks that fail due to preemption should not fail the job.
The solution is to find out why an executor disconnected when it does. Previously, when an executor disconnected, the onDisconnected event was fired to the Spark driver, and the driver simply removed the executor without first finding out why it was terminated. Now, in YARN mode specifically, the driver queries the ApplicationMaster for the reason the executor was terminated and forwards that information to the scheduler. The TaskSetManager takes the executor loss reason into consideration when determining whether the job should fail.
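A simplified sketch of the scheduler-side decision, assuming a preemption loss reason should not count toward spark.task.maxFailures; the names and failure bookkeeping below are illustrative, not the real TaskSetManager.

```scala
// Illustrative TaskSetManager-side logic: re-run tasks from a lost executor, but only
// count the failures toward maxFailures when the loss was caused by the application.
sealed trait Loss { def countsAsTaskFailure: Boolean }
case object PreemptionLoss extends Loss { val countsAsTaskFailure = false }
case object AppCausedLoss  extends Loss { val countsAsTaskFailure = true }

class TaskSetManagerSketch(maxFailures: Int) {
  private val failureCounts = scala.collection.mutable.HashMap[Int, Int]().withDefaultValue(0)

  def executorLost(taskIdsOnExecutor: Seq[Int], reason: Loss): Unit = {
    for (task <- taskIdsOnExecutor) {
      if (reason.countsAsTaskFailure) {
        failureCounts(task) += 1
        if (failureCounts(task) >= maxFailures) {
          println(s"task $task failed $maxFailures times: aborting the stage")
        }
      }
      println(s"re-queueing task $task")   // the task is rescheduled in either case
    }
  }
}
```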
Reviewers requested: @mingyukim @punya @ash211 - this PR is only for review. It's a big one, so I'd like to get it reviewed internally before submitting upstream for review.