
Conversation

@zhonghaihua
Contributor

Currently, when the number of executor failures reaches maxNumExecutorFailures, the ApplicationMaster is killed and a new one re-registers. At that point a new YarnAllocator instance is created, and its executorIdCounter property resets to 0, so the IDs of new executors start from 1 again. Those IDs collide with executors created earlier, which causes FetchFailedException.
This only happens in yarn-client mode, where the driver keeps running across the AM restart. For more details, see the JIRA issue SPARK-12864.
This PR introduces a mechanism to initialize executorIdCounter after the ApplicationMaster is killed.
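To make the idea concrete, here is a minimal, self-contained sketch of the approach (plain Scala, not the actual Spark classes; the names Driver, Allocator and lastAllocatedExecutorId are illustrative): the driver remembers the largest executor ID it has ever registered, and a restarted allocator seeds its counter from that value instead of 0.

```scala
// Toy model of the bug and the fix, not Spark code.
object ExecutorIdCounterDemo {
  class Driver {
    private var maxSeenId = 0
    def register(id: Int): Unit = maxSeenId = math.max(maxSeenId, id)
    def lastAllocatedExecutorId: Int = maxSeenId
  }

  class Allocator(driver: Driver) {
    // The fix: seed the counter from the driver instead of starting at 0.
    private var executorIdCounter = driver.lastAllocatedExecutorId
    def allocate(): Int = {
      executorIdCounter += 1
      driver.register(executorIdCounter)
      executorIdCounter
    }
  }

  def main(args: Array[String]): Unit = {
    val driver = new Driver
    val firstAm = new Allocator(driver)
    println((1 to 3).map(_ => firstAm.allocate())) // Vector(1, 2, 3)
    val secondAm = new Allocator(driver)           // simulated AM restart
    println(secondAm.allocate())                   // 4, not 1, so no ID collision
  }
}
```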

@zhonghaihua zhonghaihua changed the title initialize executorIdCounter after ApplicationMaster killed for max n… [SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n… Jan 17, 2016
@zhonghaihua
Contributor Author

cc @rxin @marmbrus @chenghao-intel @jeanlyn, could you give some advice?

@zhonghaihua
Contributor Author

@marmbrus @liancheng @yhuai Could you verify this patch?

@andrewor14
Contributor

@vanzin @jerryshao IIRC there's a similar patch somewhere to fix this issue?

@jerryshao
Contributor

Yes, this is the yarn-client-only AM reattempt issue. I addressed it before by resetting the state of ExecutorAllocationManager and CoarseGrainedSchedulerBackend, but it looks like there are still some stale state conflicts in BlockManager. For the details you can check the related JIRA.

@andrewor14
Contributor

ok to test

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50524 has finished for PR 10794 at commit 30048ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

@andrewor14 Thanks for reviewing it. Could this patch be merged to master?

@zhonghaihua
Contributor Author

@andrewor14 @marmbrus @rxin, any thoughts or concerns about this patch?

@andrewor14
Contributor

@zhonghaihua @jerryshao How is this related to #11205?

@jerryshao
Contributor

@andrewor14 From my understanding, I don't think it is the same issue.

@zhonghaihua
Contributor Author

Hi @andrewor14, I agree with @jerryshao; I don't think it is related.

@lianhuiwang
Contributor

@jerryshao I think we also need to reset CoarseGrainedSchedulerBackend when dynamic allocation is not enabled, because that clears its state and brings it back to the initial state.

@jerryshao
Contributor

So you mean we also need to clean the state in CoarseGrainedSchedulerBackend when an AM failure occurs, even if dynamic allocation is not enabled?

What specific behavior did you see? @lianhuiwang

@lianhuiwang
Contributor

@jerryshao In yarn-client mode the driver is always running, so when an AM failure occurs, some executors created by the previous AM may still exist after the second AM starts. I think we need to reset CoarseGrainedSchedulerBackend so it removes all historical executors before the second AM allocates new ones.
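As an illustration of that reset idea (a simplified model only; executorDataMap and reset() mirror names from CoarseGrainedSchedulerBackend and #9963, but this is not the actual class), the backend would simply forget all executors from the previous attempt when the AM fails:

```scala
import scala.collection.mutable

// Hypothetical sketch of "reset scheduler backend state on AM failure".
class ToySchedulerBackend {
  private val executorDataMap = mutable.HashMap[String, String]() // executorId -> host
  def registerExecutor(id: String, host: String): Unit = executorDataMap(id) = host
  // Called when the AM fails: drop executors belonging to the previous attempt
  // so the next attempt starts from a clean slate.
  def reset(): Unit = executorDataMap.clear()
  def liveExecutors: Set[String] = executorDataMap.keySet.toSet
}
```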

@jerryshao
Contributor

I see, that's what I was worried about. I thought about this potential issue previously; conflicting executor IDs may also introduce race conditions. Let me think about a proper way to address it.

Utils.localHostName,
port,
sparkConf,
new SecurityManager(sparkConf))
Contributor

why create another endpoint here? Can't we just use driverRef?

Contributor Author

Hi @andrewor14, driverRef doesn't work in this case. As I understand it, driverRef (endpoint name YarnScheduler) sends messages to, and receives messages from, YarnSchedulerEndpoint, while the max executorId has to come from CoarseGrainedSchedulerBackend.DriverEndpoint (endpoint name CoarseGrainedScheduler).

So I think we need a way to initialize executorIdCounter, and, as you said, we should add a thorough comment referencing SPARK-12864 to explain why we need to do it in this method. What's your opinion?

Contributor

YarnSchedulerBackend extends CoarseGrainedSchedulerBackend, so what you mentioned can be achieved directly; check the other code inside the class to see how similar cases are handled. Creating another endpoint here is unnecessary and awkward.
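A rough sketch of that suggestion (simplified stand-in classes, not the real Spark ones): because the YARN backend inherits from the generic backend, it can answer the allocator's query with the inherited counter, without creating a new endpoint.

```scala
// Simplified stand-ins for CoarseGrainedSchedulerBackend / YarnSchedulerBackend.
class CoarseGrainedBackendSketch {
  protected var currentExecutorIdCounter = 0
  def onExecutorRegistered(id: Int): Unit =
    if (id > currentExecutorIdCounter) currentExecutorIdCounter = id
}

case object RetrieveLastAllocatedExecutorId

class YarnBackendSketch extends CoarseGrainedBackendSketch {
  // The YARN-side handler replies with the inherited counter, so the allocator
  // can seed its executorIdCounter after an AM restart.
  def receiveAndReply(msg: Any): Any = msg match {
    case RetrieveLastAllocatedExecutorId => currentExecutorIdCounter
  }
}
```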

Contributor Author

Hi @jerryshao, thanks for your comments. I see what you mean; I will fix it soon. Thanks a lot.

@andrewor14
Contributor

@jerryshao Can you clarify something: even after your fix in #9963 you still run into this issue right? Dynamic allocation or not, how did Spark ever continue to work across AM restarts?

@jerryshao
Contributor

Hi @andrewor14, in our implementation, when the AM fails, all the related executors exit automatically and the driver is notified through disconnection events and removes the related state. Then, when the AM restarts, new executors are registered with the driver.

This assumes all the executors exit before the AM restarts; I'm afraid the AM could be restarted before that happens. To mitigate this, in #9963 I cleaned executorDataMap when reset is invoked, but only for the dynamic-allocation-enabled case. As @lianhuiwang mentioned, we should also clean this state when dynamic allocation is disabled.

Besides, I'm thinking there might be a conflicting executor ID issue, since executor IDs are recalculated when the AM restarts and can collide with the old ones. The issue may exist not only on the driver side but also in the external shuffle service (which now requires the executor ID to do some recovery work), though I haven't hit such an issue so far.

@zhonghaihua
Contributor Author

Hi @andrewor14, the test failure seems to be caused by a GitException. Could you retest it? Thanks a lot.

@tgravescs
Contributor

So we never intended to support AM restart in client mode and have the driver handle it properly. I was expecting the driver to go away when it sees the AM die. At one point the AM attempts were set to 1, and I think we just never handled this case when we made the attempts configurable.

We probably either need to test it out fully or just set the attempts to 1 for client mode.
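For reference, capping the attempts would look something like the sketch below. spark.yarn.maxAppAttempts is the existing setting; whether Spark should force it to 1 for client mode is exactly what is being discussed here, so treat this as an assumption rather than decided behavior.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleAttemptApp {
  def main(args: Array[String]): Unit = {
    // With a single attempt, a failed AM in yarn-client mode fails the application
    // instead of re-registering and re-allocating executors.
    val conf = new SparkConf()
      .setAppName("single-attempt-example")
      .set("spark.yarn.maxAppAttempts", "1")
    val sc = new SparkContext(conf)
    try println(sc.parallelize(1 to 10).sum())
    finally sc.stop()
  }
}
```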

protected val executorsPendingLossReason = new HashSet[String]

// The num of current max ExecutorId used to re-register appMaster
var currentExecutorIdCounter = 0
Contributor

Please add the scope keyword protected.

Contributor Author

Hi @jerryshao, thanks for your comments. The master branch is different from branch-1.5.x: in master, CoarseGrainedSchedulerBackend belongs to the core module and YarnSchedulerBackend belongs to the yarn module, while in branch-1.5.x they are in the same package. So, from my understanding, protected is unsuitable here, right?

Contributor

I don't think so. Though they're in different modules, they're still under the same package; please see other variables like hostToLocalTaskCount.
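A tiny illustration of the visibility point (example package and class names only, not the real Spark classes): a protected member declared in the parent class remains accessible from a subclass even when the subclass is compiled in a different build module, as long as both are on the classpath.

```scala
package org.example.cluster

class BaseBackend {
  protected var currentExecutorIdCounter = 0
}

class YarnLikeBackend extends BaseBackend {
  // Legal: protected members are visible to subclasses across modules.
  def highestExecutorId: Int = currentExecutorIdCounter
}
```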

Contributor Author

Hi @jerryshao, you are right. I've fixed it now. Thanks a lot.

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51938 has finished for PR 10794 at commit 3a1724c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

retest this please.

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51971 has finished for PR 10794 at commit 659c505.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

Hi @andrewor14, could you review this? Thanks a lot.

@zhonghaihua
Contributor Author

Hi @andrewor14, any thoughts or concerns about this patch?

@zhonghaihua
Contributor Author

@andrewor14 @tgravescs @vanzin Could you verify this PR? Any thoughts or concerns? Thanks a lot.

@andrewor14
Contributor

This looks OK. Any thoughts @vanzin @tgravescs?

* Used to generate a unique ID per executor
*
* Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
* the id of new executor will start from 1, this will conflict with the executor has
Contributor

I think we need to clarify this to say it is required for client mode, where the driver isn't running on YARN; this isn't an issue in cluster mode.

Contributor Author

@tgravescs I think we can clarify this in the SPARK-12864 issue. @andrewor14 What's your opinion?

Contributor

I would prefer to do it here in this comment to better describe the situation in which this is needed. It should be a line or two, and I personally much prefer that to pointing at JIRAs, unless a big discussion or background is required, in which case the JIRA makes more sense.

Contributor Author

@tgravescs Ok, I will do it soon. Thanks a lot.

@tgravescs
Contributor

Also, please update the description of this PR and the JIRA to explain that this happens in client mode because the driver is not running on YARN.

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54685 has finished for PR 10794 at commit ebe3c7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

@andrewor14 @tgravescs @vanzin The code and the comment have been improved, and the description of this PR and the JIRA has also been updated. Please review it again. Thanks a lot.

@vanzin
Contributor

vanzin commented Apr 1, 2016

LGTM, I'll leave it to @tgravescs to do a final review.

@tgravescs
Contributor

+1

@asfgit asfgit closed this in bd7b91c Apr 1, 2016
@tgravescs
Contributor

@zhonghaihua what is your JIRA id so I can assign it to you?

@zhonghaihua
Contributor Author

Hi @tgravescs, my JIRA id is Iward. Thanks a lot.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 6, 2016