
Conversation

@zhonghaihua
Contributor

Currently, when the number of executor failures reaches maxNumExecutorFailures, the ApplicationMaster is killed and a new one re-registers. At that point a new YarnAllocator instance is created, and its executorIdCounter property resets to 0, so the IDs of new executors start from 1 again. Those IDs collide with executors created earlier, which causes FetchFailedException.
This only happens in yarn-client mode, where the driver keeps running across the AM restart. For more details, see the JIRA issue SPARK-12864.
This PR introduces a mechanism to initialize executorIdCounter after the ApplicationMaster is killed.
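To make the idea concrete, here is a minimal, self-contained sketch of the approach (plain Scala, not the actual Spark classes; the names Driver, Allocator and lastAllocatedExecutorId are illustrative): the driver remembers the largest executor ID it has ever registered, and a restarted allocator seeds its counter from that value instead of 0.

```scala
// Toy model of the bug and the fix, not Spark code.
object ExecutorIdCounterDemo {
  class Driver {
    private var maxSeenId = 0
    def register(id: Int): Unit = maxSeenId = math.max(maxSeenId, id)
    def lastAllocatedExecutorId: Int = maxSeenId
  }

  class Allocator(driver: Driver) {
    // The fix: seed the counter from the driver instead of starting at 0.
    private var executorIdCounter = driver.lastAllocatedExecutorId
    def allocate(): Int = {
      executorIdCounter += 1
      driver.register(executorIdCounter)
      executorIdCounter
    }
  }

  def main(args: Array[String]): Unit = {
    val driver = new Driver
    val firstAm = new Allocator(driver)
    println((1 to 3).map(_ => firstAm.allocate())) // Vector(1, 2, 3)
    val secondAm = new Allocator(driver)           // simulated AM restart
    println(secondAm.allocate())                   // 4, not 1, so no ID collision
  }
}
```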

@zhonghaihua zhonghaihua changed the title initialize executorIdCounter after ApplicationMaster killed for max n… [SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n… Jan 17, 2016
@zhonghaihua
Contributor Author

cc @rxin @marmbrus @chenghao-intel @jeanlyn, could you give some advice?

@zhonghaihua
Contributor Author

@marmbrus @liancheng @yhuai Could you verify this patch?

@andrewor14
Contributor

@vanzin @jerryshao IIRC there's a similar patch somewhere to fix this issue?

@jerryshao
Contributor

Yes, this is the yarn-client-only AM reattempt issue. I addressed it before by resetting the state of ExecutorAllocationManager and CoarseGrainedSchedulerBackend, but it looks like there are still some stale state conflicts in BlockManager. For the details you can check the related JIRA.

@andrewor14
Contributor

ok to test

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50524 has finished for PR 10794 at commit 30048ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

@andrewor14 Thanks for reviewing it. Could this patch be merged to master?

@zhonghaihua
Contributor Author

@andrewor14 @marmbrus @rxin, any thoughts or concerns about this patch?

@andrewor14
Contributor

@zhonghaihua @jerryshao How is this related to #11205?

@jerryshao
Contributor

@andrewor14 From my understanding, I don't think it is the same issue.

@zhonghaihua
Contributor Author

Hi @andrewor14, I agree with @jerryshao; I don't think it is related.

@lianhuiwang
Contributor

@jerryshao I think we also need to reset CoarseGrainedSchedulerBackend when dynamic allocation is not enabled, because that clears its state and brings it back to the initial state.

@jerryshao
Contributor

So you mean we also need to clean the state in CoarseGrainedSchedulerBackend when an AM failure occurs, even if dynamic allocation is not enabled?

What specific behavior did you see? @lianhuiwang

@lianhuiwang
Contributor

@jerryshao In yarn-client mode the driver is always running, so when an AM failure occurs, some executors created by the previous AM may still exist after the second AM starts. I think we need to reset CoarseGrainedSchedulerBackend so it removes all historical executors before the second AM allocates new ones.
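As an illustration of that reset idea (a simplified model only; executorDataMap and reset() mirror names from CoarseGrainedSchedulerBackend and #9963, but this is not the actual class), the backend would simply forget all executors from the previous attempt when the AM fails:

```scala
import scala.collection.mutable

// Hypothetical sketch of "reset scheduler backend state on AM failure".
class ToySchedulerBackend {
  private val executorDataMap = mutable.HashMap[String, String]() // executorId -> host
  def registerExecutor(id: String, host: String): Unit = executorDataMap(id) = host
  // Called when the AM fails: drop executors belonging to the previous attempt
  // so the next attempt starts from a clean slate.
  def reset(): Unit = executorDataMap.clear()
  def liveExecutors: Set[String] = executorDataMap.keySet.toSet
}
```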

@jerryshao
Contributor

I see, that's what I was worried about. I thought about this potential issue previously; conflicting executor IDs may also introduce race conditions. Let me think about a proper way to address it.

Utils.localHostName,
port,
sparkConf,
new SecurityManager(sparkConf))
Contributor

why create another endpoint here? Can't we just use driverRef?

Contributor Author

Hi @andrewor14, driverRef doesn't work in this case. As I understand it, driverRef (endpoint name YarnScheduler) sends messages to, and receives messages from, YarnSchedulerEndpoint, while the max executorId has to come from CoarseGrainedSchedulerBackend.DriverEndpoint (endpoint name CoarseGrainedScheduler).

So I think we need a way to initialize executorIdCounter, and, as you said, we should add a thorough comment referencing SPARK-12864 to explain why we need to do it in this method. What's your opinion?

Contributor

YarnSchedulerBackend extends CoarseGrainedSchedulerBackend, so what you mentioned can be achieved directly; check the other code inside the class to see how similar cases are handled. Creating another endpoint here is unnecessary and awkward.
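A rough sketch of that suggestion (simplified stand-in classes, not the real Spark ones): because the YARN backend inherits from the generic backend, it can answer the allocator's query with the inherited counter, without creating a new endpoint.

```scala
// Simplified stand-ins for CoarseGrainedSchedulerBackend / YarnSchedulerBackend.
class CoarseGrainedBackendSketch {
  protected var currentExecutorIdCounter = 0
  def onExecutorRegistered(id: Int): Unit =
    if (id > currentExecutorIdCounter) currentExecutorIdCounter = id
}

case object RetrieveLastAllocatedExecutorId

class YarnBackendSketch extends CoarseGrainedBackendSketch {
  // The YARN-side handler replies with the inherited counter, so the allocator
  // can seed its executorIdCounter after an AM restart.
  def receiveAndReply(msg: Any): Any = msg match {
    case RetrieveLastAllocatedExecutorId => currentExecutorIdCounter
  }
}
```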

Contributor Author

Hi @jerryshao, thanks for your comments. I see what you mean; I will fix it soon. Thanks a lot.

@andrewor14
Contributor

@jerryshao Can you clarify something: even after your fix in #9963 you still run into this issue right? Dynamic allocation or not, how did Spark ever continue to work across AM restarts?

@jerryshao
Contributor

Hi @andrewor14, in our implementation, when the AM fails, all the related executors exit automatically and the driver is notified through disconnection events and removes the related state. Then, when the AM restarts, new executors are registered with the driver.

This assumes all the executors exit before the AM restarts; I'm afraid the AM could be restarted before that happens. To mitigate this, in #9963 I cleaned executorDataMap when reset is invoked, but only for the dynamic-allocation-enabled case. As @lianhuiwang mentioned, we should also clean this state when dynamic allocation is disabled.

Besides, I'm thinking there might be a conflicting executor ID issue, since executor IDs are recalculated when the AM restarts and can collide with the old ones. The issue may exist not only on the driver side but also in the external shuffle service (which now requires the executor ID to do some recovery work), though I haven't hit such an issue so far.

@zhonghaihua
Contributor Author

Hi @andrewor14, the test failure seems to be caused by a GitException. Could you retest it? Thanks a lot.

@tgravescs
Contributor

So we never intended to support AM restart in client mode and have the driver handle it properly. I was expecting the driver to go away when it sees the AM die. At one point the AM attempts were set to 1, and I think we just never handled this case when we made the attempts configurable.

We probably either need to test it out fully or just set the attempts to 1 for client mode.
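For reference, capping the attempts would look something like the sketch below. spark.yarn.maxAppAttempts is the existing setting; whether Spark should force it to 1 for client mode is exactly what is being discussed here, so treat this as an assumption rather than decided behavior.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleAttemptApp {
  def main(args: Array[String]): Unit = {
    // With a single attempt, a failed AM in yarn-client mode fails the application
    // instead of re-registering and re-allocating executors.
    val conf = new SparkConf()
      .setAppName("single-attempt-example")
      .set("spark.yarn.maxAppAttempts", "1")
    val sc = new SparkContext(conf)
    try println(sc.parallelize(1 to 10).sum())
    finally sc.stop()
  }
}
```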

protected val executorsPendingLossReason = new HashSet[String]

// The num of current max ExecutorId used to re-register appMaster
var currentExecutorIdCounter = 0
Contributor

Please add the scope keyword protected.

Contributor Author

Hi @jerryshao, thanks for your comments. The master branch is different from branch-1.5.x: in master, CoarseGrainedSchedulerBackend belongs to the core module and YarnSchedulerBackend belongs to the yarn module, while in branch-1.5.x they are in the same package. So, from my understanding, protected is unsuitable here, right?

Contributor

I don't think so. Though they're in different modules, they're still under the same package; please see other variables like hostToLocalTaskCount.
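A tiny illustration of the visibility point (example package and class names only, not the real Spark classes): a protected member declared in the parent class remains accessible from a subclass even when the subclass is compiled in a different build module, as long as both are on the classpath.

```scala
package org.example.cluster

class BaseBackend {
  protected var currentExecutorIdCounter = 0
}

class YarnLikeBackend extends BaseBackend {
  // Legal: protected members are visible to subclasses across modules.
  def highestExecutorId: Int = currentExecutorIdCounter
}
```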

Contributor Author

Hi @jerryshao, you are right. I've fixed it now. Thanks a lot.

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51938 has finished for PR 10794 at commit 3a1724c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

retest this please.

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51971 has finished for PR 10794 at commit 659c505.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

Hi @andrewor14, could you review this? Thanks a lot.

@zhonghaihua
Contributor Author

Hi @andrewor14, any thoughts or concerns about this patch?

@zhonghaihua
Contributor Author

@andrewor14 @tgravescs @vanzin Could you verify this PR? Any thoughts or concerns? Thanks a lot.

@andrewor14
Contributor

This looks OK. Any thoughts @vanzin @tgravescs?

* Used to generate a unique ID per executor
*
* Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
* the id of new executor will start from 1, this will conflict with the executor has
Contributor

I think we need to clarify this to say it is required for client mode, where the driver isn't running on YARN; this isn't an issue in cluster mode.

Contributor Author

@tgravescs I think we can clarify this in the SPARK-12864 issue. @andrewor14 What's your opinion?

Contributor

I would prefer to do it here in this comment to better describe the situation in which this is needed. It should be a line or two, and I personally much prefer that to pointing at JIRAs, unless a big discussion or background is required, in which case the JIRA makes more sense.

Contributor Author

@tgravescs Ok, I will do it soon. Thanks a lot.

@tgravescs
Contributor

Also, please update the description of this PR and the JIRA to explain that this happens in client mode because the driver is not running on YARN.

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54685 has finished for PR 10794 at commit ebe3c7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhonghaihua
Contributor Author

@andrewor14 @tgravescs @vanzin The code and the comment have been improved, and the description of this PR and the JIRA has also been updated. Please review it again. Thanks a lot.

@vanzin
Contributor

vanzin commented Apr 1, 2016

LGTM, I'll leave it to @tgravescs to do a final review.

@tgravescs
Contributor

+1

@asfgit asfgit closed this in bd7b91c Apr 1, 2016
@tgravescs
Contributor

@zhonghaihua what is your JIRA id so I can assign it to you?

@zhonghaihua
Contributor Author

Hi @tgravescs, my JIRA id is Iward. Thanks a lot.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 6, 2016