[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached #10794
Conversation
cc @rxin @marmbrus @chenghao-intel @jeanlyn, could you give some advice?

@marmbrus @liancheng @yhuai Could you verify this patch?

@vanzin @jerryshao IIRC there's a similar patch somewhere to fix this issue?

Yes, this is the yarn-client-only AM reattempt issue. I addressed this before by resetting the status of

ok to test

Test build #50524 has finished for PR 10794 at commit

@andrewor14 Thanks for reviewing it. Could this patch be merged to master?

@andrewor14 @marmbrus @rxin, any thoughts or concerns about this patch?
|
@zhonghaihua @jerryshao How is this related to #11205? |
|
@andrewor14 from my understanding I don't think it is the same issue. |
|
Hi, @andrewor14 , I agree with @jerryshao , I think that is not related to it. |
|
@jerryshao I think it needs to reset in CoarseGrainedSchedulerBackend when dynamicAllocation is not enabled. because it can clear information and come back to initialization state. |
|
So you mean we also need to clean the states in What specific behavior did you see? @lianhuiwang |
|
@jerryshao when yarn-client, because driver is always running, when AM failure occurs, some executors that is created by previous AM may still exist after second AM start. so I think we need to reset in CoarseGrainedSchedulerBackend that can remove all historical executors before second AM allocate new executors. |
|
I see, that's what I worried about. I thought about this potential issue previously, also conflict executor id may bring in race conditions. Let me think about a proper way to address it. |
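The conflict described above can be illustrated with a small self-contained sketch (plain Scala, no Spark dependencies; all names here are hypothetical stand-ins, not Spark's actual classes): if the restarted AM's counter goes back to 0 while executors from the first attempt survive, the freshly assigned ids collide with live ones, whereas seeding the counter with the driver's current maximum avoids the overlap.

```scala
// Hypothetical model of the bug: the driver still tracks executors from
// the first AM attempt, while a restarted allocator hands out ids again.
object ExecutorIdConflictModel {
  // Ids of executors the driver still tracks from the first AM attempt.
  def survivingIds: Set[Int] = Set(1, 2, 3)

  // A fresh YarnAllocator-like counter that was (incorrectly) reset to 0
  // assigns ids 1..count again.
  def nextIdsAfterReset(count: Int): Seq[Int] = 1 to count

  // A counter initialized from the driver's current maximum keeps numbering
  // past the executors that already exist.
  def nextIdsAfterInit(count: Int, maxExisting: Int): Seq[Int] =
    (maxExisting + 1) to (maxExisting + count)

  // Any overlap between surviving and newly assigned ids is a conflict.
  def conflicts(newIds: Seq[Int]): Set[Int] =
    survivingIds.intersect(newIds.toSet)
}
```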
    Utils.localHostName,
    port,
    sparkConf,
    new SecurityManager(sparkConf))
why create another endpoint here? Can't we just use driverRef?
Hi @andrewor14, driverRef doesn't work in this case. For my understanding, driverRef points to the endpoint named YarnScheduler, which sends messages to (and gets messages from) YarnSchedulerEndpoint, while we need to get the max executor id from CoarseGrainedSchedulerBackend.DriverEndpoint, whose endpoint name is CoarseGrainedScheduler.
So I think we need a method to initialize executorIdCounter. And as you said, we should add a big comment related to SPARK-12864 to explain why we need to do this in that method. What's your opinion?
YarnSchedulerBackend extends CoarseGrainedSchedulerBackend, so what you mentioned can be achieved; you can check other code inside the class to see how it handles this. Creating another endpoint is unnecessary and weird here.
Hi @jerryshao, thanks for your comments. I see what you mean; I will fix it soon. Thanks a lot.
@jerryshao Can you clarify something: even after your fix in #9963 you still run into this issue, right? Dynamic allocation or not, how did Spark ever continue to work across AM restarts?

Hi @andrewor14, in our implementation, currently when the AM fails, all the related executors will exit automatically, and the driver will be notified with disconnection events and remove the related states. Then when the AM restarts, new executors will be registered with the driver. Here we assume all the executors will have exited before the AM restarts. I'm afraid the AM could possibly be restarted before all the executors have exited. To try to fix this, in #9963 I cleaned

Besides, what I'm thinking is that there might be a conflicting executor id issue, since executor ids will be recalculated when the AM restarts, which will conflict with the old ones. The issue may not only be on the driver side but also in the external shuffle service (since the external shuffle service requires executor ids to do some recovery work), but I haven't met such an issue till now.
… and add some comment
Hi @andrewor14, the reason the test failed seems

So we never intended to support AM restart in client mode and have the driver handle that properly. I was expecting the driver to see the AM die and then go away. At one point the AM attempts were set to 1, and I think we just never handled it when we changed that to be configurable. We probably either need to test it out fully or just set the attempts to 1 for client mode.
    protected val executorsPendingLossReason = new HashSet[String]

    // The num of current max ExecutorId used to re-register appMaster
    var currentExecutorIdCounter = 0
Please add scope keyword protected
Hi @jerryshao, thanks for your comments. The master branch is different from branch-1.5. In master, CoarseGrainedSchedulerBackend belongs to the core module and YarnSchedulerBackend belongs to the yarn module, while in branch-1.5 they belong to the same package. So, from my understanding, protected is unsuitable here, right?
I don't think so. Though they're in different modules, they're still under the same package; please see other variables like hostToLocalTaskCount.
Hi @jerryshao, you are right. I've fixed it now. Thanks a lot.
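The visibility point settled above can be shown with a self-contained sketch (the class names here are hypothetical stand-ins, not Spark's actual classes): a `protected` member declared in a base class remains accessible to a subclass regardless of which build module the subclass is compiled in, because Scala resolves visibility by inheritance and package, not by module boundaries.

```scala
// Hypothetical stand-in for CoarseGrainedSchedulerBackend (core module).
abstract class SchedulerBackendLike {
  // `protected` members are visible to subclasses even across modules.
  protected var currentExecutorIdCounter = 0

  protected def noteExecutorId(id: Int): Unit = {
    if (id > currentExecutorIdCounter) currentExecutorIdCounter = id
  }
}

// Hypothetical stand-in for YarnSchedulerBackend (yarn module): the subclass
// reads and updates the protected counter through inherited members.
class YarnBackendLike extends SchedulerBackendLike {
  def registerExecutor(id: Int): Int = {
    noteExecutorId(id)
    currentExecutorIdCounter
  }
}
```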
Test build #51938 has finished for PR 10794 at commit

retest this please.

Test build #51971 has finished for PR 10794 at commit

Hi @andrewor14, could you review this? Thanks a lot.

Hi @andrewor14, any thoughts or concerns about this patch?

@andrewor14 @tgravescs @vanzin Could you verify this PR, or any thoughts or concerns about it? Thanks a lot.

This looks OK. Any thoughts @vanzin @tgravescs?
    * Used to generate a unique ID per executor
    *
    * Init `executorIdCounter`. When the AM restarts, `executorIdCounter` will reset to 0. Then
    * the id of a new executor will start from 1, which will conflict with executors that have
I think we need to clarify this to say it is required for client mode, where the driver isn't running on YARN; this isn't an issue in cluster mode.
@tgravescs I think we can clarify this in the SPARK-12864 issue. @andrewor14 What's your opinion?
I would prefer to do it here in this comment, to better describe the situation where this is needed. It should be a line or two, and I personally much prefer that to pointing at JIRAs, unless a big discussion or background is required, in which case the JIRA makes more sense.
@tgravescs Ok, I will do it soon. Thanks a lot.
Also please update the description of this PR and the JIRA to describe that this is happening in client mode due to the driver not running on YARN.

Test build #54685 has finished for PR 10794 at commit

@andrewor14 @tgravescs @vanzin The code and the comment are optimized, and the description of this PR and the JIRA is also updated. Please review it again. Thanks a lot.

LGTM, I'll leave it to @tgravescs to do a final review.

+1

@zhonghaihua what is your jira id so I can assign it to you?

Hi @tgravescs, my jira id is
…onMaster killed for max num executor failures apache#10794

Currently, when the max number of executor failures reaches maxNumExecutorFailures, the ApplicationMaster will be killed and another one re-registered. This time, a new instance of YarnAllocator will be created. But the value of the property executorIdCounter in YarnAllocator will reset to 0, so the ids of new executors will start from 1. This will conflict with executors that were already created before, which will cause a FetchFailedException. This situation occurs only in yarn-client mode. For more details, see JIRA issue SPARK-12864.

This PR introduces a mechanism to initialize executorIdCounter after the ApplicationMaster is killed.
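The mechanism can be sketched with plain Scala stubs (the actual patch works over Spark's RPC layer, so the stub classes and method names here are illustrative, not Spark's API): the allocator seeds its counter by asking the driver-side backend for the last allocated executor id instead of starting from 0.

```scala
// Illustrative message: "what is the highest executor id you have seen?"
case object RetrieveLastAllocatedExecutorId

// Stand-in for the driver-side scheduler backend, which tracks the
// current maximum executor id as executors register.
class DriverEndpointStub(var currentExecutorIdCounter: Int) {
  def ask(message: Any): Int = message match {
    case RetrieveLastAllocatedExecutorId => currentExecutorIdCounter
  }
}

// Stand-in for YarnAllocator: on (re)creation it initializes its counter
// from the driver, so a restarted AM does not reuse ids of executors that
// survived the previous attempt.
class YarnAllocatorStub(driverRef: DriverEndpointStub) {
  private var executorIdCounter: Int =
    driverRef.ask(RetrieveLastAllocatedExecutorId)

  def requestNewExecutorId(): Int = {
    executorIdCounter += 1
    executorIdCounter
  }
}
```

With a driver that already saw executors up to id 7, a freshly created allocator continues numbering from 8 rather than restarting at 1.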