[SPARK-27736][Core][SHUFFLE] Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations #26272
@dongjoon-hyun @HyukjinKwon
@squito Hi, can you help take a look at this solution?
Can you clarify the case you are trying to fix this for: which cluster manager, etc.? Are you using YARN? If so, do you not have node manager recovery enabled?
Thanks for your reply.
So is there a reason you don't just turn that on? It should fix this issue. I would assume that if you are running other application types, MapReduce and Tez for example, you have the same issue.
Thanks for your suggestion.
I agree with @tgravescs that YARN's NM recovery should solve this. That said, as Josh noted in the JIRA, we don't have a better solution for standalone and Mesos. At a high level, I think this approach makes sense.
Thanks a lot, I will complete it later.
cc @attilapiros
I have added unit tests for ExternalBlockHandler and ExternalShuffleBlockResolver.
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
As described in https://issues.apache.org/jira/browse/SPARK-27736, if a single external shuffle service process restarts and fails to recover its list of registered executors, many FetchFailedExceptions are thrown, which eventually causes the application to fail.
This PR lets externalBlockClient query whether executors are registered with the External Shuffle Service.
When a FetchFailedException is thrown, it queries whether the executors on that host are still registered.
If they are not, the related shuffle output is unregistered.
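The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ShuffleService`, `OutputTracker`, and `handleFetchFailure` are hypothetical stand-ins for the shuffle service's registration table, the driver-side bookkeeping of shuffle output, and the fetch-failure handler, and none of them are Spark APIs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class FetchFailureSketch {

    /** Hypothetical stand-in for the shuffle service's executor registration table. */
    static class ShuffleService {
        private final Set<String> registered = new HashSet<>();

        void register(String execId) { registered.add(execId); }

        /** Simulates a restart that fails to recover the registered executors. */
        void restartWithoutRecovery() { registered.clear(); }

        boolean isRegistered(String execId) { return registered.contains(execId); }
    }

    /** Hypothetical stand-in for driver-side bookkeeping of shuffle output. */
    static class OutputTracker {
        private final Map<String, Set<String>> executorsByHost = new TreeMap<>();
        private final Set<String> executorsWithOutput = new HashSet<>();

        void addOutput(String host, String execId) {
            executorsByHost.computeIfAbsent(host, h -> new TreeSet<>()).add(execId);
            executorsWithOutput.add(execId);
        }

        Set<String> executorsOn(String host) {
            return executorsByHost.getOrDefault(host, new TreeSet<>());
        }

        void unregisterOutput(String execId) { executorsWithOutput.remove(execId); }

        boolean hasOutput(String execId) { return executorsWithOutput.contains(execId); }
    }

    /**
     * On a fetch failure from `host`, ask the shuffle service about each executor
     * on that host and unregister the output of those it no longer knows about.
     * Returns the executors whose output was unregistered.
     */
    static List<String> handleFetchFailure(String host, ShuffleService service,
                                           OutputTracker tracker) {
        List<String> unregistered = new ArrayList<>();
        for (String execId : tracker.executorsOn(host)) {
            if (!service.isRegistered(execId)) {
                tracker.unregisterOutput(execId);
                unregistered.add(execId);
            }
        }
        return unregistered;
    }

    public static void main(String[] args) {
        ShuffleService service = new ShuffleService();
        OutputTracker tracker = new OutputTracker();
        service.register("exec-1");
        tracker.addOutput("host-a", "exec-1");

        // The service restarts and loses its registrations, then a fetch fails.
        service.restartWithoutRecovery();
        List<String> unregistered = handleFetchFailure("host-a", service, tracker);
        System.out.println("Unregistered output for: " + unregistered);
    }
}
```

The point of the registration check is that a fetch failure caused by something else leaves the output alone, while a lost registration invalidates the output of every affected executor on the host at once, so it can be recomputed without waiting for repeated failures.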
Why are the changes needed?
This PR improves the handling of FetchFailures caused by the ExternalShuffleService losing track of executor registrations.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests.