@turboFei
Member

@turboFei turboFei commented Oct 27, 2019

What changes were proposed in this pull request?

As described in https://issues.apache.org/jira/browse/SPARK-27736, if an external shuffle service process restarts and fails to recover its list of registered executors, many FetchFailedExceptions are thrown, which eventually causes the application to fail.

In this PR, I let externalBlockClient query whether executors are registered with the External Shuffle Service.
When a FetchFailedException is thrown, I query whether the executors on that host are still registered.
If they are not, I unregister the related shuffle output.
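
The flow above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ShuffleService`, `OutputTracker`, and `onFetchFailed` are hypothetical stand-ins, not Spark classes or methods, and the in-memory sets only model the registration table and the driver's bookkeeping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FetchFailureSketch {

    /** Hypothetical stand-in for the shuffle service's table of registered executors. */
    static class ShuffleService {
        private final Set<String> registered = new HashSet<>();

        void register(String execId) { registered.add(execId); }

        /** Models a restart that fails to recover the registration table. */
        void restartWithoutRecovery() { registered.clear(); }

        boolean isRegistered(String execId) { return registered.contains(execId); }
    }

    /** Hypothetical stand-in for driver-side tracking of map outputs per executor. */
    static class OutputTracker {
        private final Map<String, List<Integer>> outputsByExecutor = new HashMap<>();

        void addOutput(String execId, int mapId) {
            outputsByExecutor.computeIfAbsent(execId, k -> new ArrayList<>()).add(mapId);
        }

        void unregisterOutputs(String execId) { outputsByExecutor.remove(execId); }

        boolean hasOutputs(String execId) { return outputsByExecutor.containsKey(execId); }
    }

    /**
     * On a fetch failure, ask the shuffle service about each executor on the
     * failing host and unregister the outputs of any executor it no longer knows.
     * Returns the executors whose outputs were unregistered.
     */
    static List<String> onFetchFailed(ShuffleService service,
                                      OutputTracker tracker,
                                      List<String> executorsOnHost) {
        List<String> dropped = new ArrayList<>();
        for (String execId : executorsOnHost) {
            if (!service.isRegistered(execId)) {
                tracker.unregisterOutputs(execId);
                dropped.add(execId);
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        ShuffleService service = new ShuffleService();
        OutputTracker tracker = new OutputTracker();
        for (String execId : Arrays.asList("exec-1", "exec-2")) {
            service.register(execId);
            tracker.addOutput(execId, 0);
        }

        // The service restarts and loses its registrations; the next fetch fails.
        service.restartWithoutRecovery();
        List<String> dropped =
            onFetchFailed(service, tracker, Arrays.asList("exec-1", "exec-2"));
        System.out.println("Unregistered outputs for: " + dropped);
    }
}
```

The point of the sketch is that all outputs of an unregistered executor are dropped after a single failure, instead of being discovered one FetchFailedException at a time.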

Why are the changes needed?

This PR improves the handling of FetchFailures caused by the ExternalShuffleService losing track of executor registrations.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

@turboFei turboFei force-pushed the SPARK-27736-executor-not-registered branch from b8f1c09 to 77aa0c1 Compare October 27, 2019 09:50
@turboFei
Member Author

turboFei commented Oct 27, 2019

@dongjoon-hyun @HyukjinKwon
Can you help take a look? Thanks in advance.

@turboFei turboFei force-pushed the SPARK-27736-executor-not-registered branch 2 times, most recently from c7aa84b to 7692170 Compare October 27, 2019 10:53
… by ExternalShuffleService losing track of executor registrations
@turboFei turboFei force-pushed the SPARK-27736-executor-not-registered branch from 7692170 to 082b6a1 Compare October 27, 2019 10:54
@turboFei
Member Author

@squito Hi, can you help take a look at this solution?
If it is OK, I will complete the UT and other details.
Thanks a lot.

@turboFei
Member Author

cc @JoshRosen @tgravescs

@tgravescs
Contributor

tgravescs commented Oct 28, 2019

Can you clarify the case you are trying to fix this for - what cluster manager, etc? Are you using YARN - if so do you not have node manager recovery enabled?

@turboFei
Member Author

turboFei commented Oct 28, 2019

> Can you clarify the case you are trying to fix this for - what cluster manager, etc? Are you using YARN - if so do you not have node manager recovery enabled?

Thanks for your reply.
@tgravescs
Yes, our cluster manager is YARN. I just checked, and we do not have node manager recovery enabled.

@tgravescs
Contributor

so is there a reason you don't just turn that on? it should fix this issue. I would assume if you are running other application types you have the same issue - map reduce and tez for example.
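
For reference, a rough sketch of what "turning that on" involves, written as a small checker rather than a config file. It assumes the settings from yarn-site.xml have already been loaded into a plain map; `NmRecoveryCheck` and `problems` are hypothetical names. The two property keys are standard YARN settings, and the sketch treats a missing `yarn.nodemanager.address` as the ephemeral port 0, which is an assumption about the default.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NmRecoveryCheck {

    /**
     * Returns human-readable problems that would prevent NodeManager recovery
     * from working, given yarn-site.xml settings as a key/value map.
     */
    static List<String> problems(Map<String, String> yarnSite) {
        List<String> out = new ArrayList<>();

        // Recovery has to be switched on explicitly.
        String enabled = yarnSite.getOrDefault("yarn.nodemanager.recovery.enabled", "false");
        if (!"true".equalsIgnoreCase(enabled.trim())) {
            out.add("yarn.nodemanager.recovery.enabled is not true");
        }

        // The NodeManager needs a fixed RPC port so that it comes back on the
        // same address after a restart; port 0 means "pick any free port".
        String address = yarnSite.getOrDefault("yarn.nodemanager.address", "0.0.0.0:0");
        if (address.trim().endsWith(":0")) {
            out.add("yarn.nodemanager.address uses an ephemeral port");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> yarnSite = new HashMap<>();
        yarnSite.put("yarn.nodemanager.recovery.enabled", "true");
        yarnSite.put("yarn.nodemanager.address", "0.0.0.0:8041"); // example port only
        System.out.println("Problems found: " + problems(yarnSite));
    }
}
```

The state directory is set separately through `yarn.nodemanager.recovery.dir`; the sketch does not check it.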

@turboFei
Member Author

> so is there a reason you don't just turn that on? it should fix this issue. I would assume if you are running other application types you have the same issue - map reduce and tez for example.

Thanks for your suggestion.

@squito
Contributor

squito commented Oct 28, 2019

I agree with @tgravescs that yarn's NM recovery should solve this. That said, as Josh noted in the jira, we don't have a better solution for standalone and mesos. At a high level, I think this approach makes sense.

@turboFei
Member Author

> I agree with @tgravescs that yarn's NM recovery should solve this. That said, as Josh noted in the jira, we don't have a better solution for standalone and mesos. At a high level, I think this approach makes sense.

Thanks a lot, I will complete it later.

@squito
Contributor

squito commented Oct 30, 2019

cc @attilapiros

@turboFei
Member Author

I have added unit tests for ExternalBlockHandler and ExternalShuffleBlockResolver.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented May 9, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 9, 2020
@github-actions github-actions bot closed this May 10, 2020

5 participants