[SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed to start when the chosen dir… #18905
Conversation
Can one of the admins verify this patch?
jerryshao left a comment:
My thinking is that if work preserving is enabled (the recovery path is not null), then the user should guarantee the availability of this directory; I'm not sure it is a good idea to switch to other directories (YARN may rely on it internally).
Also, would you please add a unit test to verify your logic.
/**
 * Check the chosen DB file available or not.
 */
I'm not sure this is a thorough way to check disk health. In our internal case, we found that a disk was not mounted (due to a failure), and trying to write to this unmounted disk threw a permission-denied exception.
An unwritable disk is just one case of an unhealthy disk; maybe we should look at YARN's disk health check mechanism.
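For illustration only, here is a minimal sketch of a broader directory check in the spirit of YARN's disk health checks; the class and method names are hypothetical and not from the PR:

```java
import java.io.File;
import java.io.IOException;

final class DirHealth {
  /**
   * Verifies that the directory exists (or can be created) and is
   * readable, writable, and executable. A single write probe would
   * only cover the "unwritable" facet of disk health.
   */
  static void checkDir(File dir) throws IOException {
    if (!dir.exists() && !dir.mkdirs()) {
      throw new IOException("Cannot create directory: " + dir);
    }
    if (!dir.isDirectory()) {
      throw new IOException("Not a directory: " + dir);
    }
    if (!dir.canRead() || !dir.canWrite() || !dir.canExecute()) {
      throw new IOException("Insufficient permissions on: " + dir);
    }
  }
}
```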
/**
 * Check the chosen DB file available or not.
 */
protected Boolean checkFileAvailable(File file)
Two-space indent for the Java code.
}
}
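Only the javadoc and signature of checkFileAvailable appear in the excerpt above. As a hedged sketch, one possible write-probe body could look like the following; the wrapper class and the probe logic are assumptions for illustration, not the PR's actual implementation:

```java
import java.io.File;
import java.io.IOException;

class DbFileCheck {
  /**
   * Probes whether the chosen DB file's parent directory is writable
   * by creating and deleting a temporary file in it.
   */
  protected Boolean checkFileAvailable(File file) {
    File dir = file.getParentFile();
    if (dir == null || !dir.isDirectory()) {
      return false;
    }
    try {
      // A successful create-and-delete round trip is a direct writability check.
      File probe = File.createTempFile("spark_shuffle_probe", ".tmp", dir);
      return probe.delete();
    } catch (IOException e) {
      return false;
    }
  }
}
```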
// If the recovery path is unavailable, do not use it any more.
I think the recovery path is either set by the user or falls back to the YARN default; the user should make sure this directory is available, and YARN relies on it internally. It doesn't make sense to change to another disk if the recovery path is unavailable.
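For context, the NodeManager recovery directory mentioned here is driven by YARN configuration; a small illustrative probe (the wrapper class is hypothetical, and the configuration keys are the standard YARN ones to the best of my knowledge):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

class RecoveryConfigProbe {
  /** Reports whether NM recovery is enabled and which directory it uses. */
  static String describe(Configuration conf) {
    // yarn.nodemanager.recovery.enabled (defaults to false) and
    // yarn.nodemanager.recovery.dir are operator-controlled settings.
    boolean enabled = conf.getBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, false);
    String dir = conf.get(YarnConfiguration.NM_RECOVERY_DIR);
    return "NM recovery enabled=" + enabled + ", dir=" + dir;
  }
}
```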
}
}
}
If _recoveryPath is still null, I think we should throw an exception here, since none of the disks is good.
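A minimal sketch of that suggestion, with the _recoveryPath field taken from the comment and everything else assumed for illustration:

```java
import java.io.File;
import java.io.IOException;

class RecoveryPathChooser {
  private File _recoveryPath;  // field name taken from the review comment

  void chooseRecoveryPath(File[] localDirs) throws IOException {
    for (File dir : localDirs) {
      if (dir.isDirectory() && dir.canWrite()) {
        _recoveryPath = dir;
        return;
      }
    }
    // _recoveryPath is still null here: none of the disks is good,
    // so fail fast instead of silently starting without a recovery DB.
    throw new IOException("No healthy local directory available for the shuffle recovery DB");
  }
}
```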
@jerryshao Thanks for your replies! I will do the following then:
@LiShuMing any update on this?
Sorry, I have been busy recently; I will update it today...
ping @jerryshao I found a method to check disks in Hadoop: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DiskChecker.java#L111. I have added a unit test; can you help me review my code?
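For reference, a small sketch of how Hadoop's DiskChecker could be applied to validate a candidate directory; the wrapper class here is illustrative:

```java
import java.io.File;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

class RecoveryDirValidator {
  /** Returns true if the directory passes DiskChecker#checkDir, false otherwise. */
  static boolean isUsable(File dir) {
    try {
      DiskChecker.checkDir(dir);
      return true;
    } catch (DiskErrorException e) {
      return false;
    }
  }
}
```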
I have two questions about the fix:
CC @tgravescs to review.
The recovery path returned by YARN is supposed to be reliable, and if it isn't working then the NM itself shouldn't run. So in general you should just use that if you want Spark to be able to recover. If you don't have YARN recovery enabled, then there is no need for us to write the DBs at all, and I think we should change to not do that. I think this JIRA is a duplicate of https://issues.apache.org/jira/browse/SPARK-17321. See my comments there.
See another approach to solve this problem: #19032. I will close this PR. Thanks @jerryshao @tgravescs.
What changes were proposed in this pull request?
See SPARK-21660. This PR adds a simple strategy to validate that the chosen disk is writable, to avoid picking a read-only disk.
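Pulling the pieces above together, a hedged sketch of the validate-then-fall-back strategy described here; all names are illustrative and this is not the PR's actual code:

```java
import java.io.File;
import java.io.IOException;

class DbLocationStrategy {
  /** Returns a directory that accepted a write probe, or throws if none did. */
  static File chooseWritableDir(File preferred, File[] alternatives) throws IOException {
    if (isWritable(preferred)) {
      return preferred;
    }
    for (File dir : alternatives) {
      if (isWritable(dir)) {
        return dir;
      }
    }
    throw new IOException("No writable directory available for the shuffle service DB");
  }

  private static boolean isWritable(File dir) {
    try {
      // Creating and deleting a temp file fails on a read-only or missing mount.
      File probe = File.createTempFile("probe", ".tmp", dir);
      return probe.delete();
    } catch (IOException e) {
      return false;
    }
  }
}
```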
How was this patch tested?
How to mock a corrupted disk?
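One possible way to simulate this in a unit test, assuming a read-only directory is an acceptable stand-in for a corrupted or unmounted disk (the fixture class is illustrative):

```java
import java.io.File;
import java.nio.file.Files;

class ReadOnlyDirFixture {
  /** Creates a temp directory and strips write permission to mimic a read-only disk. */
  static File createReadOnlyDir() throws Exception {
    File dir = Files.createTempDirectory("bad-local-dir").toFile();
    // Note: removing write permission may be a no-op when the test runs as root.
    if (!dir.setWritable(false, false)) {
      throw new IllegalStateException("Could not make " + dir + " read-only");
    }
    return dir;
  }
}
```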
Before this PR, when we start the NodeManager, we see the exception below:
After this PR: