[SPARK-26689][CORE]Support blacklisting bad disk directory and retry in DiskBlockManager #23614
Conversation
cc @srowen @vanzin @HyukjinKwon @squito @dongjoon-hyun
Did you have blacklisting turned on when you ran into your failure? That seems like a more robust solution.
@squito Yes, it's turned on. I think the exception I attached is not handled by the application-level blacklist (it only covers task failures); this is a driver-side exception, raised even before …
Oh I see, this is handling a bad disk on the driver, not the executors. A bad disk on the executors should be handled by blacklisting. In general, Spark's fault tolerance for the driver is extremely poor; it's a single point of failure for lots of reasons. Still, this may be a small fix that improves things somewhat, without any real guarantees. There are a lot of cases here to think through carefully. E.g., as noted in the comments, this has to be kept in sync with ExternalShuffleBlockResolver#getFile. Even if you put the same logic in both places, it's still possible they'll end up with different views of the actual bad directories. You probably also want to update getAllFiles() to respect badDirs. And when a dir with lots of data in it is suddenly marked as bad, is all the other internal state updated reasonably (maybe not immediately, but in a way that makes sense when it is updated)? I'm just thinking aloud about this change ... still on the fence about whether it's good or not.
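For context on why the two lookups must stay in sync: both sides map a filename to a local directory purely by hashing the name, roughly like the sketch below (simplified, with illustrative names; not the actual Spark source). If only one side skips blacklisted directories, the same block name resolves to different paths on the executor and on the external shuffle service.

```scala
import java.io.File

// Simplified sketch of the hash-based lookup that DiskBlockManager.getFile and
// ExternalShuffleBlockResolver#getFile both rely on (illustrative only).
def hashedFile(localDirs: Array[File], subDirsPerLocalDir: Int, filename: String): File = {
  val hash = filename.hashCode & Int.MaxValue                    // stand-in for Utils.nonNegativeHash
  val dirId = hash % localDirs.length                            // which configured local dir
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir  // which sub-directory
  new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
}
```

Dropping a directory from localDirs on one side changes localDirs.length, and with it every subsequent mapping, which is exactly the divergence being warned about here.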
var mostRecentFailure: Exception = null
// Update blacklist
val now = System.currentTimeMillis()
badDirs.dropWhile(now > dirToBlacklistExpiryTime(_))
This method needs to be thread-safe, so badDirs and dirToBlacklistExpiryTime have to be protected, or that has to be handled somehow.
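For illustration, one way to make that state safe for concurrent callers is to guard it behind a small synchronized helper. This is only a sketch with hypothetical names, not code from the PR.

```scala
import scala.collection.mutable

// Hypothetical helper guarding the blacklist state; every access goes through
// `synchronized`, so concurrent getFile callers see a consistent view.
class DirBlacklist(expiryMs: Long) {
  private val dirToExpiry = mutable.HashMap.empty[String, Long]

  def markBad(dir: String, now: Long): Unit = synchronized {
    dirToExpiry(dir) = now + expiryMs
  }

  // Drop expired entries and return the directories still considered bad.
  def currentlyBad(now: Long): Set[String] = synchronized {
    dirToExpiry.filter { case (_, expiry) => expiry <= now }.keys.foreach(dirToExpiry.remove)
    dirToExpiry.keySet.toSet
  }
}
```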
@squito Yeah, it should be thread-safe, but is there any case where the getFile method would be called concurrently?
I think it's only called by DiskStore and IndexShuffleBlockResolver, and they seem to use this method synchronously. Is that right?
Oh, I see, we may have this method called concurrently when running multi-core applications, is that right?
Could …
@attilapiros Oh, I see.
attilapiros left a comment:
I suggest a somewhat different test, as the current test's chance of flakiness is quite high.
What about testing all the cases here, both when rootDir0 is good and rootDir1 is bad, and vice versa?
With the introduction of Clock and the use of a ManualClock in the test you can do that based on the expiry feature (even without creating a new DiskBlockManager, which would otherwise be needed because badDirs cannot be cleaned without expiry).
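To make the suggestion concrete, a deterministic test along these lines might look like the sketch below. It reuses the hypothetical DirBlacklist helper sketched earlier and assumes the PR allows a Clock to be injected; none of these names exist in Spark today.

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.util.ManualClock

class DirBlacklistSuite extends SparkFunSuite {
  test("blacklisted dir becomes usable again after expiry") {
    val clock = new ManualClock(0L)
    val blacklist = new DirBlacklist(expiryMs = 60 * 1000)   // hypothetical helper

    blacklist.markBad("rootDir0", clock.getTimeMillis())
    assert(blacklist.currentlyBad(clock.getTimeMillis()) === Set("rootDir0"))

    clock.advance(61 * 1000)                                 // no sleeping, no flakiness
    assert(blacklist.currentlyBad(clock.getTimeMillis()).isEmpty)
  }
}
```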
test("test blacklisting bad disk directory") {
  val blockId = new TestBlockId("1")
  val hash = Utils.nonNegativeHash(blockId.name)
  val (badDiskDir, goodDiskDir) = if (hash % 2 == 0) {
I guess the purpose of this randomisation for choosing goodDiskDir/badDiskDir from rootDirs is to test, from run to run, the case where the good dir is not the one chosen first by diskBlockManager.
I would strongly suggest avoiding this: sometimes the feature ends up not being tested at all, and an introduced bug might only be detected while testing an unrelated change.
@attilapiros Good suggestion. I have already updated the UT and also fixed a bug: when getFile was called for a block written before the disk corruption, it would return a different result.
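The fix described (keeping lookups of blocks written before the corruption stable) could take a shape like the sketch below. This is a hypothetical illustration, not the PR's diff: check the originally hashed location first and only fall back when the file is not already there.

```scala
import java.io.File

// Hypothetical: resolve a pre-existing block at its original hashed location before
// applying any blacklist-aware fallback, so reads stay stable after a dir goes bad.
def resolveExisting(originalLocation: File, fallback: => File): File = {
  if (originalLocation.exists()) originalLocation else fallback
}
```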
One of the comments above says this is for the driver, but this is going to affect both executors and the driver. I'm a bit on the fence about the change as well, but at this point I think if Spark can be made more robust we should do it. Note the configs would need to be documented, and I agree with @squito that we would need to look closer at all the cases. This is in getFile, so it doesn't handle failures after you initially get the file, but it's still an improvement. I think we should also make sure to log enough info so someone debugging can easily see things were blacklisted, perhaps printing the number of disks left after blacklisting.
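As an illustration of the kind of logging being asked for, a message like the sketch below would make the remaining capacity visible to someone debugging a job. It assumes code inside Spark core (so org.apache.spark.internal.Logging is available); the helper and its parameters are hypothetical, not part of the PR.

```scala
import java.io.File
import org.apache.spark.internal.Logging

// Hypothetical helper: log enough context when a dir is blacklisted that a user
// can see how many local dirs are still usable and for how long the dir is excluded.
object BlacklistLogging extends Logging {
  def logBlacklisted(dir: File, totalDirs: Int, badDirs: Int, expiryMs: Long): Unit = {
    logWarning(s"Blacklisting local dir ${dir.getAbsolutePath} after a write failure; " +
      s"${totalDirs - badDirs} of $totalDirs local dirs remain usable for the next $expiryMs ms.")
  }
}
```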
@attilapiros @squito Any more suggestions?
I'm sorry I haven't been able to look at this closely, but I do want to express my reservations about this. I'm worried about what will happen with the external shuffle service when something appears to go wrong in the middle of an application, especially if it's a temporary issue. I also want to make sure the memory use of the … Anyway, I'm not blocking this, I just want to make sure this is thought through carefully before it's merged.
@squito I understand your concern, and I think you are right: there might be some risk of inconsistent behavior with the external shuffle service. I need to think carefully about how to solve that problem.
Can one of the admins verify this patch?
What changes were proposed in this pull request?
We encountered an application failure in our production cluster that was caused by a single broken disk.
We have configured yarn.nodemanager.local-dirs with multiple directories mounted on multiple disks; however, the application still failed because of a single broken disk. I think this is because Spark does not currently handle a broken disk in DiskBlockManager, even though multiple disk directories are almost always configured in a production environment. We could probably work around this error by enlarging settings like spark.yarn.maxAppAttempts, but other apps may also suffer the same problem repeatedly, which is annoying and unstable. Also, for large jobs the retry brings downsides; one major downside is that it delays application completion (the extra recomputation when retrying a ShuffleMapTask, plus some extra scheduling delay). Actually, we can handle this case by adding only a small amount of code, since localDirs are already mounted on multiple disks in most production environments.
This PR adds support for blacklisting a bad disk directory and switching to a good directory for writing, which improves robustness. A minimal sketch of the idea follows below.
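The sketch below illustrates the write-path idea only; it is not the PR's actual diff, and the isBad/markBad hooks are hypothetical. It skips directories currently blacklisted and blacklists a directory whose sub-directory cannot be created, falling through to the next candidate.

```scala
import java.io.{File, IOException}

// Sketch only: pick the first usable local dir for a new file, blacklisting dirs
// whose sub-directories cannot be created.
def getWritableFile(
    localDirs: Seq[File],
    isBad: File => Boolean,
    markBad: File => Unit,
    filename: String): File = {
  localDirs.filterNot(isBad).foreach { dir =>
    val sub = new File(dir, "blockmgr-sub")   // placeholder for the hashed sub-directory
    if (sub.isDirectory || sub.mkdirs()) {
      return new File(sub, filename)
    }
    markBad(dir)                              // creation failed: treat this dir as bad
  }
  throw new IOException(s"Failed to find a usable local dir for $filename")
}
```

The caveat raised in the review applies to any such fallback: once a file can land in a directory other than the one its name hashes to, every reader (including the external shuffle service) has to agree on the same fallback order.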
How was this patch tested?
UT
Please review http://spark.apache.org/contributing.html before opening a pull request.