[SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo #34043
This seems to happen if the MapOutputTracker uses broadcast when sending decommission statuses.
…ateBlockInfo Delegate task to threadpool and register callback for successful completion. Reply to caller once future finished successfully. To avoid java.util.ConcurrentModificationException we have to protect the blockLocations using locks.
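The locking mentioned in this commit message could look roughly like the following. This is only a minimal sketch: `GuardedLocations` and its methods are hypothetical names for illustration, not fields or methods of `BlockManagerMasterEndpoint`, and the later revisions discussed in this thread drop the extra locks again.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

// Hypothetical stand-in for a block-location map shared between threads.
object GuardedLocations {
  private val lock = new ReentrantReadWriteLock()
  private val locations = mutable.HashMap.empty[String, mutable.HashSet[String]]

  // Writers take the exclusive lock, so no reader sees a half-updated map.
  def add(blockId: String, executor: String): Unit = {
    lock.writeLock().lock()
    try {
      locations.getOrElseUpdate(blockId, mutable.HashSet.empty[String]) += executor
      ()
    } finally {
      lock.writeLock().unlock()
    }
  }

  // Readers share the lock and return an immutable copy of the entry.
  def get(blockId: String): Set[String] = {
    lock.readLock().lock()
    try {
      locations.get(blockId) match {
        case Some(execs) => execs.toSet
        case None => Set.empty[String]
      }
    } finally {
      lock.readLock().unlock()
    }
  }
}
```

Returning a copy from `get` means callers never iterate over the mutable set while another thread modifies it, which is the situation that raises `ConcurrentModificationException`.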
ok to test
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143473 has finished for PR 34043 at commit
This is going to make analysis/evolution of the code more complex, so I want to understand better why we are doing this. What is the current issue/bottleneck, and how much of it is mitigated by this change?
The goal is to fix the deadlock that is described in detail in SPARK-36782. In our testing this deadlock is common enough that we cannot use the decommission features introduced in 3.1.1: when the deadlock happens, the driver does not seem to recover and the Spark job ends up failing. The patch should remove the possibility of this deadlock by making the
Thanks for the details @eejbyfeldt, this makes sense ... this is an unfortunate side effect.
Can we add a test which surfaces this issue?
Thanks for the jira and stack trace - that was really helpful ! Given there is no state mod for shuffle blocks when I am still testing locally, and making sure there are no issues - but the gist is:
For shuffle blocks, it will should executed outside of the Thoughts ? |
Good point! So I think we can delegate the |
Thanks @mridulm. This was very helpful. I agree it would be good to have a different architectural approach than the extra locks presented here. I tried out your suggestions in 306fa17 and my tests at least pass fine. Now the change is much more localized, with no additional locking. I don't know whether the repo's standard procedure is to close this PR and open a new one with your suggestions or to adapt the existing one. I just pushed my changes here assuming the latter, but let me know in case you prefer to open a new PR. With the proposed changes I suppose we don't actually delegate it to the
Kubernetes integration test starting
@f-thiele We can get the
Kubernetes integration test status failure
@f-thiele @eejbyfeldt thanks for the fix.
Test build #143506 has finished for PR 34043 at commit
mridulm
left a comment
When I was trying it out, I moved the entire blockId match into the Future - but this is equivalent.
Thanks for fixing it quickly @f-thiele !
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
@Ngone51 I was initially thinking along same lines - to move the shuffle map updates to Thoughts on this PR ? |
|
I do not think we have to delegate into |
|
Even the existing code shows this as when we talk about a shuffle block in the spark/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala Lines 582 to 598 in 26b3b11
This way (not delegating into
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143508 has finished for PR 34043 at commit
@attilapiros I have not followed node decommissioning discussions in detail, hopefully @holdenk or @dongjoon-hyun can comment more here. When looking at the affected codepaths, IMO handling the shuffle vs non-shuffle split while processing |
I think delegating to DAGScheduler is an alternative way to decouple BlockManagerMasterEndpoint and MapOutputTracker, which should also work in this case. The current fix looks good to me after removing the read/write lock. I'll approve it to catch up 3.2 cut.
I actually think it's cleaner to spread it out at the caller. It's confusing that shuffle blocks are handled specially compared to other blocks. Ideally, I think we should send the RPC msg (e.g., having a new message called
if (isSuccess) {
  listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
}
context.reply(isSuccess)
This causes the responses for non-shuffle blocks to be handled in the thread pool as well. I'm afraid this introduces unexpected overhead. Shall we do this only for shuffle blocks and leave the behavior for non-shuffle blocks as it is?
Did not realize this - thanks for pointing it out !
So if I understood it right, the proposal is:
def handleResult(success: Boolean): Unit = {
  if (success) {
    // post
  }
  context.reply(success)
}
if (blockId.isShuffle) {
  updateShuffleBlockInfo( ... ).foreach( handleResult(_))
} else {
  handleResult(updateBlockInfo( ... ))
}
?
Yes!
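The proposal confirmed here can be tried out in isolation. Below is a minimal, self-contained sketch of the same shape; `SplitHandlingSketch`, its two update helpers, and the `post`/`reply` callbacks are hypothetical stand-ins for `listenerBus.post` and `context.reply`, and `ExecutionContext.global` is used only to keep the example short where the real code would use a dedicated pool.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object SplitHandlingSketch {
  // Assumption: stand-in for the endpoint's dedicated thread pool.
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical helpers for the two update paths.
  def updateShuffleBlockInfo(id: String): Future[Boolean] = Future(true)
  def updateBlockInfo(id: String): Boolean = true

  def handle(id: String, isShuffle: Boolean)(post: () => Unit, reply: Boolean => Unit): Unit = {
    // Shared tail: notify listeners on success, then always reply.
    def handleResult(success: Boolean): Unit = {
      if (success) post()
      reply(success)
    }
    if (isShuffle) {
      // Shuffle blocks: the result is handled on the pool once the future succeeds.
      updateShuffleBlockInfo(id).foreach(handleResult)
    } else {
      // Other blocks: handled synchronously on the calling thread.
      handleResult(updateBlockInfo(id))
    }
  }
}
```

Note that `Future.foreach` only runs its callback when the future completes successfully, so in this sketch a failed shuffle update would never reply to the caller.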
Given @gengliangwang has merged it, can you create a follow-up PR? We can merge it pretty quickly and possibly get it into the current 3.2 RC as well :)
Sure!
Merging to master/3.2/3.1
…ng UpdateBlockInfo

### What changes were proposed in this pull request?

Delegate potentially blocking call to `mapOutputTracker.updateMapOutput` from within `UpdateBlockInfo` from `dispatcher-BlockManagerMaster` to the threadpool to avoid blocking the endpoint. This code path is only accessed for `ShuffleIndexBlockId`, other blocks are still executed on the `dispatcher-BlockManagerMaster` itself.

Change `updateBlockInfo` to return `Future[Boolean]` instead of `Boolean`. Response will be sent to RPC caller upon successful completion of the future.

Introduce a unit test that forces `MapOutputTracker` to make a broadcast as part of `MapOutputTracker.serializeOutputStatuses` when running decommission tests.

### Why are the changes needed?

[SPARK-36782](https://issues.apache.org/jira/browse/SPARK-36782) describes a deadlock occurring if the `dispatcher-BlockManagerMaster` is allowed to block while waiting for write access to data structures.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test as introduced in this PR.

---

Ping eejbyfeldt for notice.

Closes #34043 from f-thiele/SPARK-36782.

Lead-authored-by: Fabian A.J. Thiele <[email protected]>
Co-authored-by: Emil Ejbyfeldt <[email protected]>
Co-authored-by: Fabian A.J. Thiele <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 4ea54e8)
Signed-off-by: Gengliang Wang <[email protected]>
FYI: followup PR: #34076
…thread pool ### What changes were proposed in this pull request? This's a follow-up of #34043. This PR proposes to only handle shuffle blocks in the separate thread pool and leave other blocks the same behavior as it is. ### Why are the changes needed? To avoid any potential overhead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existing tests. Closes #34076 from Ngone51/spark-36782-follow-up. Authored-by: yi.wu <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
…thread pool ### What changes were proposed in this pull request? This's a follow-up of #34043. This PR proposes to only handle shuffle blocks in the separate thread pool and leave other blocks the same behavior as it is. ### Why are the changes needed? To avoid any potential overhead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existing tests. Closes #34076 from Ngone51/spark-36782-follow-up. Authored-by: yi.wu <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit 9d8ac7c) Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?

Delegate potentially blocking call to `mapOutputTracker.updateMapOutput` from within `UpdateBlockInfo` from `dispatcher-BlockManagerMaster` to the threadpool to avoid blocking the endpoint. This code path is only accessed for `ShuffleIndexBlockId`, other blocks are still executed on the `dispatcher-BlockManagerMaster` itself.

Change `updateBlockInfo` to return `Future[Boolean]` instead of `Boolean`. Response will be sent to RPC caller upon successful completion of the future.

Introduce a unit test that forces `MapOutputTracker` to make a broadcast as part of `MapOutputTracker.serializeOutputStatuses` when running decommission tests.

### Why are the changes needed?

SPARK-36782 describes a deadlock occurring if the `dispatcher-BlockManagerMaster` is allowed to block while waiting for write access to data structures.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test as introduced in this PR.

Ping @eejbyfeldt for notice.
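
To illustrate the idea of returning a `Future[Boolean]` and replying only once it completes, here is a minimal sketch outside of Spark. Everything in it is an assumption made for illustration: `AsyncUpdateSketch`, `slowShuffleUpdate`, `handle`, and `askAndWait` are hypothetical names, and the `Thread.sleep` merely simulates a call that may block.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object AsyncUpdateSketch {
  // Daemon threads so that the sketch never keeps the JVM alive.
  private val daemonThreads: ThreadFactory = (r: Runnable) => {
    val t = new Thread(r, "sketch-ask-pool")
    t.setDaemon(true)
    t
  }
  // Separate pool: slow work runs here instead of on the handler thread.
  private val pool = Executors.newFixedThreadPool(2, daemonThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  // Simulates a potentially blocking update.
  def slowShuffleUpdate(blockName: String): Boolean = {
    Thread.sleep(50)
    blockName.startsWith("shuffle_")
  }

  // Returns immediately with a future; the work itself runs on the pool.
  def updateBlockInfo(blockName: String): Future[Boolean] =
    Future(slowShuffleUpdate(blockName))

  // Mimics an RPC handler: replies once the future completes successfully.
  def handle(blockName: String)(reply: Boolean => Unit): Unit =
    updateBlockInfo(blockName).foreach(reply)

  // Caller side: only the caller waits, never the handler thread.
  def askAndWait(blockName: String): Boolean = {
    val p = Promise[Boolean]()
    handle(blockName)(b => p.success(b))
    Await.result(p.future, 5.seconds)
  }
}
```

In this sketch `handle` returns as soon as the task is submitted, so the thread that received the message stays free to process further messages while the update is still running.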