[SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset #15481
Conversation
Test build #66957 has finished for PR 15481 at commit
@zsxwing or @andrewor14 might know best on this one.
Would be cleaner to simply copy executorDataMap.keys and work off that, to ensure there is no coupling between the actor thread and the invoker.
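For illustration, here is a rough reshaping of the `reset()` shown in the PR description along the lines of that suggestion. It is only a sketch of the idea, not the snippet Mridul actually posted:

```
// Sketch only: hold the backend lock just long enough to snapshot the executor
// ids, then issue the (blocking) RemoveExecutor asks outside the lock so the
// DriverEndpoint thread can acquire it while handling each message.
protected def reset(): Unit = {
  val staleExecutors = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()
    executorDataMap.keys.toSeq // copy the keys while holding the lock
  }
  // Lock released here, so removeExecutor's synchronized block cannot deadlock with us.
  staleExecutors.foreach { eid =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
```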
LGTM, sorry for bringing in the deadlock issue.
Test build #67054 has finished for PR 15481 at commit
retest this please.
Test build #67067 has finished for PR 15481 at commit
retest this please.
Test build #67105 has finished for PR 15481 at commit
zsxwing left a comment:
reset is called in YarnSchedulerEndpoint, where it ideally should not be a blocking action.
@jerryshao Can we just make it fire and forget?
```
// Note: here copy the code of remove executor from DriverEndpoint to avoid deadlock(reset
// and removeExecutor both to get the lock of schedulerbackend.)
val reason = SlaveLost("Stale executor after cluster manager re-registered.")
executorDataMap.toMap.foreach { case (executorId, executorInfo) =>
```
executorDataMap should not be modified outside DriverEndpoint. See the comment for executorDataMap.
```
@volatile protected var currentExecutorIdCounter = 0

// Executors that have been lost, but for which we don't yet know the real exit reason.
protected val executorsPendingLossReason = new HashSet[String]
```
These were kept in DriverEndpoint to make sure they won't be accessed outside DriverEndpoint, since they are not thread-safe. This change breaks that.
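For context, here is a minimal, self-contained sketch (not Spark code; the class and message names are made up) of the confinement convention being described: the collections are safe without locks only because a single message-processing thread ever touches them.

```
import scala.collection.mutable.HashSet

// Toy endpoint: all mutable, non-thread-safe state stays private and is only
// mutated from receive, which a real RpcEnv invokes from one thread at a time.
class ConfinedEndpoint {
  private val executorsPendingLossReason = new HashSet[String]

  def receive: PartialFunction[Any, Unit] = {
    case ("executor-lost", executorId: String) =>
      executorsPendingLossReason += executorId    // safe: endpoint thread only
    case ("loss-reason-known", executorId: String) =>
      executorsPendingLossReason -= executorId
  }
}
```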
@scwf I think the initial fix with a small change might be sufficient. This will keep the synchronized block restricted to copying the executor keys, and leave the rest as-is, removing the deadlock. Thoughts?
I'm wondering if we can also fix this.
@zsxwing Ah, then simply making it send() instead of askWithRetry() should do, no?
Seems it could be changed to
OK, I will revert to the initial commit.
This reverts commit 2997ccb.
```
   // because (1) disconnected event is not yet received; (2) executors die silently.
   executorDataMap.toMap.foreach { case (eid, _) =>
-    driverEndpoint.askWithRetry[Boolean](
+    driverEndpoint.send(
```
When changed to send, we don't do retry anymore. It may become less tolerant.
This will only be invoked in YARN client mode when the AM has failed and some lingering executors still exist; as far as I know, such a situation does not normally happen. So I think it should be OK to call send.
This call is actually just sending the message to the same process, so retry is not needed.
Also, IIUC driverEndpoint is an in-process endpoint, so it should be safe.
This is synchronizing on the entire method, which is not required - we need to restrict the synchronized block to the subset that actually requires it: the copying of the executors to remove.
The snippet I posted earlier does the same - can you please modify it accordingly?
If we had coded it that way initially, this bug wouldn't have existed to begin with (with or without send/askWithRetry).
Once we change this to send, don't we need to move the handling of RemoveExecutor from receiveAndReply to receive in DriverEndpoint?
Good catch. I think here we can just call removeExecutor(...).
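For background, here is a simplified, self-contained model (not the real DriverEndpoint; in Spark the reply actually goes through an RpcCallContext) of the dispatch rule this question hinges on: messages delivered with send are handled by receive, while messages delivered with ask/askWithRetry are handled by receiveAndReply, which must produce a reply.

```
// Toy model of a Spark RpcEndpoint's two handlers.
case class RemoveExecutor(executorId: String, reason: String)

class EndpointModel {
  private val executors = scala.collection.mutable.HashSet("exec-1", "exec-2")

  // Matched against messages delivered with endpointRef.send(msg): fire-and-forget.
  def receive: PartialFunction[Any, Unit] = {
    case RemoveExecutor(id, _) => executors -= id
  }

  // Matched against messages delivered with ask / askWithRetry: the caller blocks
  // (or waits on a Future) until a reply is produced.
  def receiveAndReply: PartialFunction[Any, Any] = {
    case RemoveExecutor(id, _) =>
      executors -= id
      true // value returned to the asker
  }
}
```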
Test build #67155 has finished for PR 15481 at commit
Updated, can you review again?
LGTM, @zsxwing any comments?
BTW, it was interesting that the earlier change did not trigger a test failure (the issue @viirya pointed out - about needing to move RemoveExecutor to receive).
Test build #67173 has finished for PR 15481 at commit
```
  }

case RemoveExecutor(executorId, reason) =>
  removeExecutor(executorId, reason)
```
Why remove `executorDataMap.get(executorId).foreach(_.executorEndpoint.send(StopExecutor))`?
Sorry that my comment was unclear. I meant we can just do the following changes:

```
protected def reset(): Unit = synchronized {
  numPendingExecutors = 0
  executorsPendingToRemove.clear()

  // Remove all the lingering executors that should be removed but not yet. The reason might be
  // because (1) disconnected event is not yet received; (2) executors die silently.
  executorDataMap.toMap.foreach { case (eid, _) =>
    removeExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))
  }
}
```
@mridulm How about just the following changes? I agreed that
@zsxwing I think the issue is that case RemoveExecutor() is not identical to what exists in receiveAndReply - which it should be. If we add that in there, will it not be sufficient?
I meant
Ah! Apologies, I got confused. Yes, I agree, that is a better approach. It also means we can get rid of the RemoveExecutor pattern match from receive, right? As it stands now, that looks buggy.

yep
Test build #67312 has finished for PR 15481 at commit
retest this please
Test build #67322 has finished for PR 15481 at commit
LGTM now.
LGTM. Merging to master and 2.0. Thanks!
…eset

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-17929

Now `CoarseGrainedSchedulerBackend` reset will get the lock,

```
protected def reset(): Unit = synchronized {
  numPendingExecutors = 0
  executorsPendingToRemove.clear()

  // Remove all the lingering executors that should be removed but not yet. The reason might be
  // because (1) disconnected event is not yet received; (2) executors die silently.
  executorDataMap.toMap.foreach { case (eid, _) =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
```

but removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized", and this will cause deadlock.

```
private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
  logDebug(s"Asked to remove executor $executorId with reason $reason")
  executorDataMap.get(executorId) match {
    case Some(executorInfo) =>
      // This must be synchronized because variables mutated
      // in this block are read when requesting executors
      val killed = CoarseGrainedSchedulerBackend.this.synchronized {
        addressToExecutorId -= executorInfo.executorAddress
        executorDataMap -= executorId
        executorsPendingLossReason -= executorId
        executorsPendingToRemove.remove(executorId).getOrElse(false)
      }
      ...
```

## How was this patch tested?

manual test.

Author: w00228970 <[email protected]>

Closes #15481 from scwf/spark-17929.

(cherry picked from commit c1f344f)

Signed-off-by: Shixiong Zhu <[email protected]>
What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-17929

Now `CoarseGrainedSchedulerBackend` reset will get the lock, but removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized", and this will cause a deadlock (the full code is quoted in the commit message above).
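To make the failure mode concrete, here is a minimal, self-contained sketch (not Spark code; every name here is invented) of the pattern: one thread holds a monitor while blocking on a reply, and the only thread that can produce the reply needs that same monitor. A timeout is used so the example terminates instead of hanging forever.

```
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for the backend's locking structure.
object DeadlockSketch {
  private val lock = new Object                                    // plays the role of the backend's monitor
  private val handlerThread = Executors.newSingleThreadExecutor()  // plays the role of DriverEndpoint

  // Like the original reset(): holds the lock while waiting for the handler's reply.
  def reset(): Unit = lock.synchronized {
    val reply = Promise[Boolean]()
    handlerThread.submit(new Runnable {
      override def run(): Unit = removeExecutor(reply)             // "RemoveExecutor" message
    })
    // Blocks: removeExecutor can never enter lock.synchronized while we hold the lock.
    Await.result(reply.future, 5.seconds)
  }

  // Like the original removeExecutor(): needs the same lock to mutate shared state.
  private def removeExecutor(reply: Promise[Boolean]): Unit = {
    val killed = lock.synchronized { /* mutate executor maps */ false }
    reply.success(killed)
  }

  def main(args: Array[String]): Unit = {
    try reset()
    catch { case _: java.util.concurrent.TimeoutException => println("deadlocked until timeout") }
    finally { handlerThread.shutdownNow(); handlerThread.awaitTermination(1, TimeUnit.SECONDS) }
  }
}
```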