[SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset #15481
Conversation
Test build #66957 has finished for PR 15481 at commit
@zsxwing or @andrewor14 might know best on this one.
Would be cleaner to simply copy executorDataMap.keys and work off that, to ensure there is no coupling between the actor thread and the invoker.
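For illustration, here is a rough reshaping of the `reset()` shown in the PR description along the lines of that suggestion. It is only a sketch of the idea, not the snippet Mridul actually posted:

```
// Sketch only: hold the backend lock just long enough to snapshot the executor
// ids, then issue the (blocking) RemoveExecutor asks outside the lock so the
// DriverEndpoint thread can acquire it while handling each message.
protected def reset(): Unit = {
  val staleExecutors = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()
    executorDataMap.keys.toSeq // copy the keys while holding the lock
  }
  // Lock released here, so removeExecutor's synchronized block cannot deadlock with us.
  staleExecutors.foreach { eid =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
```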
LGTM, sorry for bringing in the deadlock issue.
Test build #67054 has finished for PR 15481 at commit
retest this please.
Test build #67067 has finished for PR 15481 at commit
retest this please.
Test build #67105 has finished for PR 15481 at commit
zsxwing left a comment:
reset is called in YarnSchedulerEndpoint, where it ideally should not be a blocking action.
@jerryshao Can we just make it fire and forget?
```
// Note: here copy the code of remove executor from DriverEndpoint to avoid deadlock(reset
// and removeExecutor both to get the lock of schedulerbackend.)
val reason = SlaveLost("Stale executor after cluster manager re-registered.")
executorDataMap.toMap.foreach { case (executorId, executorInfo) =>
```
executorDataMap should not be modified outside DriverEndpoint. See the comment for executorDataMap.
```
@volatile protected var currentExecutorIdCounter = 0

// Executors that have been lost, but for which we don't yet know the real exit reason.
protected val executorsPendingLossReason = new HashSet[String]
```
These were kept in DriverEndpoint to make sure they won't be accessed outside DriverEndpoint, since they are not thread-safe. This change breaks that.
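For context, here is a minimal, self-contained sketch (not Spark code; the class and message names are made up) of the confinement convention being described: the collections are safe without locks only because a single message-processing thread ever touches them.

```
import scala.collection.mutable.HashSet

// Toy endpoint: all mutable, non-thread-safe state stays private and is only
// mutated from receive, which a real RpcEnv invokes from one thread at a time.
class ConfinedEndpoint {
  private val executorsPendingLossReason = new HashSet[String]

  def receive: PartialFunction[Any, Unit] = {
    case ("executor-lost", executorId: String) =>
      executorsPendingLossReason += executorId    // safe: endpoint thread only
    case ("loss-reason-known", executorId: String) =>
      executorsPendingLossReason -= executorId
  }
}
```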
@scwf I think the initial fix with a small change might be sufficient. This will keep the synchronized block restricted to copying the executor keys, and leave the rest as-is, removing the deadlock. Thoughts?
I'm wondering if we can also fix this.
@zsxwing Ah, then simply making it send() instead of askWithRetry() should do, no?
Seems it could be changed to
OK, I will revert to the initial commit.
This reverts commit 2997ccb.
```
   // because (1) disconnected event is not yet received; (2) executors die silently.
   executorDataMap.toMap.foreach { case (eid, _) =>
-    driverEndpoint.askWithRetry[Boolean](
+    driverEndpoint.send(
```
When changed to send, we don't do retry anymore. It may become less tolerant.
This will only be invoked in YARN client mode when the AM has failed and some lingering executors still exist; as far as I know, such a situation does not normally happen. So I think it should be OK to call send.
This call is actually just sending the message to the same process, so retry is not needed.
Also, IIUC driverEndpoint is an in-process endpoint, so it should be safe.
This is synchronizing on the entire method, which is not required - we need to restrict the synchronized block to the subset that actually requires it: the copying of the executors to remove.
The snippet I posted earlier does the same - can you please modify it accordingly?
If we had coded it that way initially, this bug wouldn't have existed to begin with (with or without send/askWithRetry).
Once we change this to send, don't we need to move the handling of RemoveExecutor from receiveAndReply to receive in DriverEndpoint?
Good catch. I think here we can just call removeExecutor(...).
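For background, here is a simplified, self-contained model (not the real DriverEndpoint; in Spark the reply actually goes through an RpcCallContext) of the dispatch rule this question hinges on: messages delivered with send are handled by receive, while messages delivered with ask/askWithRetry are handled by receiveAndReply, which must produce a reply.

```
// Toy model of a Spark RpcEndpoint's two handlers.
case class RemoveExecutor(executorId: String, reason: String)

class EndpointModel {
  private val executors = scala.collection.mutable.HashSet("exec-1", "exec-2")

  // Matched against messages delivered with endpointRef.send(msg): fire-and-forget.
  def receive: PartialFunction[Any, Unit] = {
    case RemoveExecutor(id, _) => executors -= id
  }

  // Matched against messages delivered with ask / askWithRetry: the caller blocks
  // (or waits on a Future) until a reply is produced.
  def receiveAndReply: PartialFunction[Any, Any] = {
    case RemoveExecutor(id, _) =>
      executors -= id
      true // value returned to the asker
  }
}
```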
Test build #67155 has finished for PR 15481 at commit
Updated, can you review again?
LGTM, @zsxwing any comments?
BTW, it was interesting that the earlier change did not trigger a test failure (the issue @viirya pointed out - about needing to move RemoveExecutor to receive).
Test build #67173 has finished for PR 15481 at commit
```
  }

case RemoveExecutor(executorId, reason) =>
  removeExecutor(executorId, reason)
```
Why remove `executorDataMap.get(executorId).foreach(_.executorEndpoint.send(StopExecutor))`?
Sorry that my comment was unclear. I meant we can just do the following changes:

```
protected def reset(): Unit = synchronized {
  numPendingExecutors = 0
  executorsPendingToRemove.clear()

  // Remove all the lingering executors that should be removed but not yet. The reason might be
  // because (1) disconnected event is not yet received; (2) executors die silently.
  executorDataMap.toMap.foreach { case (eid, _) =>
    removeExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))
  }
}
```
@mridulm How about just the following changes? I agreed that
@zsxwing I think the issue is that case RemoveExecutor() is not identical to what exists in receiveAndReply - which it should be. If we add that in there, will it not be sufficient?
I meant
Ah! Apologies, I got confused. Yes, I agree, that is a better approach. It also means we can get rid of the RemoveExecutor pattern match from receive, right? As it stands now, that looks buggy.

yep
Test build #67312 has finished for PR 15481 at commit
retest this please
Test build #67322 has finished for PR 15481 at commit
LGTM now.
LGTM. Merging to master and 2.0. Thanks!
…eset

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-17929

Now `CoarseGrainedSchedulerBackend` reset will get the lock,

```
protected def reset(): Unit = synchronized {
  numPendingExecutors = 0
  executorsPendingToRemove.clear()

  // Remove all the lingering executors that should be removed but not yet. The reason might be
  // because (1) disconnected event is not yet received; (2) executors die silently.
  executorDataMap.toMap.foreach { case (eid, _) =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
```

but removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized", and this will cause deadlock.

```
private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
  logDebug(s"Asked to remove executor $executorId with reason $reason")
  executorDataMap.get(executorId) match {
    case Some(executorInfo) =>
      // This must be synchronized because variables mutated
      // in this block are read when requesting executors
      val killed = CoarseGrainedSchedulerBackend.this.synchronized {
        addressToExecutorId -= executorInfo.executorAddress
        executorDataMap -= executorId
        executorsPendingLossReason -= executorId
        executorsPendingToRemove.remove(executorId).getOrElse(false)
      }
      ...
```

## How was this patch tested?

manual test.

Author: w00228970 <[email protected]>

Closes #15481 from scwf/spark-17929.

(cherry picked from commit c1f344f)

Signed-off-by: Shixiong Zhu <[email protected]>
What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-17929

Now `CoarseGrainedSchedulerBackend` reset will get the lock, but removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized", and this will cause a deadlock (the full code is quoted in the commit message above).
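To make the failure mode concrete, here is a minimal, self-contained sketch (not Spark code; every name here is invented) of the pattern: one thread holds a monitor while blocking on a reply, and the only thread that can produce the reply needs that same monitor. A timeout is used so the example terminates instead of hanging forever.

```
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for the backend's locking structure.
object DeadlockSketch {
  private val lock = new Object                                    // plays the role of the backend's monitor
  private val handlerThread = Executors.newSingleThreadExecutor()  // plays the role of DriverEndpoint

  // Like the original reset(): holds the lock while waiting for the handler's reply.
  def reset(): Unit = lock.synchronized {
    val reply = Promise[Boolean]()
    handlerThread.submit(new Runnable {
      override def run(): Unit = removeExecutor(reply)             // "RemoveExecutor" message
    })
    // Blocks: removeExecutor can never enter lock.synchronized while we hold the lock.
    Await.result(reply.future, 5.seconds)
  }

  // Like the original removeExecutor(): needs the same lock to mutate shared state.
  private def removeExecutor(reply: Promise[Boolean]): Unit = {
    val killed = lock.synchronized { /* mutate executor maps */ false }
    reply.success(killed)
  }

  def main(args: Array[String]): Unit = {
    try reset()
    catch { case _: java.util.concurrent.TimeoutException => println("deadlocked until timeout") }
    finally { handlerThread.shutdownNow(); handlerThread.awaitTermination(1, TimeUnit.SECONDS) }
  }
}
```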