This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Changes to support executor recovery behavior during static allocation. #244

Conversation

varunkatta
Member

@varunkatta varunkatta commented Apr 26, 2017

What changes were proposed in this pull request?

Added initial support for the driver to ask for more executors in case of framework faults.

Reviewer notes:
This is WIP and currently being tested. It seems to work for simple smoke tests. Looking for feedback on:

  • Any major blindspots in logic or functionality
  • General flow. Potential issues with control/data flows.
  • Whether style guidelines are followed.

Potential issues/Todos:

  • Verify that no deadlocks are possible.
  • Maybe explore message passing between threads instead of using synchronization.
  • Any uncovered issues in further testing

Reviewer notes

Main business logic is in
removeFailedAndRequestNewExecutors()

Overall executor recovery logic at a high-level:

  • On executor disconnect, we immediately disable the executor.
  • Delete/Error Watcher actions will trigger a capture of executor loss reasons. This happens on a separate thread.
  • There is another dedicated recovery thread, which looks at all previously disconnected executors and their loss reasons. It removes executors whose loss reasons have been discovered, or keeps retrying until the reasons are discovered. If the loss reason of a lost executor is not discovered within a sufficient time window, we give up and still remove the executor. For all removed executors, we request new executors on this recovery thread (see the sketch below).
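
For orientation, a condensed sketch of that recovery pass. The names disconnectedExecutorIds, lossReasons, reasonChecks, and executorsToRecover are placeholders standing in for the PR's bookkeeping structures; removeExecutor, requestExecutors, and SlaveLost are the scheduler-backend calls discussed in the review below. This is an outline, not the exact implementation:

import scala.collection.mutable

private def removeFailedAndRequestNewExecutors(): Unit = {
  val executorsToRecover = mutable.HashSet.empty[String]
  disconnectedExecutorIds.foreach { executorId =>
    lossReasons.get(executorId) match {
      case Some(exited) =>
        // The watcher thread already captured a loss reason: remove the executor,
        // and mark it for recovery if the loss was not the application's fault.
        removeExecutor(executorId, exited)
        if (!exited.exitCausedByApp) {
          executorsToRecover += executorId
        }
      case None =>
        val checks = reasonChecks.getOrElse(executorId, 0)
        if (checks >= MAX_EXECUTOR_LOST_REASON_CHECKS) {
          // Give up waiting for a reason and still remove (and recover) the executor.
          removeExecutor(executorId, SlaveLost("Executor lost for unknown reasons"))
          executorsToRecover += executorId
        } else {
          reasonChecks(executorId) = checks + 1
        }
    }
  }
  if (executorsToRecover.nonEmpty) {
    requestExecutors(executorsToRecover.size)
  }
}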

How was this patch tested?

Manually tested that on deleting a pod, new pods were being requested.

import io.fabric8.kubernetes.client.{Config, ConfigBuilder, DefaultKubernetesClient}

import okhttp3.Dispatcher
Member

I believe spark style favors fully-scoped import identifiers

Member

Is okhttp a dependency we've added? I don't see it showing up elsewhere in the repo.

Member Author

The Java Kubernetes client API has a dependency on okhttp. okhttp is exposed here because the default Dispatcher constructor creates only non-daemon threads in its executor, and we wanted to be able to specify a custom dispatcher.
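
To illustrate the point, a sketch of the kind of wiring involved (the thread factory shown is an assumption, not the PR's exact code): the okhttp Dispatcher can be handed an executor whose threads are daemons, so the HTTP client cannot keep the driver JVM alive.

import java.util.concurrent.{Executors, ThreadFactory}

import okhttp3.Dispatcher

// Build an executor whose threads are marked as daemons, then install it on the Dispatcher.
val daemonThreadFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val thread = new Thread(r)
    thread.setDaemon(true)
    thread
  }
}
val dispatcher = new Dispatcher(Executors.newCachedThreadPool(daemonThreadFactory))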

Member

👍

runningExecutorPods.remove(executor) match {
case Some(pod) => kubernetesClient.pods().delete(pod)
runningExecutorsToPods.remove(executor) match {
case Some(pod) =>
Member

There is some Scala community debate about whether case Some(...) / case None should be used, or option.fold. I'm unsure if the Spark style guide has an opinion, and I don't have a strong personal opinion.


Strongly prefer not to match on Options anywhere.
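
A minimal sketch of the Option-free form being asked for here, using the names from the diff above:

// foreach runs the body only when the Option is non-empty, replacing the match.
runningExecutorsToPods.remove(executor).foreach { pod =>
  kubernetesClient.pods().delete(pod)
}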

Member Author

Thanks, this is addressed now.

val PMEM_EXCEEDED_EXIT_CODE = -104

def memLimitExceededLogMessage(diagnostics: String): String = {
s"Container killed by YARN for exceeding memory limits.$diagnostics" +
Member

Should this say YARN?

Member Author

Oops, obviously not. :) Fixing it.

.withWebsocketPingInterval(0)
.build()
val httpClient = HttpClientUtils.createHttpClient(config).newBuilder()
.dispatcher(new Dispatcher(threadPoolExecutor))
Member

The latest code in #216 simplifies this code by using ThreadUtils.newDaemonCachedThreadPool. We may want to do this here to be in sync.
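
A sketch of the suggested simplification, assuming Spark's ThreadUtils and the HttpClientUtils builder chain already shown in this diff (the pool name is illustrative):

import okhttp3.Dispatcher
import org.apache.spark.util.ThreadUtils

// newDaemonCachedThreadPool returns a cached pool of daemon threads, replacing the
// hand-rolled ThreadFactory/ThreadPoolExecutor wiring.
private val dispatcherPool = ThreadUtils.newDaemonCachedThreadPool("kubernetes-dispatcher")
val httpClient = HttpClientUtils.createHttpClient(config).newBuilder()
  .dispatcher(new Dispatcher(dispatcherPool))
  .build()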

Member Author

Thanks, this is less verbose and easier on the eyes.

@@ -93,8 +102,11 @@ private[spark] class KubernetesClusterSchedulerBackend(
super.minRegisteredRatio
}

private val executorWatchResource = new AtomicReference[Closeable]
private val executorCleanupScheduler = Executors.newScheduledThreadPool(1)
Member

Do we know if this uses a daemon thread?
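
For reference, one way to guarantee a daemon thread here, assuming Spark's ThreadUtils as used elsewhere in this change (the thread name is illustrative):

import org.apache.spark.util.ThreadUtils

// A single-threaded scheduled executor backed by a daemon thread, so it cannot keep
// the driver JVM alive on its own.
private val executorRecoveryScheduler =
  ThreadUtils.newDaemonSingleThreadScheduledExecutor("kubernetes-executor-recovery")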

}

def getContainerExitStatus(containerStatus: ContainerStatus): Int = {
containerStatus.getState.getTerminated.getExitCode.intValue()
Member

No need of () in Scala.


def handleErroredPod(pod: Pod): Unit = {
val alreadyReleased = RUNNING_EXECUTOR_PODS_LOCK.synchronized {
runningPodsToExecutors.contains(pod)
Member

I wonder if this does object pointer equality as opposed to value or pod-name equality. Pods from different event time points can be different objects, so this check might not work for us.

Member Author

This was subtle. Yes, I do string equality of pod names now.
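
A sketch of what a name-based check can look like, assuming the runningPodsToExecutors map from this diff; comparing metadata names avoids relying on object identity of Pod instances delivered by different watch events:

// True when no currently tracked pod has the same name as the pod from the incoming event.
def isPodAlreadyReleased(pod: Pod): Boolean = RUNNING_EXECUTOR_PODS_LOCK.synchronized {
  !runningPodsToExecutors.keys.exists(_.getMetadata.getName == pod.getMetadata.getName)
}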

if (action == Action.ERROR) {
val podName = pod.getMetadata.getName
logDebug(s"Received pod $podName exited event. Reason: " + pod.getStatus.getReason)
getContainerExitStatus(pod)
Member

The return value doesn't seem to be used. Maybe we can remove this line?

Member Author

done

} else {
val containerExitReason = containerExitStatus match {
case VMEM_EXCEEDED_EXIT_CODE | PMEM_EXCEEDED_EXIT_CODE =>
memLimitExceededLogMessage(pod.getStatus.getReason)
Member

I remember we discussed in the Google doc that "mem exceeded" should be a framework fault. I am confused by line 353 setting exitCausedByApp = true?

Member Author

Only if the pod gets explicitly deleted do we deem it a framework fault. Since Spark actively tries to manage memory, it makes more sense to classify memory-limit kills as an application fault?

}
}

private val executorCleanupRunnable: Runnable = new Runnable {
Member

Maybe change the variable name to indicate this requests new executors? "Cleanup" does not indicate the impact of what this does, IMO. Maybe executorRecoveryRunnable?

Member Author

done


def removeFailedAndRequestNewExecutors(): Unit = {
val localRunningExecutorsToPods = RUNNING_EXECUTOR_PODS_LOCK.synchronized {
runningExecutorsToPods.toMap
Member

Does this create a copy?

Member Author

Yes, an immutable copy.

}

private val executorCleanupRunnable: Runnable = new Runnable {
private val removedExecutors = new mutable.HashSet[String]
Member

Maybe rename to executorsToRecover to indicate the significance of action we do for these?

Member Author

Done

logDebug(s"Removing executor $executorId with loss reason "
+ executorExited.message)
if (!executorExited.exitCausedByApp) {
removedExecutors.add(executorId)
Member

Shouldn't we also subject this to some maximum retry?

Member Author

This is addressed in the latest diff

requestExecutors(removedExecutors.size)
if (executorsToRecover.nonEmpty &&
recoveredExecutorCount < MAX_ALLOWED_EXECUTOR_RECOVERY_ATTEMPTS) {
requestExecutors(executorsToRecover.size)
Member

Compute numExecutorsToRecover as Math.min(executorsToRecover.size, MAX_ALLOWED_EXECUTOR_RECOVERY_ATTEMPTS - recoveredExecutorCount) and pass it to requestExecutors so that we don't go above the max?
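
The suggestion spelled out as a sketch (this code path was later dropped in favor of a simpler approach, per the reply below):

// Cap the request so the running total of recovered executors never exceeds the maximum.
val numExecutorsToRecover = math.min(
  executorsToRecover.size,
  MAX_ALLOWED_EXECUTOR_RECOVERY_ATTEMPTS - recoveredExecutorCount)
requestExecutors(numExecutorsToRecover)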

Member Author

Ditto, dropping this code.

private val executorRecoveryRunnable: Runnable = new Runnable {

private val MAX_EXECUTOR_LOST_REASON_CHECKS = 10
private val MAX_ALLOWED_EXECUTOR_RECOVERY_ATTEMPTS = 100
Member

Hmm, this is global. I was wondering if we can use a max per executor ID.

Member Author

I am dropping the max allowed recovery attempts, as it is not immediately obvious what a good default for this would be. Keeping this simple and similar to the behavior on YARN.

removeExecutor(executorId, SlaveLost("Executor lost for unknown reasons"))
executorsToRecover.add(executorId)
} else {
executorAttempts.put(executorId, reasonCheckCount + 1)
Member

s/executorAttempts/executorReasonChecks/

@varunkatta
Member Author

varunkatta commented May 5, 2017

Logs from testing:

Observed that on killing an executor, it was first disabled with its loss reason pending, later removed once the reason was received from the K8s master, and a new executor was requested because the loss was attributed to a framework fault.

2017-05-04 23:48:47 INFO  KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Disabling executor 1.
2017-05-04 23:48:47 INFO  BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2017-05-04 23:48:47 INFO  BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(1, 10.42.0.3, 7079, None)
2017-05-04 23:48:47 INFO  KubernetesClusterSchedulerBackend:54 - Received action DELETED for pod  spark-pi-1493941665060-exec-1
2017-05-04 23:48:47 INFO  KubernetesClusterSchedulerBackend:54 - Received delete pod spark-pi-1493941665060-exec-1 event. Reason: null
2017-05-04 23:48:47 INFO  BlockManagerMaster:54 - Removed 1 successfully in removeExecutor
2017-05-04 23:48:47 INFO  DAGScheduler:54 - Shuffle files lost for executor: 1 (epoch 0)
2017-05-04 23:48:52 INFO  KubernetesClusterSchedulerBackend:54 - Requesting 1 additional executor(s) from the cluster manager
2017-05-04 23:48:52 INFO  KubernetesClusterSchedulerBackend:54 - Requesting 1 additional executors, expecting total 4 and currently expected 3
2017-05-04 23:48:52 ERROR TaskSchedulerImpl:70 - Lost executor 1 on 10.42.0.3: Pod spark-pi-1493941665060-exec-1 deleted by K8s master
2017-05-04 23:48:52 INFO  TaskSetManager:54 - Task 250 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
2017-05-04 23:48:52 INFO  KubernetesClusterSchedulerBackend:54 - Received action ADDED for pod  spark-pi-1493941665060-exec-4

@varunkatta varunkatta changed the title WIP: Changes to support executor recovery behavior during static allocation. Changes to support executor recovery behavior during static allocation. May 5, 2017

private val executorDockerImage = conf.get(EXECUTOR_DOCKER_IMAGE)
private val kubernetesNamespace = conf.get(KUBERNETES_NAMESPACE)
private val executorPort = conf.getInt("spark.executor.port", DEFAULT_STATIC_PORT)
private val blockmanagerPort = conf
.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT)

private val kubernetesDriverServiceName = conf
Member

Is this from your change? Maybe from a wrong merge?

Member Author

You are right, a merge fail. Fixed now.

// be the right default since we know the pod was not explicitly deleted by the user.
"Pod exited with following container exit status code " + containerExitStatus
}
ExecutorExited(containerExitStatus, exitCausedByApp = true, containerExitReason)
Member

Maybe this else block needs extra indentation? It's hard to realize this is being assigned to exitReason.

Member Author

I see Spark code with and without indentation in this case. I like your suggestion of having indentation. Made this change.

@varunkatta
Member Author

@foxish Can you have a quick look at the change? I want to make sure we don't accidentally run into the issue you uncovered during the dynamic allocation work, where 2x the expected pods seemed to be allocated by K8s, on this change too. Also, I am guessing you might have more pointed input on the general functionality.

Member

@kimoonkim kimoonkim left a comment

LGTM. I was able to follow the code relatively easily with the latest diff. Thanks for writing this PR.

@foxish As @varunkatta suggested, probably it's best for you to take a look at this next? I wonder how your dynamic allocation PR would possibly interact with this PR. Maybe they just complement each other, which would be really great.

+ executorExited.message)
removeExecutor(executorId, executorExited)
if (!executorExited.exitCausedByApp) {
executorsToRecover.add(executorId)
Member

Maybe you want to update the PR description with this code snippet saying this is the main business logic.

Member Author

done

@foxish
Member

foxish commented May 5, 2017 via email

// the driver main thread to shut down upon errors. Otherwise, the driver
// will hang indefinitely.
val config = configBuilder
.withWebsocketPingInterval(0)
Member

Why are we changing the websocket ping interval here?

Member

This is to work around a bug in the web socket ping thread, which is created as a non-daemon thread and lets the driver hang if an exception is thrown in the driver main thread. See the comment at lines 85 - 87. More details are in the PR 216 comment, which has the code snippet of how the web socket ping thread is created.

scheduler: TaskSchedulerImpl,
val sc: SparkContext)
private[spark] class KubernetesClusterSchedulerBackend(scheduler: TaskSchedulerImpl,
val sc: SparkContext)
Member

The argument indentation style doesn't adhere to Scala convention. I'm guessing this is the IDE you're using. We should revert these unintended changes.

Member Author

addressed

if (conf.getOption("spark.scheduler.minRegisteredResourcesRatio").isEmpty) {
0.8
} else {
super.minRegisteredRatio
}

private val executorWatchResource = new AtomicReference[Closeable]
Member

This can be of type Watch

private val runningExecutorPods = new scala.collection.mutable.HashMap[String, Pod]
private val RUNNING_EXECUTOR_PODS_LOCK = new Object
private val runningExecutorsToPods = new mutable.HashMap[String, Pod] // Indexed by executor IDs.
private val runningPodsToExecutors = new mutable.HashMap[Pod, String] // Indexed by executor Pods.
Member

Can we do without runningPodsToExecutors? It doesn't seem like it's being used for its index.

Member Author

Seems like I missed commenting on this. We do use it to see if the pod has already exited.

def getContainerExitStatus(pod: Pod): Int = {
  val containerStatuses = pod.getStatus.getContainerStatuses.asScala
  for (containerStatus <- containerStatuses) {
    return getContainerExitStatus(containerStatus)
Member

If we're always returning the first container's exit status, we can avoid the loop here and perhaps just fetch the status directly?

Member Author

addressed.

private val PMEM_EXCEEDED_EXIT_CODE = -104

def memLimitExceededLogMessage(diagnostics: String): String = {
s"Pod/Container killed for exceeding memory limits.$diagnostics" +
Member

nit: space after period.

Member Author

done


private val MAX_EXECUTOR_LOST_REASON_CHECKS = 10
private val executorsToRecover = new mutable.HashSet[String]
private val executorReasonChecks = new mutable.HashMap[String, Int]
Member

Can you please add a comment here explaining what executorReasonChecks is?

Member Author

done

}
ExecutorExited(containerExitStatus, exitCausedByApp = true, containerExitReason)
}
FAILED_PODS_LOCK.synchronized {
Member

@mccheah is there a concurrent, lock-free version of the map that we can use that doesn't need locking everywhere? Like scala.collection.concurrent.Map?

Member

+1 for this.

Member Author

Thanks. Made the failedPods map a concurrent hash map. It is not truly lock-free, as there are implicit locks, but explicit locking by the user is not required.
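
For reference, a sketch of one way to obtain a thread-safe Scala map backed by a Java ConcurrentHashMap, assuming ExecutorExited values as used elsewhere in this change:

import java.util.concurrent.ConcurrentHashMap

import scala.collection.JavaConverters._
import scala.collection.concurrent

import org.apache.spark.scheduler.ExecutorExited

// asScala on a ConcurrentHashMap yields a scala.collection.concurrent.Map, so callers
// do not need an explicit lock object around individual reads and writes.
private val failedPods: concurrent.Map[String, ExecutorExited] =
  new ConcurrentHashMap[String, ExecutorExited]().asScala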


def handleDeletedPod(pod: Pod): Unit = {
val exitReason = ExecutorExited(getContainerExitStatus(pod), exitCausedByApp = false,
"Pod " + pod.getMetadata.getName + " deleted by K8s master")
Member

It need not necessarily be deleted by the master. Maybe we can say: Pod <x> lost/deleted.

Member Author

addressed


private val executorRecoveryRunnable: Runnable = new Runnable {

private val MAX_EXECUTOR_LOST_REASON_CHECKS = 10
Member

I'm guessing we want this knob to be something the user controls, via some spark property?

Member Author

This knob need not be controlled by the user. It is a very specific knob tied to the implementation, and users shouldn't have to worry about tuning it, I think.

@foxish
Member

foxish commented May 8, 2017

Thanks @varunkatta.
Sorry about the delay in reviewing. The overall logic looks about right, but there are some comments I've left around cleanup that we can do. I'd prefer having fewer separate threads/watchers responsible for recovery and accounting.

@ash211

ash211 commented May 18, 2017

rerun unit tests please

@foxish
Member

foxish commented May 22, 2017

@varunkatta PR needs rebase.

@@ -17,6 +17,7 @@
package org.apache.spark.scheduler.cluster.kubernetes

import java.io.File
import java.util.concurrent.{ThreadFactory, ThreadPoolExecutor}

Unused?

@erikerlandson
Member

@varunkatta PR is out of sync again

val localRunningExecutorsToPods = RUNNING_EXECUTOR_PODS_LOCK.synchronized {
runningExecutorsToPods.toMap
}
executorsToRemove.foreach { case (executorId) =>
Member

Is _executorsToRemoveMap guarded by its internal lock when being iterated through? The Java ConcurrentHashMap is not. If not, this foreach needs to be guarded by a lock, so I suggest you keep the original implementation of using an explicit lock object with a normal map.

Member Author

There is no need for the iteration to be guarded. The previous guard was used for thread safety, and ConcurrentHashMap is being used as a thread-safe map here. Since iteration happens only on a single consumer thread, there is no need to lock the entire map. If elements get added to this map by a producer thread during the consumer thread's iteration over it, that is acceptable logic here. Wondering if I am missing something.

Member

OK, if insertion during iteration is accepted logic (which is what I was not sure about), then yes, locking the map is not needed.

Member Author

Cool, thanks

@liyinan926
Member

LGTM.

@liyinan926
Member

BTW: please squash the commits.

@varunkatta
Member Author

rerun unit tests please

@foxish
Member

foxish commented Jul 19, 2017

rerun unit tests please

import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action
import java.{lang, util}
Member

import ordering


We shouldn't just be importing the packages, but the specific classes in question.


(The exception is for our config and constants where we import all contents of the package object)

Member Author

Agreed, will address.

Member Author

@varunkatta varunkatta Jul 20, 2017

Revisiting this. If we don't import the package here, how do we distinguish between scala.collection.concurrent.Map and java.util.Map? Right now we use the package name only as a qualifier. We are not importing all classes under util, just the Map class with the right qualifier.

@mccheah mccheah Jul 20, 2017

Fully qualify the class name in the usage in the code itself. That being said, we shouldn't need to ever refer to the Java version - work entirely in Scala primitives and invoke asJava whenever we need the Java version.


If it's absolutely necessary then the import can be aliased as follows:

import java.util.{List => JList}

Then whenever one refers to java.util.List the code can just use JList.

Member Author

@varunkatta varunkatta Jul 20, 2017

Agreed that never referring to the Java version is the right advice to follow. I refactored the code a bit so we are not explicitly referring to the Java Map interface. I am using a Java ConcurrentHashMap, as there doesn't seem to be any legitimate native Scala concurrent hash map implementation (or one does exist and I failed to find it). Let me know if this use is acceptable.

def getExecutorExitStatus(pod: Pod): Int = {
  val containerStatuses = pod.getStatus.getContainerStatuses
  if (!containerStatuses.isEmpty) {
    return getExecutorExitStatus(containerStatuses.get(0))
Member

This assumes that the first container is what we want. Safe assumption for now, but may not hold in the future. No need to change it atm, but good to know that we are assuming the first container is the main executor.

Member Author

I will leave a comment here.

removeExecutorOrIncrementLossReasonCheckCount(executorId)
}
}
executorsToRecover.foreach(executorId => {
Member

I'm a bit lost. Where do we recover the executor and assign it the same ID as before?

Member Author

With the new dynamic allocator thread, new executor requests happen on that allocator thread. In the removeFailedExecutors method, the failed pod is removed from the internal bookkeeping map runningExecutorsToPods, and with this a new executor ends up getting created by the allocator.

Member

@foxish foxish Jul 20, 2017

Does the allocation thread reuse old executor IDs? Because I saw that we are maintaining a map of executorID -> count of checks performed. I was wondering if it makes sense to have an (executorID -> count) mapping, because the executors are identical and they potentially get rescheduled on different nodes. So, executor #1 failing twice doesn't necessarily indicate a problem with that executor, but with all executors.

Member

Maybe we can get away with specifying a total number of failures instead?

Member Author

@varunkatta varunkatta Jul 20, 2017

(executorID -> count) is used to track the number of attempts made per executor to learn the actual loss reason before we give up and assume it is a framework fault, as we can't keep trying forever. Since this caused some confusion during the first iteration, I will add a comment to clarify the reason for the map's existence.

Member Author

Also, we don't reuse old executor IDs. All new executors get a new ID, and these IDs are monotonically increasing. That is in line with Spark standalone and YARN.

Member

Ah! Okay, makes sense.

@foxish
Member

foxish commented Jul 20, 2017

LGTM after open comments addressed.
Hoping to merge this today/tomorrow, as the last PR before we cut a new release.

localRunningExecutorsToPods.get(executorId) match {
  case Some(pod) =>
    failedPods.get(pod.getMetadata.getName) match {
      case Some(executorExited: ExecutorExited) =>
@mccheah mccheah Jul 20, 2017

Never match on Options, use the functional API instead

Member Author

done

executorReasonChecks -= executorId
RUNNING_EXECUTOR_PODS_LOCK.synchronized {
  runningExecutorsToPods.remove(executorId) match {
    case Some(pod) =>

Never match on Options, always use the functional API instead

Member Author

done

if (action == Action.MODIFIED && pod.getStatus.getPhase == "Running"
&& pod.getMetadata.getDeletionTimestamp == null) {
&& pod.getMetadata.getDeletionTimestamp == null) {

Nit: Push this indentation in 4 spaces

Member Author

done

val podIP = pod.getStatus.getPodIP
val clusterNodeName = pod.getSpec.getNodeName
logDebug(s"Executor pod $pod ready, launched at $clusterNodeName as IP $podIP.")
EXECUTOR_PODS_BY_IPS_LOCK.synchronized {
executorPodsByIPs += ((podIP, pod))
}
} else if ((action == Action.MODIFIED && pod.getMetadata.getDeletionTimestamp != null) ||
action == Action.DELETED || action == Action.ERROR) {
action == Action.DELETED || action == Action.ERROR) {

Nit: Push this indentation in 4 spaces

Member Author

done

logInfo(s"Received pod $podName exited event. Reason: " + pod.getStatus.getReason)
handleErroredPod(pod)
}
else if (action == Action.DELETED) {

Move the else up to the same line as the closing brace of the previous if.

Member Author

done

def getExecutorExitStatus(pod: Pod): Int = {
  val containerStatuses = pod.getStatus.getContainerStatuses
  if (!containerStatuses.isEmpty) {
    return getExecutorExitStatus(containerStatuses.get(0))

Avoid using return - can just use

if (!containerStatuses.isEmpty) getExecutorExitStatus(...) else DEFAULT_CONTAINER_FAILURE_EXIT_STATUS
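
Assembled into a sketch, assuming the names from the surrounding diff and JavaConverters in scope; headOption also keeps the first-container assumption explicit:

def getExecutorExitStatus(pod: Pod): Int = {
  // Use the first container status if present, otherwise fall back to the default failure code.
  pod.getStatus.getContainerStatuses.asScala.headOption
    .map(status => getExecutorExitStatus(status))
    .getOrElse(DEFAULT_CONTAINER_FAILURE_EXIT_STATUS)
}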

Member Author

done

}

def handleErroredPod(pod: Pod): Unit = {
def isPodAlreadyReleased(pod: Pod): Boolean = {

Avoid inner method definitions - move this outside somewhere

Member Author

done


def getExecutorExitStatus(containerStatus: ContainerStatus): Int = {
  containerStatus.getState match {
    case null => UNKNOWN_EXIT_CODE

Don't match on null - use Option(containerStatus.getState).map(...).getOrElse
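
A sketch of that null-free form, assuming the UNKNOWN_EXIT_CODE constant from the diff above:

def getExecutorExitStatus(containerStatus: ContainerStatus): Int = {
  // Wrap each potentially null accessor in Option instead of matching on null.
  Option(containerStatus.getState)
    .flatMap(state => Option(state.getTerminated))
    .flatMap(terminated => Option(terminated.getExitCode))
    .map(_.intValue())
    .getOrElse(UNKNOWN_EXIT_CODE)
}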

Member Author

done

@@ -160,7 +173,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
}
}

override val minRegisteredRatio =
override val minRegisteredRatio: Double =

For a val, there is no need to declare the type.

Member Author

done

import org.apache.spark.rpc.{RpcAddress, RpcCallContext, RpcEndpointAddress, RpcEnv}
import org.apache.spark.scheduler.{ExecutorExited, ExecutorLossReason, SlaveLost, TaskSchedulerImpl}
import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.{RetrieveSparkAppConfig,
SparkAppConfig}

Imports are not subject to line length restrictions, so move this up a line.

Member Author

done

@foxish
Member

foxish commented Jul 21, 2017

Thanks @varunkatta
@mccheah PTAL.
Squash and merge on green if all style comments are addressed.

@foxish
Member

foxish commented Jul 21, 2017

Merging. Please address any other style items in a future PR.

@foxish foxish merged commit 4dfb184 into apache-spark-on-k8s:branch-2.1-kubernetes Jul 21, 2017
foxish pushed a commit that referenced this pull request Jul 24, 2017
…n. (#244)

* Changes to support executor recovery behavior during static allocation.

* addressed review comments

* Style changes and removed incorrectly merged code

* addressed latest review comments

* changed import order

* Minor changes to avoid exceptions when exit code is missing

* fixed style check

* Addressed review comments from Yinan Li.

* Addressed comments and got rid of an explicit lock object.

* Fixed imports order.

* Addressed review comments from Matt

* Couple of style fixes
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
…n. (apache-spark-on-k8s#244)
