Conversation

jiangxb1987 (Contributor) commented Jun 4, 2018

What changes were proposed in this pull request?

This PR adds a new RDDBarrier and BarrierTaskContext to support barrier scheduling in Spark. It also modifies how job scheduling works to accommodate the new feature.

Note: this is a prototype to facilitate the discussion. It is not meant to be the final design; it just shows one way that might work.
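For illustration, here is a minimal sketch of how the proposed API might be exercised from user code. The rdd.barrier() entry point and the hosts() method follow this prototype's test code; the exact names and signatures may change with the final design.

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.barrier.BarrierTaskContext

object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("barrier-sketch").setMaster("local[2]"))
    // Assumed entry point: rdd.barrier() returns the new RDDBarrier, which makes the
    // scheduler launch all tasks of the stage together.
    sc.parallelize(1 to 100, 2)
      .barrier()
      .mapPartitions { iter =>
        // Inside a barrier stage the task context is the new BarrierTaskContext.
        val tc = TaskContext.get.asInstanceOf[BarrierTaskContext]
        // All tasks of the stage are running at this point, so the peer host list is complete.
        println(s"Barrier peers: ${tc.hosts().mkString(", ")}")
        iter
      }
      .count()
    sc.stop()
  }
}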

How was this patch tested?

Simple unit test and integration test.

SparkQA commented Jun 5, 2018

Test build #91471 has finished for PR 21494 at commit 84cdc68.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

Ngone51 (Member) commented Jun 6, 2018

Hi @jiangxb1987, could you explain in more detail what barrier scheduling in Spark is, and give an example that would only work with barrier scheduling (but could not work under the current Spark scheduling mechanism), for better understanding?

jiangxb1987 (Contributor, Author) commented:

@Ngone51 You can refer to the SPIP that xiangrui proposed in SPARK-24374 for the basic background and major goals of barrier scheduling, and to SPARK-24375 for a design sketch. If you have further comments, please feel free to discuss them on the JIRA (recommended, because that works better for things we may want to revisit later) or here :)
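For intuition (a hedged sketch, not code from this PR): barrier scheduling targets stages whose tasks must all run at the same time, for example because each task blocks at a global barrier before exchanging data with its peers. Under the current scheduler, a stage's tasks can be launched in several waves when slots are scarce, so a task waiting at the barrier could wait forever on peers that have not even started; launching the whole stage as a gang avoids that. In the sketch below, rdd, trainOnPartition, waitAtGlobalBarrier and allReduce are placeholders:

// Illustrative sketch of a distributed-training style stage that needs gang scheduling.
rdd.barrier().mapPartitions { iter =>
  val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext]
  val partial = trainOnPartition(iter)      // hypothetical helper
  waitAtGlobalBarrier(tc)                   // hypothetical: blocks until every task of the stage arrives
  Iterator(allReduce(partial, tc.hosts()))  // hypothetical: exchanges results with all peer hosts
}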

galv left a comment:

I haven't understood everything yet. I'll have to return to this later to review fully.

val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext]
// If we don't get the expected taskInfos, the job shall abort due to stage failure.
if (tc.hosts().length != 2) {
throw new SparkException("Expected taksInfos length is 2, actual length is " +

taksInfos -> taskInfos

throw new SparkException("Expected taksInfos length is 2, actual length is " +
s"${tc.hosts().length}.")
}
// println(tc.getTaskInfos().toList)

Remove comment

shuffle.DiskBytesSpilled = 0
_accumulatorRegistry.clear()

if (isBarrier):

Style: if (isBarrier): -> if isBarrier:


if (isBarrier):
port = 25333 + 2 + 2 * taskContext._partitionId
paras = GatewayParameters(port=port)

paras -> params

_accumulatorRegistry.clear()

if (isBarrier):
port = 25333 + 2 + 2 * taskContext._partitionId

I recommend using DEFAULT_PORT and DEFAULT_PYTHON_PORT. They are exposed as part of the public API of py4j: https://github.com/bartdag/py4j/blob/216432d859de41441f0d1a0d55b31b5d8d09dd28/py4j-python/src/py4j/java_gateway.py#L54

By the way, acquiring ports like this is a little hacky and may require more thought.

// TODO: We should kill any running task attempts when the task set manager becomes a zombie.
private[scheduler] var isZombie = false

private[scheduler] lazy val barrierCoordinator = {

I recommend adding a return type here for readability.

Contributor

+1 @galv
We also have barrierCoordinator with type RpcEndpointRef at each TaskContext, so it's better to add return type for both.
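For illustration, the annotated declaration being asked for might look like the sketch below; the endpoint name and the BarrierCoordinator constructor arguments are guesses based on the snippets in this thread, not the PR's actual code:

import org.apache.spark.rpc.RpcEndpointRef

// Explicit return type, as suggested, so the reader does not have to infer it
// from the initializer.
private[scheduler] lazy val barrierCoordinator: RpcEndpointRef = {
  val rpcEnv = sched.sc.env.rpcEnv  // assumption: reach the driver RpcEnv through the scheduler
  rpcEnv.setupEndpoint(s"barrierCoordinator-${taskSet.stageId}",
    new BarrierCoordinator(taskSet.numTasks, barrierSyncTimeoutMs, rpcEnv))  // assumed arguments
}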

import org.apache.spark.metrics.MetricsSystem
import org.apache.spark.util.RpcUtils

class BarrierTaskContext(
Member

BarrierTaskContextImpl?

case IncreaseEpoch(previousEpoch) =>
if (previousEpoch == epoch) {
syncRequests.foreach(_.sendFailure(new RuntimeException(
s"The coordinator cannot get all barrier sync requests within $timeout ms.")))
Member

Have we considered incrementally increasing the timeout when we can't get all barrier sync requests within an epoch?
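If that idea were pursued, one hedged sketch of what it could look like in the coordinator (the field names and the doubling policy are purely illustrative, not part of this PR, and logInfo assumes the coordinator mixes in Logging):

// Illustrative only: back off the barrier-sync timeout after a failed epoch,
// up to some cap, instead of reusing the same fixed value every time.
private var currentTimeout: Long = timeout
private val maxTimeout: Long = 10 * timeout  // assumed cap

private def onEpochTimedOut(): Unit = {
  currentTimeout = math.min(currentTimeout * 2, maxTimeout)
  logInfo(s"Barrier sync timed out at epoch $epoch; next epoch will wait up to $currentTimeout ms.")
}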


syncRequests += context
replyIfGetAllSyncRequest()
}
Member

if (epoch == this.epoch) {
 ...
} else { // Received RpcCallContext from failed previousEpoch.
  context.sendFailure(new RuntimeException(
    s"The coordinator cannot get all barrier sync requests within $timeout ms.")))
}

taskScheduler.cancelTasks(stageId, interruptThread = false)
} catch {
case e: UnsupportedOperationException =>
logInfo(s"Could not cancel tasks for stage $stageId", e)
Member

Under barrier execution, will it be a problem if we cannot cancel tasks?

taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
// Skip the barrier taskSet if the available slots are less than the number of pending tasks.
if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
// Skip the launch process.
Member

Should we log something here instead of silently passing?
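A hedged sketch of the kind of logging being suggested (the message wording is illustrative, not from the PR):

// Say why the barrier task set is being skipped, so a stuck-looking stage can be
// diagnosed as waiting for slots rather than silently stalled.
if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
  logInfo(s"Skipping barrier taskSet ${taskSet.name}: it needs ${taskSet.numTasks} slots " +
    s"to launch all tasks together, but only $availableSlots are currently available.")
} else {
  // ... existing launch path ...
}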


Is there a way to propagate this info to the scheduler and resource manager layer for preemptive scheduling?

timer.schedule(new TimerTask {
override def run(): Unit = {
// self can be null after this RPC endpoint is stopped.
if (self != null) self.send(IncreaseEpoch(currentEpoch))
Member

Once this epoch fails to sync, the stage will fail and be resubmitted. I think it will begin with a new task set, so IncreaseEpoch seems useless because it doesn't really increase the epoch?


Maybe register a sequence and hierarchy of task-level barriers?

// Write out the TaskContextInfo
val isBarrier = context.isInstanceOf[BarrierTaskContext]
dataOut.writeBoolean(isBarrier)
if (isBarrier) {
Member

So this would be language dependent? Would we need something for the R runner too?

timeout: Long,
override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {

private var epoch = 0
Contributor

Will the epoch value be logged on the driver and executors? It would be useful for diagnosing the upper-level MPI program.
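For illustration, the kind of logging being asked about might look like the sketch below; the message name and its field are guesses based on the snippets in this thread, not the PR's actual code:

// Illustrative only: log the coordinator's view of the epoch when handling a sync request,
// so progress of an MPI-style program on top can be correlated with barrier syncs.
case RequestToSync(requestEpoch) =>  // assumed message shape
  logInfo(s"Received barrier sync request for epoch $requestEpoch (coordinator epoch: $epoch).")
  // ... existing handling ...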

/** Returns the default Spark timeout to use for RPC ask operations. */
def askRpcTimeout(conf: SparkConf): RpcTimeout = {
RpcTimeout(conf, Seq("spark.rpc.askTimeout", "spark.network.timeout"), "120s")
RpcTimeout(conf, Seq("spark.rpc.askTimeout", "spark.network.timeout"), "900s")

Why hard-code this change? Couldn't you have set this at runtime if you needed it increased? I'm concerned about it breaking backwards compatibility with jobs that for whatever reason depend on the 120 second timeout.
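For reference, the timeout can also be raised per application without touching the default, using the standard configuration keys (the 900s value below just mirrors the hard-coded change):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the RPC ask timeout for this application only; spark.rpc.askTimeout falls
// back to spark.network.timeout when it is not set.
val conf = new SparkConf()
  .setAppName("barrier-job")
  .set("spark.rpc.askTimeout", "900s")
val sc = new SparkContext(conf)

The same can be done on the command line with spark-submit --conf spark.rpc.askTimeout=900s.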



/**
* An RDD that supports running MPI programme.

programme -> program

jiangxb1987 (Contributor, Author) commented:

Closing this in favor of #21758 and #21898. Thanks for your comments! I hope they're addressed in the new code.
