[SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone #25047
Conversation
val requests = parseAllResourceRequests(_conf, SPARK_DRIVER_PREFIX).map { req =>
  req.id.resourceName -> req.amount
}.toMap
// TODO(wuyi) log driver's acquired resources separately ?
Hi, @Ngone51. Please don't use user-id TODOs in the patch. As you know, the Apache Spark repository already has a few ancient user-id TODOs like this that have never been fixed. :)
Since we don't know the future, let's use a JIRA-IDed TODO like TODO(SPARK-XXX), as illustrated below.
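For instance (the JIRA number below is only a placeholder, not a real ticket):

// discouraged: user-id TODO
// TODO(wuyi) log driver's acquired resources separately ?
// preferred: JIRA-IDed TODO
// TODO(SPARK-XXXXX): log driver's acquired resources separately?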
@dongjoon-hyun Thank you for the reminder. I'll fix those TODOs in the following commits.
}

def hasEnoughResources(resourcesFree: Map[String, Int], resourceReqs: Map[String, Int])
  : Boolean = {
nit. indentation.
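For context, a minimal sketch of what such a check could look like, assuming resourcesFree maps resource names to free counts and resourceReqs maps names to requested amounts (an illustration, not necessarily the PR's exact body):

def hasEnoughResources(
    resourcesFree: Map[String, Int],
    resourceReqs: Map[String, Int]): Boolean = {
  // every requested resource must have at least the requested amount free
  resourceReqs.forall { case (rName, amount) =>
    resourcesFree.getOrElse(rName, 0) >= amount
  }
}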
  val delegate = brothers.head._2
  delegate.endpoint.send(ReleaseResources(worker.resourcesCanBeReleased))
} else {
  // TODO(wuyi5) cases here are hard to handle:
ditto.
 * @return
 */
def acquireResources(resourceReqs: Map[String, Int])
  : Map[String, Seq[String]] = {
nit. indentation.
def acquireResources(resourceReqs: Map[String, Int])
  : Map[String, Seq[String]] = {
  resourceReqs.map { case (rName, amount) =>
    // TODO (wuyi) rName does not exists ?
ditto for (wuyi).
What does the comment mean? Why wouldn't rName exist?
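To illustrate the concern: if resourceReqs ever contains a resource name the worker never registered, a bare lookup into the worker's allocators would fail. A hedged sketch of a defensive variant (the `available` parameter and error handling are assumptions for illustration, not the PR's code):

def acquireResources(
    available: Map[String, Seq[String]],
    resourceReqs: Map[String, Int]): Map[String, Seq[String]] = {
  resourceReqs.map { case (rName, amount) =>
    // guard against a resource name the worker doesn't actually provide
    val addresses = available.getOrElse(rName,
      throw new IllegalArgumentException(s"Worker provides no resource named '$rName'"))
    require(addresses.size >= amount, s"Not enough '$rName' addresses free")
    rName -> addresses.take(amount)
  }
}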
  case e: Exception =>
    logError("Failed to create work directory " + workDir, e)
    System.exit(1)
if (!Utils.createDirectory(workDir)) {
Ur, is it the same? Utils.createDirectory seems to work differently from workDir.mkdirs().
Or do we need to change the current behavior of createWorkDir for this PR?
Oops. I realized that this is a newly added function in this PR.
I was confused because it overloads the existing one; only the parameters are different.
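For reference, a hedged sketch of what such a boolean-returning overload might boil down to (assumed for illustration; the PR's actual helper may retry or log differently). The source of the confusion is that java.io.File.mkdirs() alone returns false when the directory already exists, so the two code paths are not interchangeable:

import java.io.File

// Assumed illustration: treat "already exists" as success, unlike a bare mkdirs() check.
def createDirectory(dir: File): Boolean = dir.isDirectory || dir.mkdirs()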
 * @param addresses Resource addresses provided by the executor/worker
 */
class ResourceAllocator(name: String, addresses: Seq[String]) extends Serializable {
  /**
indentation.
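For orientation, a rough sketch of the kind of bookkeeping such an allocator performs, assuming it hands out concrete addresses (e.g. GPU indices) and takes them back on release (an illustration, not the PR's exact implementation):

import scala.collection.mutable

class ResourceAllocator(name: String, addresses: Seq[String]) extends Serializable {
  // addresses not yet handed out to any executor/driver
  private val available = mutable.Set(addresses: _*)

  def acquire(amount: Int): Seq[String] = {
    require(available.size >= amount, s"Not enough '$name' addresses left")
    val acquired = available.take(amount).toSeq
    available --= acquired
    acquired
  }

  def release(addrs: Seq[String]): Unit = available ++= addrs

  def availableAddrs: Seq[String] = available.toSeq
}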
Test build #107211 has finished for PR 25047 at commit

Test build #107229 has finished for PR 25047 at commit

Test build #107243 has finished for PR 25047 at commit

Test build #107249 has finished for PR 25047 at commit

Test build #107276 has finished for PR 25047 at commit

Test build #107285 has finished for PR 25047 at commit
I have a few general questions; note I haven't looked at all of the code yet. I'm not an expert in standalone mode, but it supports both a client mode and a cluster mode. In your description, are you saying even the client mode will use the resource file and lock it? How do you know the client is running on a node with GPUs or a worker? I guess as long as the location is the same it doesn't matter. This is one thing in YARN we never handled; in client mode the user is on their own for resource coordination. It seems unreliable to assume you have multiple workers per node (for the case where a worker crashes). When the worker dies it automatically kills any executors, correct? Is there a chance it doesn't?
Test build #108714 has finished for PR 25047 at commit
private[spark] val SPARK_TASK_PREFIX = "spark.task"

private[spark] val SPARK_RESOURCES_COORDINATE =
  ConfigBuilder("spark.resources.coordinate.enable")
Did you mean to turn this to false? If so, we need to update configuration.md to match. I'm OK either way.
Oops, I misunderstood your comment. I'd prefer it to be true. The test failures are probably caused by this change.
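For example, disabling coordination (using the config name from this PR) would look like the sketch below; with it off, the user must hand each Worker/Driver on a host non-overlapping resources themselves:

import org.apache.spark.SparkConf

// assumes the default stays true, as preferred above
val conf = new SparkConf()
  .set("spark.resources.coordinate.enable", "false")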
test this please
  }
}

test("Workers should avoid resources conflict when launch from the same host") {
It would be nice to add a test with SPARK_RESOURCES_COORDINATE off to make sure all the resources from the file/discovery script are returned properly.
Good idea.
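A rough, assumed outline of the suggested test (the suite name and setup are hypothetical; the real test would live in WorkerSuite and start an actual Worker):

import org.apache.spark.SparkConf
import org.scalatest.funsuite.AnyFunSuite

class ResourcesCoordinationSuite extends AnyFunSuite {
  test("resources from file/discovery are returned unfiltered when coordination is off") {
    val conf = new SparkConf().set("spark.resources.coordinate.enable", "false")
    // With coordination disabled, a Worker started with a resources file listing
    // gpu addresses "0" and "1" should expose exactly those addresses,
    // i.e. nothing is excluded via allocated_resources.json.
  }
}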
docs/spark-standalone.md
<td>(none)</td>
<td>
  Path to resources file which is used to find various resources while worker starting up.
  The content of resources file should be formatted like the <code>ResourceAllocation</code> class.
I just realized this is supposed to be an array of ResourceAllocations
Exactly.
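To make that concrete, a hedged example of what such a file could contain, assuming the JSON fields mirror ResourceAllocation(id: ResourceID, addresses) with ResourceID carrying a component name and a resource name (the exact field names are an assumption here):

[
  {
    "id": { "componentName": "spark.worker", "resourceName": "gpu" },
    "addresses": ["0", "1", "2"]
  }
]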
Test build #108721 has finished for PR 25047 at commit

Test build #108725 has finished for PR 25047 at commit
@Ngone51 can you remove the [WIP] from the description? I think this is really close. I was going to make one more pass through, but things look good.

@tgravescs Thanks for the reminder. I have updated the title and description.
  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
    worker.coresFree >= coresPerExecutor)
  .filter(canLaunchExecutor(_, app.desc))
  .sortBy(_.coresFree).reverse
We have an issue here: if no Workers have the resources required by the executor/task requirements, it doesn't warn/error and it doesn't retry. Basically, I started a Worker without GPUs, then said I need GPUs for my executor tasks, and it ended up hanging. I suppose one could argue this is OK, since someone could start another Worker that has the resources, but I think we at least need to warn about it.
Actually, it does retry when other executors or drivers finish. But we can warn if an executor or driver requires more resources than any of the Workers could ever have. BTW, I'm wondering whether we have the same issue for memory and cores. For example, a Worker has at most 10 cores while an executor asks for 20 cores?
Right, if something changes (other app, other workers, etc.) it retries, but if I'm the only app on the cluster it's not clear why the app isn't launching. The one thing I don't want is for it to be too noisy either. I thought about that before making the comment, because like you said, if it's just out of resources because other apps are running, we don't really want to print anything. I think for now we should just limit it to resources and perhaps just say no Workers are configured with the resources you requested. If we can do that without much performance impact, let's do it. If not, maybe we just file a separate JIRA for it and look at it there.
How about this way?

for (app <- waitingApps) {
  ...
  val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
    .filter(canLaunchExecutor(_, app.desc))
    .sortBy(_.coresFree).reverse
  if (waitingApps.size == 1 && usableWorkers.isEmpty) {
    logWarning("The app requires more resources(mem, core, accelerator) than any of Workers could have.")
  }
  ...
}

Telling the user "the Workers are not configured with the resources (I mean accelerators) the app requested" may require more changes. For example, you may need to traverse the workers again to judge whether the failure is due to resources (accelerators), memory, or cores. Or you would need to refactor canLaunchExecutor to report more details.
That sounds good for now. Let's also leave it called "resource" since that is what it's called everywhere right now; just leave off the (mem, core, accelerator) part.
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
  if (canLaunchDriver(worker, driver.desc)) {
    val allocated = worker.acquireResources(driver.desc.resourceReqs)
    driver.withResources(allocated)
I'm assuming the driver could also fail to launch if the worker doesn't have any GPUs and the driver requests them; we may want to warn here as well.
 */
private[spark] case class ResourceAllocation(id: ResourceID, addresses: Seq[String]) {
@Evolving
case class ResourceAllocation(id: ResourceID, addresses: Seq[String]) {
Sorry, I just realized we made this public but ResourceID is still private. Let's make this private again and just put the format in the docs like you had it before. Again, sorry for switching on you here. If we end up thinking users will find this useful, we can open it up to be public later.
never mind.
Test build #108858 has finished for PR 25047 at commit
tgravescs left a comment
looks good, thanks @Ngone51
Thank you for your help @tgravescs
What changes were proposed in this pull request?
In this PR, we implement a complete process of GPU-aware resource scheduling
in Standalone. The whole process works as follows: a Worker sets up isolated resources
when it starts up and registers with the Master along with its resources. The Master
then picks usable Workers according to the driver/executor's resource requirements and
launches the driver/executor on them. The Worker launches the driver/executor after
preparing a resources file, created under the driver/executor's working directory,
containing the specific resource addresses assigned by the Master. When a driver/executor
finishes, its resources are recycled back to the Worker. Finally, when a Worker stops, it
always releases its resources first.
For the case where Workers and Drivers in client mode run on the same host, we introduce
a config option named spark.resources.coordinate.enable (default true) to indicate
whether Spark should coordinate resources for the user. If
spark.resources.coordinate.enable=false, the user is responsible for configuring different resources for Workers and Drivers when using a resourcesFile or discovery script. If true, Spark helps the user assign different resources to Workers and Drivers. The solution for Spark to coordinate resources among Workers and Drivers is:
Generally, we use a shared file named allocated_resources.json to sync allocated
resource info among Workers and Drivers on the same host.
After a Worker or Driver has found all resources using the configured resourcesFile and/or
discovery script during launch, it filters out the available resources by excluding resources already allocated in allocated_resources.json, and acquires resources from the remaining ones according to its own requirements. After that, it writes its allocated resources along with its process id (pid) into allocated_resources.json. The pid (proposed by @tgravescs) is used to check whether the allocated resources are still valid, in case a Worker or Driver crashes and doesn't release its resources properly. When a Worker or Driver finishes, it normally cleans up its own allocated resources in allocated_resources.json.
Note that we always take a file lock before any access to allocated_resources.json
and release the lock afterwards.
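A minimal sketch of that locking discipline, assuming a plain java.nio exclusive file lock wrapped around every read-modify-write of the shared file (an illustration of the idea, not the PR's exact utility):

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

def withResourcesLock[T](lockFilePath: String)(body: => T): T = {
  val channel = FileChannel.open(
    Paths.get(lockFilePath), StandardOpenOption.CREATE, StandardOpenOption.WRITE)
  val lock = channel.lock() // blocks until the exclusive lock is acquired
  try {
    // e.g. read allocated_resources.json, pick free addresses, write the updated content back
    body
  } finally {
    lock.release()
    channel.close()
  }
}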
Furthermore, we append resource info to WorkerSchedulerStateResponse to work around Master-change behaviour in HA mode.
How was this patch tested?
Added unit tests in WorkerSuite, MasterSuite, SparkContextSuite.
Manually tested client/cluster mode (e.g. multiple workers) in a single-node Standalone deployment.