[SPARK-27366][CORE] Support GPU Resources in Spark job scheduling #24374

jiangxb1987 · 2019-04-15T12:54:21Z

What changes were proposed in this pull request?

This PR adds support for scheduling tasks with extra resource requirements (e.g. GPUs) on executors with available resources. It also introduces a new method TaskContext.resources() so tasks can access the resource addresses allocated to them.
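
As a rough illustration of how a task might consume this, the sketch below uses a stand-in ResourceInformation case class (an assumed shape, modeled on the class this PR adds) and a plain Map in place of the value returned by TaskContext.resources(). The helper name gpuAddressesFor is hypothetical and not part of the patch.

```scala
object TaskResourcesSketch {
  // Stand-in for the ResourceInformation class added by this PR (assumed shape:
  // a resource name plus the addresses allocated to the task).
  final case class ResourceInformation(name: String, addresses: Seq[String])

  // Hypothetical helper: look up the GPU addresses a task was given, or an
  // empty list if the task was not allocated any GPUs.
  def gpuAddressesFor(resources: Map[String, ResourceInformation]): Seq[String] =
    resources.get("gpu").map(_.addresses).getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    // In a real task this map would come from TaskContext.resources().
    val resources = Map("gpu" -> ResourceInformation("gpu", Seq("0", "1")))
    println(gpuAddressesFor(resources).mkString(","))
  }
}
```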

How was this patch tested?

Added new end-to-end test cases in SparkContextSuite;
Added new test case in CoarseGrainedSchedulerBackendSuite;
Added new test case in CoarseGrainedExecutorBackendSuite;
Added new test case in TaskSchedulerImplSuite;
Added new test case in TaskSetManagerSuite;
Updated existing tests.

jiangxb1987 · 2019-04-15T12:55:28Z

cc @tgravescs @squito @mengxr @cloud-fan

SparkQA · 2019-04-15T13:02:38Z

Test build #104586 has finished for PR 24374 at commit aa0d9ae.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ResourceInformation(
case class StatusUpdate(

srowen · 2019-04-15T13:54:37Z

core/src/main/scala/org/apache/spark/ResourceDiscoverer.scala

Nit: "GPU" in all cases where it's referring to the hardware

srowen · 2019-04-15T13:55:48Z

core/src/main/scala/org/apache/spark/ResourceInformation.scala

What would units be here -- something like CUDA cores or GPU memory? Below I just see "gpu" and "gpu.count", but there is already a separate count field.
Also, this doesn't account for type, right? Is that what 'units' is supposed to help with?

For GPUs the units would be empty. The idea is that if a resource has a unit, like memory, you can express it in GiB, MiB, etc.

Makes sense, and maybe I missed it, but are there docs or examples of this? Does the script actually discover this information about anything? It looks like it's just finding the count and addresses of GPUs.

The example script provided only handles GPUs, and units don't apply to GPUs, so it's just finding the addresses. I agree with you that we need some more docs and I'll comment the script better.

srowen · 2019-04-15T13:56:12Z

core/src/main/scala/org/apache/spark/ResourceInformation.scala

Just make this a case class? you don't need getters then

This is a user-facing class, and personally I think it's better to have a real class with getters as a more formal API for the user; it potentially gives us the ability to make changes more easily than having the parameters always public.

That's what a case class is though; it's just auto-generated. You can make parameters private if you want to not expose a getter.

That's true. I guess I was just thinking it was more flexible, but I really can't think of anything that would need to be mutable here, so I'll change it.

Later, you can use its equals/== method and toString without having to write additional code too (or if needs to be customized, define it here)
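
To make the case-class point concrete, here is a minimal sketch. ResourceInfo is a hypothetical simplified shape, not the class in this PR, and it uses Seq rather than Array for the addresses because the generated equals compares Array fields by reference rather than by content.

```scala
object CaseClassSketch {
  // Hypothetical simplified shape. Accessors, equals, hashCode and toString
  // are all generated by the compiler for a case class.
  final case class ResourceInfo(name: String, units: String, addresses: Seq[String])

  def main(args: Array[String]): Unit = {
    val a = ResourceInfo("gpu", "", Seq("0", "1"))
    val b = ResourceInfo("gpu", "", Seq("0", "1"))
    println(a == b) // structural equality without hand-written code
    println(a)      // generated toString
  }
}
```

A parameter can still be hidden by declaring it `private val` in the parameter list if an accessor should not be exposed.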

srowen · 2019-04-15T13:58:41Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

"comma-separated"

srowen · 2019-04-15T13:59:42Z

core/src/main/scala/org/apache/spark/scheduler/SchedulerResourceInformation.scala

No big deal but how about just one method that can change the count by a positive or negative amount?

srowen · 2019-04-15T14:06:58Z

core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala

Does this one take value like "1m"?

srowen · 2019-04-15T14:07:42Z

core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala

Just to be tidy, can the members here be private? or use override if they override superclass methods

srowen · 2019-04-15T14:08:15Z

core/src/test/scala/org/apache/spark/scheduler/TaskDescriptionSuite.scala

As above, how about defining an equals method in ResourceInformation? You get this for free with a case class.

srowen · 2019-04-15T14:08:45Z

core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala

Should the new resources arg to resourceOffer be optional, so that not all these callers have to pass a new empty thing?

srowen · 2019-04-15T14:09:18Z

docs/configuration.md

Do we need this for the driver too?

tgravescs · 2019-04-15T13:25:50Z

core/src/main/scala/org/apache/spark/ResourceDiscoverer.scala

This is actually used on both the executor and the driver, so we should update the comment.

tgravescs · 2019-04-15T13:47:50Z

core/src/main/scala/org/apache/spark/scheduler/SchedulerResourceInformation.scala

remove the extra logInfo messages here and below

tgravescs · 2019-04-15T13:58:56Z

core/src/test/scala/org/apache/spark/executor/ResourceDiscovererSuite.scala

remove extra line

tgravescs · 2019-04-15T14:02:11Z

core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala

not needed here

tgravescs · 2019-04-15T14:02:19Z

core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala

tgravescs · 2019-04-15T14:04:45Z

docs/configuration.md

we need to add in the spark.driver.resource.gpu.discoveryScript and spark.driver.resource.gpu.addresses here

tgravescs · 2019-04-15T14:05:07Z

docs/configuration.md

indentation looks off

tgravescs · 2019-04-15T14:07:24Z

examples/src/main/resources/getGpuResources.sh

add license header.

Also it might be good if we reference this script from some of the discoveryScript documentation

Do cluster managers do something special so that with multiple executors, a script like this finds the right subset of GPUs? I assume standalone does not, but maybe others do? That is worth mentioning somewhere as well (it's also fine if that part of the story is in a later JIRA).

The script as is will find all GPUs visible to it, so if the cluster manager doesn't isolate them it will find all of the GPUs on the host. In standalone mode, I would expect users to use the --gpuDevices parameter to the executor. It would be good to add a warning about that in the config description and here, or say the config isn't supported in standalone mode.

tgravescs · 2019-04-15T14:45:00Z

This PR also contains more than what is needed for this particular JIRA: SPARK-27024 and SPARK-27374. We should either add those JIRAs to the header here or split it apart. If it's hard to split apart and others are OK with it, I'm OK with it.

Please note that I wrote a bunch of this code for the executor side and discovery changes, so someone else will have to officially approve this.

kiszk · 2019-04-15T15:38:29Z

core/src/main/scala/org/apache/spark/ResourceInformation.scala

Is this used anywhere?

viirya · 2019-04-15T15:02:05Z

core/src/main/scala/org/apache/spark/TaskContext.scala

In the doc, it might be better to explain what the keys are.

viirya · 2019-04-15T15:09:17Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

wrong indent?

viirya · 2019-04-15T15:15:47Z

core/src/main/scala/org/apache/spark/ResourceInformation.scala

Where is this used?

viirya · 2019-04-15T15:21:44Z

core/src/main/scala/org/apache/spark/ResourceDiscoverer.scala

It reads as though the no-GPU case is expected, so it seems the proposal is to return an empty array instead of just throwing a SparkException?

viirya · 2019-04-15T15:30:27Z

core/src/main/scala/org/apache/spark/SparkConf.scala

Why do we need this requirement?

I think it would help if the exception was more detailed about how resources would get wasted, though it's a bit of a pain to do that (e.g. 4 cores and 3 GPUs means 1 core will always be idle, etc.).

viirya · 2019-04-15T15:32:09Z

core/src/main/scala/org/apache/spark/SparkContext.scala

Is this just for GPUs on the driver? Will it be used?

Yes, GPUs on the driver. I think the main use case is standalone mode, or when someone doesn't have isolation. They can't just look on the host and take all the GPUs, as they should only use the ones the cluster manager assigned to them.
Yes, it will be by convention, but it's better than what we have now.

viirya · 2019-04-15T15:35:43Z

core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

"gpu" -> ResourceInformation.GPU?

viirya · 2019-04-15T15:39:13Z

core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

User specified GPU resource for task: $GPUS_PER_TASK, but can't find any GPU resources available on the executor.

viirya · 2019-04-15T15:57:14Z

docs/configuration.md

Remove extra space before This.

viirya · 2019-04-15T15:57:39Z

docs/configuration.md

This should return -> This script should return

viirya · 2019-04-15T16:11:01Z

core/src/test/scala/org/apache/spark/SparkConfSuite.scala

Based on the requirement in SparkConf, it seems we don't allow set(GPUS_PER_TASK.key, "1") for this setting, right? But I think it is a reasonable case.

kiszk · 2019-04-15T19:23:08Z

core/src/main/scala/org/apache/spark/SparkContext.scala

Why does this line use "gpu" instead of ResourceInformation.GPU?

squito

I didn't go through tests carefully yet, but otherwise just minor stuff, no major red flags

squito · 2019-04-16T01:20:19Z

core/src/main/scala/org/apache/spark/ResourceDiscoverer.scala

maybe mention the config to use here?

squito · 2019-04-16T01:30:25Z

core/src/main/scala/org/apache/spark/ResourceDiscoverer.scala

You don't need to import these since it's the same package (sbt warns pretty loudly about this).

squito · 2019-04-16T01:30:57Z

core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

Is this used? If so, maybe better to say directly what is being imported, as it's an unusual import.

It was being used for stringOf in the log statement to print the Array; we could just use mkString instead.

squito · 2019-04-16T01:31:52Z

core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

nit: map { ids =>

squito · 2019-04-16T01:41:03Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

I think I'm missing something; I don't understand what this comment is referring to.

squito · 2019-04-16T01:53:12Z

core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

along the same lines -- how about removing availableResources from the parameter list entirely, and just have it get created inside ExecutorData from totalResources

squito · 2019-04-16T01:53:43Z

core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

.foreach { r =>

squito · 2019-04-16T01:56:36Z

core/src/main/scala/org/apache/spark/SparkConf.scala

I think it would help if the exception was more detailed about how resources would get wasted, though it's a bit of a pain to do that (e.g. 4 cores and 3 GPUs means 1 core will always be idle, etc.).
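
The arithmetic behind that example can be sketched as follows. This is an illustrative calculation only; the names slots and idleUnits are hypothetical, not helpers from the patch, and idleUnits assumes at least one per-task requirement is given.

```scala
object WastedResourcesSketch {
  // Number of tasks that can run concurrently given one resource's
  // per-executor amount and per-task requirement.
  def slots(perExecutor: Int, perTask: Int): Int = perExecutor / perTask

  // For each resource, how many units stay idle once the scarcest resource
  // caps the number of concurrent tasks. Assumes perTask is non-empty.
  def idleUnits(perExecutor: Map[String, Int], perTask: Map[String, Int]): Map[String, Int] = {
    val numSlots = perTask.map { case (name, need) => slots(perExecutor(name), need) }.min
    perTask.map { case (name, need) => name -> (perExecutor(name) - numSlots * need) }
  }

  def main(args: Array[String]): Unit = {
    // 4 cores and 3 GPUs per executor; each task needs 1 core and 1 GPU.
    val idle = idleUnits(Map("cores" -> 4, "gpu" -> 3), Map("cores" -> 1, "gpu" -> 1))
    println(idle)
  }
}
```

With those inputs the GPUs limit the executor to 3 concurrent tasks, which leaves 1 core idle and no GPUs idle.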

squito · 2019-04-16T02:01:51Z

examples/src/main/resources/getGpuResources.sh

Do cluster managers do something special so that with multiple executors, a script like this finds the right subset of GPUs? I assume standalone does not, but maybe others do? That is worth mentioning somewhere as well (it's also fine if that part of the story is in a later JIRA).

squito · 2019-04-16T02:04:42Z

core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala

I think it's confusing that we call them "Indices" but they're strings, not ints. But then again, you also have code to check that they can parse as ints. Can this be made consistent? I think it's preferable if we really know they are ints and can change all of the types from String to Int, and then use gpuIndices consistently. But if that is not the case, it's also OK to instead use gpuAddresses consistently.

mengxr · 2019-04-16T04:57:20Z

@jiangxb1987 I think we should reduce the scope of this PR:

Remove auto-discovery script and executor interface.
Do not consider gpu as a special hardcoded resource name.
Use static conf resource.accelerator_name.addresses to load accelerator addresses and resource.accelerator_name.count as request.

In this way, we can make this PR smaller and leave discovery and executor interface as follow-up work.

tgravescs · 2019-04-16T13:08:14Z

@mengxr your point 3 I think has to deal with the executor side and the configs, so if we split those it would apply there. I'm also not sure what you mean by using a static conf there, how do you have a static conf that isn't hardcoded? You would have to build it on the fly to put in the resource type/accelerator_name.

If we want to remove gpu as known (hardcoded) resource name then we should go all the way generic on the scheduler pieces. I was going to do that in a follow on PR but we could do it here as well if people prefer. Or are you just saying remove the hardcoded configs and still only do gpu for now?

tgravescs · 2019-04-16T13:53:18Z

I'll go ahead and try to split out the executor and discovery pieces from this. I think that may be easier to PR first on its own and then build the scheduler pieces next, since the scheduler will need the resource information from the executor to make its decisions. I'll try to apply everyone's comments above to that part and update here if it works out.

squito · 2019-04-16T15:32:42Z

One general thought I have -- there seem to be a lot of changes to do general resource tracking, though only GPUs are supported here. These are all internal classes, so I'm wondering whether it's useful to even put in those abstractions now. Is FPGA support (or whatever other special hardware) still years away? If nobody has at least experimented with it at all, are we sure that the generalizations you're putting in would even be useful in those cases?

I don't really know anything about other accelerators, so I don't have any strong feelings here, just a general concern about putting in abstractions too early. Just wanted to mention it, I'll leave it up to you.

tgravescs · 2019-04-16T15:46:56Z

I do know there is a company that sells a product to run Spark on FPGAs. I've also seen multiple talks on it. But I don't know how close it is to being usable by general users.

Personally I think it would be good to just do the generic thing as long as it doesn't add much overhead to the scheduler. I can see people wanting to add in possibly other information like GPU type, memory, etc. I can see a use for virtual GPU's if they want to share a GPU.

I was thinking if we put in the generic pieces people could more easily experiment without having to change core pieces.

squito · 2019-04-16T16:29:47Z

ok that makes sense, like I said I'm willing to trust your judgement on that. Might be helpful to reach out to that company to at least see if they see any red flags with this approach.

tgravescs · 2019-04-16T21:51:21Z

It took me a bit longer to split the executor-side stuff out and make it fully generic; I should have a PR up tomorrow. My thought was I could put that up, which gives the base for @jiangxb1987 to do the scheduler changes on top of.

tgravescs · 2019-04-17T16:45:16Z

I submitted PR #24394 to replace this. @jiangxb1987 can we close this one?

Thanks for everyone's reviews. I tried to apply everyone's comments from here that were applicable to that PR; if I missed anything, please comment there.

jiangxb1987 · 2019-04-17T17:00:15Z

@tgravescs Sure I'm closing this one, thanks!

tgravescs · 2019-04-18T15:49:29Z

sorry for any confusion, the executor pr is #24406

SparkQA · 2019-04-18T18:49:12Z

Test build #104716 has finished for PR 24374 at commit a251551.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-18T19:12:32Z

Test build #104717 has finished for PR 24374 at commit 40eaca0.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-18T19:25:38Z

Test build #104720 has finished for PR 24374 at commit fe7ed1b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-18T21:53:24Z

Test build #104722 has finished for PR 24374 at commit 054806a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-03T06:13:21Z

Test build #105098 has finished for PR 24374 at commit 5d5bd46.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ResourceInformation(

SparkQA · 2019-06-01T00:44:49Z

Test build #106038 has finished for PR 24374 at commit f388338.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-01T03:01:28Z

Test build #106040 has finished for PR 24374 at commit d15a51d.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-06-01T03:13:56Z

Test build #106042 has finished for PR 24374 at commit dcc147e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2019-06-03T13:18:35Z

core/src/main/scala/org/apache/spark/SparkContext.scala

+          if (Utils.isTesting) {
+            throw new SparkException(message)
+          } else {
+            logWarning(message)


Originally we talked about throwing here to not allow it; I just want to make sure we intentionally changed our mind here. I'm really OK either way we go, as there were some people questioning this on the SPIP.

Since we now have TaskSchedulerImpl.resourcesMeetTaskRequirements() to ensure there are enough resources before scheduling a task, I think it's safe to just place a warning here.

I prefer a warning because the discovery script might return more and it is out of user's control. And available resources might not happen to be a multiple of task requested counts. For example, you have 32 CPU Cores and 3 GPUs.

tgravescs · 2019-06-03T13:35:58Z

core/src/main/scala/org/apache/spark/SparkContext.scala

+        if (execCount.toInt / taskCount.toInt != numSlots) {
+          val message = s"The value of executor resource config: " +
+            s"${SPARK_EXECUTOR_RESOURCE_PREFIX + rName + SPARK_RESOURCE_COUNT_SUFFIX} " +
+            s"= $execCount is more than that tasks can take: $numSlots * " +


I don't think this makes clear to the user what is wrong. The real problem now is that this resource's ratio isn't the same as some other resource's ratio.
Can we change this message to be more like:

The configuration of resource: rName (exec = X, task = y) will result in wasted resources due to resource $limitingResourceName limiting the # of runnable tasks per executor to: numslots. Please adjust your configuration.

tgravescs · 2019-06-03T14:04:29Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+    val name: String,
+    private val addresses: Seq[String]) extends Serializable {
+
+  private val addressesMap = new HashMap[String, Boolean]()


can we call this addressesAllocatedMap or similar

tgravescs · 2019-06-03T14:05:05Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+   */
+  def acquire(addrs: Seq[String]): Unit = {
+    addrs.foreach { address =>
+      val isAvailable = addressesMap.getOrElse(address, false)


Can we rename isAvailable to isAssigned or vice versa to keep acquire and release consistent?

When the address doesn't exist we may also want to throw an exception. Added more comments to make it clear.

tgravescs · 2019-06-03T17:44:33Z

core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala

+            task.resources.foreach { case (rName, rInfo) =>
+              availableResources(i).getOrElse(rName,
+                throw new SparkException(s"Try to acquire resource $rName that doesn't exist."))
+                .remove(0, rInfo.addresses.size)


I think this is worth a comment saying that removing the first x elements (rather than the exact ones allocated) is equivalent to what we allocated in taskSet.resourceOffer, since it's synchronized.
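
A minimal sketch of why taking and then removing the same prefix agree, assuming nothing else modifies the buffer in between (which is what the synchronization provides). The helper takeFirst is hypothetical and not code from this PR.

```scala
import scala.collection.mutable.ArrayBuffer

object RemovePrefixSketch {
  // Hypothetical helper: hand out the first `n` free addresses and drop them
  // from the buffer. The same prefix is copied and then removed, so the two
  // steps agree as long as no other code touches the buffer in between.
  def takeFirst(free: ArrayBuffer[String], n: Int): Seq[String] = {
    require(n <= free.size, s"requested $n addresses but only ${free.size} are free")
    val taken = free.take(n).toList // copy the prefix before mutating
    free.remove(0, n)               // drop n elements starting at index 0
    taken
  }
}
```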

SparkQA · 2019-06-04T07:05:03Z

Test build #106135 has finished for PR 24374 at commit cd01cae.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2019-06-04T13:29:17Z

test this please

SparkQA · 2019-06-04T15:34:00Z

Test build #106152 has finished for PR 24374 at commit cd01cae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr

will check the tests later today

mengxr · 2019-06-04T15:47:17Z

core/src/main/scala/org/apache/spark/SparkContext.scala

+      // large enough if any task resources were specified.
+      taskResourcesAndCount.foreach { case (rName, taskCount) =>
+        val execCount = executorResourcesAndCounts(rName)
+        if (execCount.toInt / taskCount.toInt != numSlots) {


9 / 4 == 2. Use taskCount.toInt * numSlots < execCount.toInt
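
The difference between the two checks can be sketched in isolation. The function names below are hypothetical; only the two boolean expressions come from the diff and the suggestion above.

```scala
object SlotCheckSketch {
  // Check from the diff: integer division rounds down, so with execCount = 9,
  // taskCount = 4 and numSlots = 2 it reports no mismatch (9 / 4 == 2).
  def mismatchByDivision(execCount: Int, taskCount: Int, numSlots: Int): Boolean =
    execCount / taskCount != numSlots

  // Suggested check: true whenever the executor has more of the resource
  // than numSlots tasks can consume.
  def wastesResources(execCount: Int, taskCount: Int, numSlots: Int): Boolean =
    taskCount * numSlots < execCount
}
```

For execCount = 9, taskCount = 4, numSlots = 2, the division-based check stays silent while the multiplication-based check flags the one leftover unit.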

mengxr · 2019-06-04T15:48:57Z

core/src/main/scala/org/apache/spark/SparkContext.scala

+          if (Utils.isTesting) {
+            throw new SparkException(message)
+          } else {
+            logWarning(message)


I prefer a warning because the discovery script might return more and it is out of user's control. And available resources might not happen to be a multiple of task requested counts. For example, you have 32 CPU Cores and 3 GPUs.

mengxr · 2019-06-04T15:55:24Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+ */
+private[spark] class ExecutorResourceInfo(
+    val name: String,
+    private val addresses: Seq[String]) extends Serializable {


Remove private val. addresses doesn't need to be a member variable.

mengxr · 2019-06-04T15:55:57Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+    val name: String,
+    private val addresses: Seq[String]) extends Serializable {
+
+  private val addressesAllocatedMap = new HashMap[String, Boolean]()


Could you leave a TODO here to test OpenHashMap performance?

Rename addressesAllocatedMap to addressAvailabilityMap

mengxr · 2019-06-04T15:56:23Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.HashMap


Shall we only import mutable and use mutable.HashMap in code?

mengxr · 2019-06-04T16:13:46Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+    addrs.foreach { address =>
+      val isAvailable = addressesAllocatedMap.getOrElse(address, false)
+      if (isAvailable) {
+        addressesAllocatedMap(address) = false


Could you update the class ScalaDoc and mention that this class is intended to be used in a single thread?

mengxr · 2019-06-04T16:14:31Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+   */
+  def release(addrs: Seq[String]): Unit = {
+    addrs.foreach { address =>
+      val isAssigned = addressesAllocatedMap.getOrElse(address, true)


The isAssigned name is really confusing. It should be isAvailable.

Same here. Separate the nonexistent case from the assigned case.
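
One way the naming and error-handling suggestions in this thread could fit together is sketched below. AddressTracker is a simplified stand-in for ExecutorResourceInfo, and the standard Java exceptions are an assumption made to keep the sketch self-contained (the patch itself throws SparkException elsewhere).

```scala
import scala.collection.mutable

object AddressTrackerSketch {
  // Simplified stand-in for ExecutorResourceInfo. Not thread-safe: intended
  // to be used from a single thread.
  class AddressTracker(val name: String, addresses: Seq[String]) {
    // address -> true when free, false when assigned to a task
    private val addressAvailabilityMap = mutable.HashMap(addresses.map(_ -> true): _*)

    def availableAddrs: Seq[String] =
      addressAvailabilityMap.collect { case (addr, true) => addr }.toSeq.sorted

    def acquire(addrs: Seq[String]): Unit = addrs.foreach { address =>
      addressAvailabilityMap.get(address) match {
        case None =>
          throw new IllegalArgumentException(s"$name address $address doesn't exist")
        case Some(false) =>
          throw new IllegalStateException(s"$name address $address is already assigned")
        case Some(true) =>
          addressAvailabilityMap(address) = false
      }
    }

    def release(addrs: Seq[String]): Unit = addrs.foreach { address =>
      addressAvailabilityMap.get(address) match {
        case None =>
          throw new IllegalArgumentException(s"$name address $address doesn't exist")
        case Some(true) =>
          throw new IllegalStateException(s"$name address $address is not assigned")
        case Some(false) =>
          addressAvailabilityMap(address) = true
      }
    }
  }
}
```

Looking the address up with `get` instead of `getOrElse` keeps the "unknown address" case distinct from the "wrong state" case in both methods.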

mengxr · 2019-06-04T16:19:23Z

core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala

+   * Exposed for testing only.
+   */
+  private[scheduler] def assignedAddrs: Seq[String] =
+    addressesAllocatedMap.toList.filter(_._2 == false).map(_._1)


mengxr · 2019-06-04T16:34:18Z

core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala

      host: String,
-      maxLocality: TaskLocality.TaskLocality)
+      maxLocality: TaskLocality.TaskLocality,
+      availableResources: Map[String, Buffer[String]] = Map.empty)


Could you change the value type to Seq[String] since this method shouldn't change it?

mengxr · 2019-06-04T16:35:10Z

core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

          executorDataMap.get(executorId) match {
            case Some(executorInfo) =>
              executorInfo.freeCores += scheduler.CPUS_PER_TASK
+              for ((k, v) <- resources) {


resources.foreach {

mengxr · 2019-06-04T16:47:53Z

core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala

+  private def serializeResources(map: immutable.Map[String, ResourceInformation],
+      dataOut: DataOutputStream): Unit = {
+    dataOut.writeInt(map.size)
+    for ((key, value) <- map) {


map.foreach

mengxr · 2019-06-04T16:48:46Z

core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala

+      dataOut.writeUTF(key)
+      dataOut.writeUTF(value.name)
+      dataOut.writeInt(value.addresses.size)
+      for (identifier <- value.addresses) {


value.addresses.foreach

mengxr · 2019-06-04T16:49:30Z

core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala

+      immutable.Map[String, ResourceInformation] = {
+    val map = new HashMap[String, ResourceInformation]()
+    val mapSize = dataIn.readInt()
+    for (i <- 0 until mapSize) {


use while instead of for
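
Putting the loop suggestions for TaskDescription together, a self-contained round trip might look like the sketch below. ResourceInfo is a hypothetical stand-in for ResourceInformation, and roundTrip exists only to exercise the two methods; the wire layout follows the writeInt/writeUTF calls shown in the diff.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

object ResourceSerdeSketch {
  // Hypothetical stand-in for ResourceInformation.
  final case class ResourceInfo(name: String, addresses: Seq[String])

  def serialize(map: Map[String, ResourceInfo], out: DataOutputStream): Unit = {
    out.writeInt(map.size)
    map.foreach { case (key, value) =>
      out.writeUTF(key)
      out.writeUTF(value.name)
      out.writeInt(value.addresses.size)
      value.addresses.foreach(address => out.writeUTF(address))
    }
  }

  def deserialize(in: DataInputStream): Map[String, ResourceInfo] = {
    val builder = Map.newBuilder[String, ResourceInfo]
    val mapSize = in.readInt()
    var i = 0
    // Plain while loops, as suggested in review, instead of for-comprehensions.
    while (i < mapSize) {
      val key = in.readUTF()
      val name = in.readUTF()
      val numAddresses = in.readInt()
      val addresses = new ArrayBuffer[String](numAddresses)
      var j = 0
      while (j < numAddresses) {
        addresses += in.readUTF()
        j += 1
      }
      builder += (key -> ResourceInfo(name, addresses.toList))
      i += 1
    }
    builder.result()
  }

  // Helper for trying the pair out: serialize to bytes and read them back.
  def roundTrip(map: Map[String, ResourceInfo]): Map[String, ResourceInfo] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    serialize(map, out)
    out.flush()
    deserialize(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
  }
}
```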

mengxr · 2019-06-04T16:49:41Z

core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala

+      val name = dataIn.readUTF()
+      val numIdentifier = dataIn.readInt()
+      val identifiers = new ArrayBuffer[String](numIdentifier)
+      for (j <- 0 until numIdentifier) {


mengxr · 2019-06-04T16:56:00Z

core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala

 * @param executorHost The hostname that this executor is running on
 * @param freeCores  The current number of cores available for work on the executor
 * @param totalCores The total number of cores available to the executor
+ * @param totalResources The information of all resources on the executor


It seems totalResources is only used in test. Shall we remove it and rename availableResources to resourceInfo? It carries all the info we need.

We can remove it now if you want. It will be needed for the UI work, but we can add it back in there if you want.

resourceInfo is a superset, and extra info there is also needed by UI

Maybe we should move resourceInfo into ExecutorInfo ? Or we shall do it later when we consider the UI work for extra resources?

I would say we try to get this one merged and we can do UI separate and any changes needed there can be done there. Just remove totalResources for now

SparkQA · 2019-06-04T22:37:20Z

Test build #106164 has finished for PR 24374 at commit e539097.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-04T23:52:07Z

Test build #106167 has finished for PR 24374 at commit 82cd1e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2019-06-04T23:59:02Z

LGTM and merged into master. Thanks!

jiangxb1987 · 2019-06-05T00:02:32Z

Thanks very much! @tgravescs @mengxr @squito @srowen @viirya @kiszk

## What changes were proposed in this pull request?

This PR adds support to schedule tasks with extra resource requirements (eg. GPUs) on executors with available resources. It also introduce a new method `TaskContext.resources()` so tasks can access available resource addresses allocated to them.

## How was this patch tested?

* Added new end-to-end test cases in `SparkContextSuite`;
* Added new test case in `CoarseGrainedSchedulerBackendSuite`;
* Added new test case in `CoarseGrainedExecutorBackendSuite`;
* Added new test case in `TaskSchedulerImplSuite`;
* Added new test case in `TaskSetManagerSuite`;
* Updated existing tests.

Closes apache#24374 from jiangxb1987/gpu.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>

srowen requested changes Apr 15, 2019

tgravescs reviewed Apr 15, 2019

kiszk reviewed Apr 15, 2019

viirya reviewed Apr 15, 2019

kiszk reviewed Apr 15, 2019

squito reviewed Apr 16, 2019

jiangxb1987 closed this Apr 17, 2019

jiangxb1987 reopened this Apr 18, 2019

jiangxb1987 changed the title ~~[SPARK-27366][CORE] Support GPU Resources in Spark job scheduling~~ [WIP][SPARK-27366][CORE] Support GPU Resources in Spark job scheduling Apr 18, 2019

jiangxb1987 force-pushed the gpu branch from a251551 to 40eaca0 Compare April 18, 2019 19:01

jiangxb1987 force-pushed the gpu branch from 054806a to 5d5bd46 Compare May 3, 2019 03:57

jiangxb1987 added 8 commits May 31, 2019 17:39

add ExecutorResourceInfoSuite

41e9440

add test for CoarseGrainedExecutorBackend and CoarseGrainedSchedulerBackend

d3f4a03

update failed test cases

26126fc

update config check

bbac893

Only modify ExecutorResourceInfo inside SchedulerBackend.

10d5fab

add comments

7844e5c

update ExecutorResourceInfo

04fa380

update config check and comments

dcc147e

jiangxb1987 force-pushed the gpu branch from d15a51d to dcc147e Compare June 1, 2019 00:42

tgravescs reviewed Jun 3, 2019

update config check logic and update comments

cd01cae

mengxr requested changes Jun 4, 2019

jiangxb1987 added 2 commits June 4, 2019 13:35

code cleanup and update test cases

e539097

remove totalResources from ExecutorInfo

82cd1e3

mengxr approved these changes Jun 4, 2019

asfgit closed this in ac808e2 Jun 5, 2019

