[SPARK-27024] Executor interface for cluster managers to support GPU and other resources #24394

tgravescs · 2019-04-17T16:40:07Z

What changes were proposed in this pull request?

Add GPU and generic resource type allocation to the executors.

Note that this is part of a larger feature for GPU-aware scheduling and covers only how the executor finds its resources. The general flow:

Users ask for a certain set of resources, for instance a number of GPUs; each cluster manager has a specific way to do this.
The cluster manager allocates a container or a set of resources (standalone mode).
When Spark launches the executor in that container, the executor either has to be told what resources it has or has to auto-discover them.
The executor has to register with the driver and report the set of resources it has, so the scheduler can use that information to schedule tasks that require a certain amount of each of those resources.

In this PR I added configs and arguments to the executor so that it can discover resources. The executor argument is intended for standalone mode or other cluster managers that don't have isolation, so that the cluster manager can assign specific resources to specific executors when there are multiple executors on a node.

The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use.
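
The two paths described above (being told the resources versus discovering them) can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `resolve_resources`, its argument order, the sample spec string, and the rule that an explicit argument takes precedence over the script are all hypothetical and are not taken from this patch.

```shell
#!/bin/sh
# Hypothetical helper, not part of Spark.
#   $1: resource spec handed over by the cluster manager (may be empty)
#   $2: path to a discovery script (may be empty)
resolve_resources() {
  if [ -n "$1" ]; then
    # Non-isolated case: the cluster manager says which resources to use.
    printf '%s\n' "$1"
  elif [ -n "$2" ] && [ -x "$2" ]; then
    # Isolated case: the script only sees this executor's resources.
    "$2"
  else
    echo "no resource information available" >&2
    return 1
  fi
}

# Example with a made-up spec string; prints the spec unchanged.
resolve_resources 'gpu=2::0,1' ''
```

Whichever path is taken, the result is what the executor would report to the driver at registration.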

Note that there will be follow-on PRs to add other parts, such as the scheduler part. See the high-level epic JIRA: https://issues.apache.org/jira/browse/SPARK-24615

How was this patch tested?

Added unit tests and manually tested.

…resources

tgravescs · 2019-04-17T16:43:13Z

core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala


-      case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls, attributes) =>
+      case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls,
+          attributes, resources) =>


I realize this isn't used anywhere at this point; the follow-on JIRAs for the scheduler will use it. This seemed like a good point to split the functionality.

SparkQA · 2019-04-17T16:49:36Z

Test build #104669 has finished for PR 24394 at commit 916991e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2019-04-17T16:55:37Z

examples/src/main/resources/getGpuResources.sh

+# The script will return a string in the format: count:unit:comma-separated list of the resource addresses
+#
+
+ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | sed 'N;s/\n/,/'`


I haven't referenced this script in any documentation yet. I think that, as part of this SPIP, we should add some high-level descriptions of how it all flows (SPARK-27492).
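
One thing worth checking in the quoted line: `sed 'N;s/\n/,/'` joins input lines in pairs, so with more than two GPUs the result still spans several lines. The sketch below uses simulated index output rather than a real `nvidia-smi` call and shows one possible alternative with `paste`. The helper names `join_lines` and `format_resource` are hypothetical, and treating `count` as the number of addresses (with an empty `unit`) is an assumption about the `count:unit:addresses` format.

```shell
#!/bin/sh
# join_lines: read newline-separated values on stdin and print them
# comma-separated on a single line.
join_lines() {
  paste -sd, -
}

# format_resource (hypothetical): build "count:unit:addr1,addr2,..."
# from newline-separated addresses on stdin. $1 is the unit string.
# The body runs in a subshell so its variables do not leak.
format_resource() (
  unit="$1"
  lines="$(cat)"
  addrs="$(printf '%s\n' "$lines" | join_lines)"
  # Count non-empty lines; grep exits non-zero when the count is 0.
  count="$(printf '%s\n' "$lines" | grep -c . || true)"
  printf '%s:%s:%s\n' "$count" "$unit" "$addrs"
)

# Simulated output of:
#   nvidia-smi --query-gpu=index --format=csv,noheader
# on a host with four GPUs.

# Pairwise join: prints two lines, "0,1" and "2,3".
printf '0\n1\n2\n3\n' | sed 'N;s/\n/,/'

# Full join: prints "4::0,1,2,3".
printf '0\n1\n2\n3\n' | format_resource ''
```

On a one- or two-GPU host both approaches give the same single line, which may be why the difference is easy to miss in manual testing.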

SparkQA · 2019-04-17T17:55:41Z

Test build #104674 has finished for PR 24394 at commit 6ff9953.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-17T18:39:02Z

Test build #104678 has finished for PR 24394 at commit bee34a0.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2019-04-17T19:06:09Z

Looks like I need to build with Mesos.

tgravescs · 2019-04-17T19:06:43Z

core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

      |   --executor-id <executorId>
      |   --hostname <hostname>
      |   --cores <cores>
+      |   --resourceAddrs <rtype1=count:unit:addr1,addr2;rtype2=count:unit:r2addr1,r2addr2...>


Thinking about this some more, I think I should make this JSON format so it's more extensible in the future and not as ugly on the command line.
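
For reference, the delimiter-based format in the usage text above could be split along these lines. This is a rough sketch, not the parser in this patch: the function name `parse_resource_addrs`, the output layout, and the sample string (including the empty `unit` field) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical parser for "rtype=count:unit:addr1,addr2;rtype2=...".
# Prints one line per resource type. The body runs in a subshell so
# its variables do not leak into the caller.
parse_resource_addrs() (
  remaining="$1"
  while [ -n "$remaining" ]; do
    # Take everything up to the first ';' as the current entry.
    entry="${remaining%%;*}"
    case "$remaining" in
      *';'*) remaining="${remaining#*;}" ;;
      *)     remaining="" ;;
    esac
    rtype="${entry%%=*}"     # text before the first '='
    rest="${entry#*=}"       # count:unit:addresses
    count="${rest%%:*}"
    tail="${rest#*:}"        # unit:addresses
    unit="${tail%%:*}"
    addrs="${tail#*:}"
    printf 'type=%s count=%s unit=%s addrs=%s\n' \
      "$rtype" "$count" "$unit" "$addrs"
  done
)

# Made-up example value for --resourceAddrs.
parse_resource_addrs 'gpu=2::0,1;fpga=1::f0'
```

Three separate delimiters (`;`, `=`, `:`) plus commas inside the address list have to be kept from colliding with the values themselves, which illustrates the extensibility concern raised in the comment.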

tgravescs added 2 commits April 17, 2019 11:23

[SPARK-27024] Executor interface for cluster managers to support GPU …

5905f51

…resources

cleanup

916991e

tgravescs mentioned this pull request Apr 17, 2019

[SPARK-27366][CORE] Support GPU Resources in Spark job scheduling #24374

Closed

tgravescs changed the title ~~Gpu sched executor clean~~ [SPARK-27024] Executor interface for cluster managers to support GPU Apr 17, 2019

tgravescs changed the title ~~[SPARK-27024] Executor interface for cluster managers to support GPU~~ [SPARK-27024] Executor interface for cluster managers to support GPU and other resources Apr 17, 2019

fix style issues

6ff9953

add newline to test file

bee34a0

tgravescs closed this Apr 17, 2019
