[SPARK-27024] Executor interface for cluster managers to support GPU and other resources #24394
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls, attributes) =>
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls,
    attributes, resources) =>
I realize this isn't used anywhere at this point; the follow-on JIRAs for the scheduler will use it. This seemed like a good point to split the functionality.
Test build #104669 has finished for PR 24394 at commit
# The script will return a string in the format: count:unit:comma-separated list of the resource addresses
#
ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | sed 'N;s/\n/,/'`
I haven't referenced this script in any documentation yet. I think that, as part of this SPIP, we should add some high-level descriptions of how it all flows (SPARK-27492).
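For illustration, here is a minimal sketch of a discovery helper that emits the `count:unit:comma-separated addresses` string described in the script comment. The function name `format_resource_line` and the empty default unit are assumptions made for this example, not part of the PR. It joins every input line with `paste`, because the `sed 'N;s/\n/,/'` expression in the diff appears to join lines only two at a time, which would leave more than one output line on a node with more than two GPUs.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): reads one resource address per line
# on stdin and prints a single line of the form count:unit:addr1,addr2,...
# The function body runs in a subshell so its variables stay local.
format_resource_line() (
  unit=${1:-}                 # assumption: unit is left empty when not given
  addrs=$(paste -s -d, -)     # join ALL input lines with commas
  if [ -z "$addrs" ]; then
    count=0                   # no addresses were discovered
  else
    # number of comma-separated fields = number of addresses
    count=$(printf '%s\n' "$addrs" | awk -F, '{print NF}')
  fi
  printf '%s:%s:%s\n' "$count" "$unit" "$addrs"
)

# Example with four fake GPU indices instead of real nvidia-smi output:
printf '0\n1\n2\n3\n' | format_resource_line    # prints 4::0,1,2,3
```

On a machine with NVIDIA drivers installed, the same helper could be fed from the query used in the diff: `nvidia-smi --query-gpu=index --format=csv,noheader | format_resource_line`.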
Test build #104674 has finished for PR 24394 at commit
Test build #104678 has finished for PR 24394 at commit
Looks like I need to build with Mesos.
|   --executor-id <executorId>
|   --hostname <hostname>
|   --cores <cores>
|   --resourceAddrs <rtype1=count:unit:addr1,addr2;rtype2=count:unit:r2addr1,r2addr2...>
Thinking about this some more, I think I should make this JSON format so it's more extensible in the future and not as ugly on the command line.
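To make the colon-and-semicolon format in the usage string concrete, here is a rough sketch of how such a value could be split into its parts. It covers only the `rtype=count:unit:addrs` form shown in the diff, not the JSON format proposed in the comment. The function name `parse_resource_addrs`, the sample resource names, and the output layout are assumptions made for this example; the executor's actual parsing lives in the PR's Scala code and is not reproduced here.

```shell
#!/bin/sh
# Hypothetical parser (not from the PR) for a value of the form
#   rtype1=count:unit:addr1,addr2;rtype2=count:unit:r2addr1,r2addr2
# It prints one line per resource type. The function body runs in a subshell
# so its variables stay local. Malformed entries are not validated.
parse_resource_addrs() (
  remaining=$1
  while [ -n "$remaining" ]; do
    entry=${remaining%%;*}              # text before the first ';'
    case $remaining in
      *\;*) remaining=${remaining#*;} ;;   # drop the entry just consumed
      *)    remaining= ;;                  # that was the last entry
    esac
    [ -n "$entry" ] || continue         # skip empty segments such as ';;'
    rtype=${entry%%=*}                  # resource type, before '='
    value=${entry#*=}                   # count:unit:addrs
    count=${value%%:*}
    value=${value#*:}                   # unit:addrs
    unit=${value%%:*}                   # may be empty
    addrs=${value#*:}                   # comma-separated addresses
    printf 'type=%s count=%s unit=%s addrs=%s\n' "$rtype" "$count" "$unit" "$addrs"
  done
)

parse_resource_addrs 'gpu=2::0,1;fpga=1:mb:f0'
# prints:
#   type=gpu count=2 unit= addrs=0,1
#   type=fpga count=1 unit=mb addrs=f0
```

The number of delimiters this sketch has to track (`;`, `=`, `:` and `,`) illustrates why a JSON encoding would be easier to extend.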
What changes were proposed in this pull request?
Add in GPU and generic resource type allocation to the executors.
Note that this is part of a bigger feature for GPU-aware scheduling and covers only how the executor finds its resources. The general flow:
In this PR I added configs and arguments to the executor so that it can discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation, so that they can assign specific resources to specific executors in case there are multiple executors on a node.
The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use.
Note that there will be follow-on PRs to add other parts, such as the scheduler part. See the high-level epic JIRA: https://issues.apache.org/jira/browse/SPARK-24615
How was this patch tested?
Added unit tests and manually tested.