This plugin adds support for GPUs with MIG on YARN. The built-in YARN GPU plugin does not support MIG enabled GPUs. This plugin also works with GPUs without MIG or GPUs with MIG disabled but the limitation section still applies. It supports heterogenous environments where there may be some MIG enabled GPUs and some without MIG. If you are not using MIG enabled GPUs, you should use the built-in YARN GPU plugin.
It works with Apache YARN 3.3.0+ versions that support the Pluggable Device Framework. This plugin requires YARN to be configured with cgroups and Nvidia Docker runtime v2.
Please see the MIG Application Considerations and CUDA Device Enumeration.
It is important to note that CUDA 11 only supports enumeration of a single MIG instance. This means that this plugin only supports 1 GPU per container and the plugin will throw an exception by default if you request more. It is recommended that you configure YARN to only allow a single GPU be requested. See the yarn config:
yarn.resource-types.nvidia/miggpu.maximum-allocation
See YARN Resource Configuration for more details. If you do not configure the maximum allocation and someone requests multiple GPUs, the default behavior is to throw an exception. The user visible exception is not very useful, as the real exception will be in the nodemanager logs. See the Configuration section for options if it throws an exception.
mvn package
This will create a jar target/yarn-gpu-mig-plugin-1.0.0.jar
. This jar can be installed on your YARN cluster as a plugin.
These instructions assume YARN is already installed and configured with cgroups enabled and Nvidia Docker runtime v2. Enable and configure your GPUs with MIG on all of the nodes it applies to.
Install the jar into your Hadoop Cluster, see the Test and Use Your Own Plugin
section. This recommends installing it in something like $HADOOP_COMMOND_HOME/share/hadoop/yarn
.
Configure the device plugin, see the YARN documentation on Pluggable Device Framework.
After enabling the framework, enable the plugin in yarn-site.xml
:
<property>
<name>yarn.nodemanager.pluggable-device-framework.device-classes</name>
<value>com.nvidia.spark.NvidiaGPUMigPluginForRuntimeV2</value>
</property>
Configure YARN to have the new resource type by modifying the resource-types.xml
file to include:
<property>
<name>yarn.resource-types</name>
<value>nvidia/miggpu</value>
</property>
Restart YARN to pick up any configuration changes.
To change the behavior of throwing when the user allocates multiple GPUs, you can either set a config in the yarn-site.xml
or set
an environment variable when launching the Spark application. The environment variable will take precendence if both are set.
In either case, true
means to throw if a user requests multiple GPUs (this is the default), false
means it won't throw and if the container is allocated with multiple MIG devices from the same
GPU, it is up to the application to know how to use them.
Config for yarn-site.xml
:
<property>
<name>com.nvidia.spark.NvidiaGPUMigPluginForRuntimeV2.throwOnMultipleGPUs</name>
<value>true</value>
</property>
Environment variable for Spark application:
--conf spark.executorEnv.NVIDIA_MIG_PLUGIN_THROW_ON_MULTIPLE_GPUS=true
Spark supports scheduling GPUs and other custom resources on YARN. There are 2 options for using this plugin with Spark to allocate GPUs with MIG support:
- Use Spark 3.2.1 or newer and remap the standard Spark
gpu
resource (i.e.:spark.executor.resource.gpu.amount
) to be the new MIG GPU resource type using:
--conf spark.yarn.resourceGpuDeviceName=nvidia/miggpu
This means users don't have to change their configs if they were already using the gpu
resource type.
- Spark applications specify the
nvidia/miggpu
resource type instead of thegpu
resource type. For this the user has to change the resource type tonvidia/miggpu
, update the discovery script, and specify an extra YARN config(spark.yarn.executor.resource.nvidia/miggpu.amount
). The command would be something like below (update the amounts according to your setup):
--conf spark.executor.resource.nvidia/miggpu.amount=1 --conf spark.executor.resource.nvidia/miggpu.discoveryScript=./getMIGGPUs --conf spark.task.resource.nvidia/miggpu.amount=0.25 --files ./getMIGGpus --conf spark.yarn.executor.resource.nvidia/miggpu.amount=1
Note the getMIGGpus discovery script would is in the scripts
directory in this repo. It just changes the resource name returned to match
nvidia/miggpu
.
Run a Spark application using the Rapids Accelerator for Apache Spark and request GPUs from YARN and verify they use the MIG enabled GPUs.