Using single GPU with multiple containers #123

Open · ndesh26 opened this issue Sep 23, 2019 · 11 comments

ndesh26 commented Sep 23, 2019

I was able to find some unofficial support for this in the case of the Nvidia plugin. Is there similar support in this plugin?

I'm interested in this feature and would be willing to implement it. I'd like to know whether other people would be interested in such a feature, and whether the maintainers of this repo would like to add it.

thomas-riccardi commented
Porting gpushare-device-plugin (and gpushare-scheduler-extender, if it needs adaptation) to GKE would be very useful to us!
(This is the next step after our simple GPU-sharing fork of container-engine-accelerators).

danisla (Contributor) commented Sep 27, 2019

There is a simple way to do GPU sharing on GKE: create symlinks on the host filesystem to the /dev/nvidia0 device. The plugin will pick up these new devices and register them with the node. There is no advanced scheduling, but it does let you attach the same GPU to multiple containers.

Here is an example I use in a DaemonSet to add the symlinks to the node after the NVIDIA driver has been installed:

# Create symlinks to the NVIDIA device to support GPU sharing.
# NVIDIA_0_SHARE is the number of containers that can share /dev/nvidia0.
NVIDIA_0_SHARE=16

OLD_DEV=/dev/nvidia0
# Create /dev/nvidia1 .. /dev/nvidia15, all pointing at the real device.
for i in $(seq 1 $((NVIDIA_0_SHARE - 1))); do
  NEW_DEV=/dev/nvidia${i}
  echo "Linking $OLD_DEV -> $NEW_DEV"
  ln -sf "$OLD_DEV" "$NEW_DEV"
done

ndesh26 (Author) commented Sep 30, 2019

@danisla That's a nice, simple solution for a use case with basic scheduling.

HenriTEL commented Oct 4, 2019

Another trick would be to make your node's RAM match the GPU memory.
Assuming your pods restrict their own GPU memory usage, you can then schedule using resources.requests.memory.
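To illustrate the idea (a minimal sketch with illustrative names and numbers, not an actual manifest from this thread): on a node with 16 GiB of RAM and a 16 GiB GPU, a pod that caps its own GPU usage at roughly 4 GiB also requests 4Gi of node memory, so the default scheduler keeps the combined GPU usage of co-located pods under the card's capacity.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-example          # illustrative name
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        memory: "4Gi"              # stands in for the ~4 GiB of GPU memory this pod will use
      limits:
        memory: "4Gi"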

ide8 commented Oct 15, 2019

@danisla, could you please clarify where exactly this bash script should be applied?
It doesn't work when I run it directly on the worker node.

danisla (Contributor) commented Oct 15, 2019

@ide8 here is a DaemonSet that I apply to create the symlinks on new nodes.
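For reference, a hypothetical sketch of such a DaemonSet (not the one linked above; the names, image, and node selector value are assumptions): a privileged container with the host's /dev mounted waits for the driver to appear, runs the symlink script from the earlier comment, then sleeps so the pod stays Running.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-symlinks        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-symlinks
  template:
    metadata:
      labels:
        name: nvidia-device-symlinks
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4   # restrict to GPU nodes; value is illustrative
      containers:
      - name: create-symlinks
        image: busybox
        securityContext:
          privileged: true
        volumeMounts:
        - name: dev
          mountPath: /dev             # host /dev, so the symlinks land on the node
        command: ["/bin/sh", "-c"]
        args:
        - |
          NVIDIA_0_SHARE=16
          OLD_DEV=/dev/nvidia0
          # Wait for the driver installer to create the real device first.
          until [ -e "$OLD_DEV" ]; do sleep 5; done
          i=1
          while [ "$i" -lt "$NVIDIA_0_SHARE" ]; do
            ln -sf "$OLD_DEV" "/dev/nvidia${i}"
            i=$((i + 1))
          done
          while true; do sleep 3600; done   # keep the pod alive
      volumes:
      - name: dev
        hostPath:
          path: /dev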

ide8 commented Oct 16, 2019

@danisla, thank you, but it seems that it doesn't work with EKS.
Trying to figure it out.

ndesh26 (Author) commented Oct 16, 2019

@HenriTEL That would work when we have just a single node. But with multiple nodes whose GPUs have different amounts of memory, it becomes increasingly difficult to ensure proper scheduling and to avoid overcommitting our resources.

@ALL I will try to create a PoC for this since there are enough people interested. I will open a pull request here when I'm done.

ide8 commented Oct 16, 2019

@ndesh26, in our case it doesn't work even with a single node, with or without the NVIDIA device plugin, but that may be specific to EKS.

ndesh26 (Author) commented Oct 17, 2019

@ide8 I was able to run @danisla's solution successfully on our system; your problem might be specific to EKS.

HenriTEL commented Oct 17, 2019

@ide8 Note that the nvidia-device-plugin deployed by GKE does not use nvidia-docker; it mounts the devices into the containers directly.
I apply this DaemonSet to install nvidia-docker and set nvidia as the default runtime.
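The linked DaemonSet is not reproduced here; as a hypothetical sketch of the default-runtime part only (assuming the NVIDIA driver and nvidia-container-runtime are already installed on the node, and that the node runs systemd), it writes /etc/docker/daemon.json on the host and restarts dockerd so "nvidia" becomes the default runtime:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-nvidia-default-runtime    # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: set-nvidia-default-runtime
  template:
    metadata:
      labels:
        name: set-nvidia-default-runtime
    spec:
      hostPID: true                   # needed to reach the host's systemd via nsenter
      containers:
      - name: configure-docker
        image: debian:stable-slim     # any image with a shell and nsenter
        securityContext:
          privileged: true
        volumeMounts:
        - name: docker-config
          mountPath: /etc/docker      # host's Docker config directory
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Point Docker at the NVIDIA runtime and make it the default
          # (this overwrites any existing daemon.json on the node).
          cat > /etc/docker/daemon.json <<'EOF'
          {
            "default-runtime": "nvidia",
            "runtimes": {
              "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] }
            }
          }
          EOF
          # Restart dockerd on the host so the change takes effect.
          nsenter --target 1 --mount --uts --ipc --net --pid -- systemctl restart docker
          while true; do sleep 3600; done   # keep the pod alive
      volumes:
      - name: docker-config
        hostPath:
          path: /etc/docker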
