
Feature Request: Support multiple GPU driver versions in one k8s cluster #323

Closed
khatrig opened this issue Feb 16, 2022 · 9 comments

@khatrig

khatrig commented Feb 16, 2022

1. feature description

Currently, the gpu-operator only supports a single driver version across the cluster. It would be great if each GPU node could run a different driver version based on its requirements.

E.g. there are two GPU nodes, A and B, in a k8s cluster: node A could have driver version X and node B could have driver version Y.

@shivamerla
Contributor

Thanks for the feature request. This will indeed be a great feature. Currently the only way to do this is with drivers pre-installed on the host alongside the GPU operator. We would need to introduce nvidia.com/gpu.deploy.driver: <version> labels to achieve this, and a DaemonSet would need to be created for each version. We would also need a Config/CustomResource for users to provide the mapping of <node_label> to <driver_version>, where the node label can be based on GPU type etc. Supporting heterogeneous nodes in the cluster is on the roadmap, and this will be handled along with it.
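To make the idea concrete, a hypothetical mapping resource of the kind described above could look something like the sketch below. Nothing here is an existing GPU operator API; the field names, labels, and driver versions are all illustrative.

```yaml
# Hypothetical sketch only: a ConfigMap expressing the <node_label> -> <driver_version>
# mapping described above. The operator would then render one driver DaemonSet per
# version, each selecting nodes labeled nvidia.com/gpu.deploy.driver: <version>.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-driver-version-mapping
  namespace: gpu-operator
data:
  mapping: |
    nvidia.com/gpu.product=Tesla-K80: "470.141.03"
    nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB: "510.47.03"
```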

@khatrig
Author

khatrig commented Feb 27, 2022

> Thanks for the feature request. This will indeed be a great feature. Currently the only way to do this is with drivers pre-installed on the host alongside the GPU operator. We would need to introduce nvidia.com/gpu.deploy.driver: <version> labels to achieve this, and a DaemonSet would need to be created for each version. We would also need a Config/CustomResource for users to provide the mapping of <node_label> to <driver_version>, where the node label can be based on GPU type etc. Supporting heterogeneous nodes in the cluster is on the roadmap, and this will be handled along with it.

@shivamerla Thanks for considering this.
Another way of doing this could be a driver-version controller (e.g. a Kubernetes Deployment) that understands the nvidia.com/gpu.deploy.driver: <version> label added to GPU nodes. You add the appropriate label to a node, and the controller triggers the driver installation process on that node. This way it might be possible to avoid running multiple DaemonSets.
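For illustration, the only per-node input under this proposal would be a label on the Node object, something like the sketch below (the node name and driver version are illustrative):

```yaml
# Hypothetical: a node labeled with the driver version it should receive; the
# proposed controller would watch for this label and install that version.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a
  labels:
    nvidia.com/gpu.deploy.driver: "510.47.03"
```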

@shivamerla
Contributor

@khatrig, we currently package a single driver version into each image, hence the need for a separate DaemonSet per version.

@khatrig
Author

khatrig commented Mar 11, 2022

I understand now what you meant earlier. Thanks for the explanation.

@glotzerhotze

@shivamerla

If this feature is developed, could we add a sanity check for deploying the latest driver compatible with the hardware found on a node?

For example:
The latest upgrade from 1.9.1 to 1.10.0 brought driver version 510.x.y, which dropped support for the Kepler architecture. Thus, when running a K80 device on a node, building the 510.x.y driver fails with a "no hardware found" error.

It would be nice if the operator sanity-checked the devices present on a node and selected the latest supported driver auto-magically, preventing the breakage we currently see in such a use case.

Thanks

@takyon77

@shivamerla I would also be interested in having different drivers. We are currently trying to include different GPUs (A100 and K40m) in the same cluster, and I can't find a single driver that works for both (the latest 515 works for the A100 but not the K40m, and the latest 470 works for the K40m but doesn't see the A100 device).
Any guesstimate on when this feature could be available?

@bryantbiggs

Wouldn't this be possible today with different nodegroups and the appropriate labels/taints? It would require multiple DaemonSets to be deployed, but it seems like it could work, no? For example (see the sketch after this list):

  1. A100 nodegroup - tainted with gpuVer=A100:NoSchedule, daemonset deployed with toleration for said taint
  2. K40m nodegroup - tainted with gpuVer=K40m:NoSchedule, daemonset deployed with toleration for said taint
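A rough sketch of the A100 half of that setup, assuming a manually managed driver DaemonSet; the names, the gpuVer taint/label, and the driver image tag are all illustrative, and the K40m nodegroup would get a mirror-image DaemonSet with its own taint and a 470-series image:

```yaml
# Sketch: a driver DaemonSet pinned to the A100 nodegroup via a node label plus
# a toleration for the gpuVer=A100:NoSchedule taint described above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-a100
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-driver-a100
  template:
    metadata:
      labels:
        app: nvidia-driver-a100
    spec:
      nodeSelector:
        gpuVer: A100              # label applied to the A100 nodegroup
      tolerations:
        - key: gpuVer
          operator: Equal
          value: A100
          effect: NoSchedule      # matches the gpuVer=A100:NoSchedule taint
      containers:
        - name: nvidia-driver
          image: nvcr.io/nvidia/driver:515.86.01-ubuntu20.04   # illustrative tag
          securityContext:
            privileged: true
```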

@khatrig
Author

khatrig commented Nov 1, 2023

Glad to see that gpu-operator now has a feature (tech preview) that makes it possible to run multiple driver versions in the same cluster.
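If I'm reading the tech preview correctly, it works by creating one NVIDIADriver custom resource per driver version and assigning it to a subset of nodes via a nodeSelector, roughly like the sketch below (the name, version, and label are illustrative; double-check the field names against the official docs):

```yaml
# Rough sketch based on the tech preview: one NVIDIADriver CR per driver version,
# targeted at the nodes carrying a matching label.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: driver-535
spec:
  driverType: gpu
  repository: nvcr.io/nvidia
  image: driver
  version: "535.104.05"
  nodeSelector:
    driver.version: "535"   # label you apply to the nodes that should run this version
```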

Closing.

@khatrig khatrig closed this as completed Nov 1, 2023
@changhyuni

@khatrig
I've been looking at this feature, but doesn't this mean that we have to partition the worker nodes by driver version?
