
Feature Request: Support multiple GPU driver versions in one k8s cluster #323

Closed
khatrig opened this issue Feb 16, 2022 · 9 comments

@khatrig

khatrig commented Feb 16, 2022

1. feature description

Currently, the gpu-operator only supports a single driver version across the cluster. It would be great if each GPU node could run a different driver version based on its requirements.

E.g. there are two GPU nodes, A and B, in a k8s cluster: node A could have driver version X and node B could have driver version Y.

@shivamerla
Contributor

Thanks for the feature request. This will indeed be a great feature. Currently the only way to do this is with drivers pre-installed on the host alongside the GPU operator. We would need to introduce nvidia.com/gpu.deploy.driver: <version> labels to achieve this, and a DaemonSet would need to be created for each version. We would also need a Config/CustomResource for users to provide the mapping of <node_label> to <driver_version>, where the node label can be based on GPU type etc. Supporting heterogeneous nodes in the cluster is on the roadmap, and this will be handled along with it.
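To make the idea concrete, a hypothetical mapping resource of the kind described above could look something like the sketch below. Nothing here is an existing GPU operator API; the field names, labels, and driver versions are all illustrative.

```yaml
# Hypothetical sketch only: a ConfigMap expressing the <node_label> -> <driver_version>
# mapping described above. The operator would then render one driver DaemonSet per
# version, each selecting nodes labeled nvidia.com/gpu.deploy.driver: <version>.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-driver-version-mapping
  namespace: gpu-operator
data:
  mapping: |
    nvidia.com/gpu.product=Tesla-K80: "470.141.03"
    nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB: "510.47.03"
```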

@khatrig
Author

khatrig commented Feb 27, 2022

> Thanks for the feature request. This will indeed be a great feature. Currently the only way to do this is with drivers pre-installed on the host alongside the GPU operator. We would need to introduce nvidia.com/gpu.deploy.driver: <version> labels to achieve this, and a DaemonSet would need to be created for each version. We would also need a Config/CustomResource for users to provide the mapping of <node_label> to <driver_version>, where the node label can be based on GPU type etc. Supporting heterogeneous nodes in the cluster is on the roadmap, and this will be handled along with it.

@shivamerla Thanks for considering this.
Another way of doing this could be a driver-version controller (e.g. a Kubernetes Deployment) that understands the nvidia.com/gpu.deploy.driver: <version> label added to GPU nodes. You add the appropriate label to a node, and the controller triggers the driver installation process on that node. This way it might be possible to avoid running multiple DaemonSets.
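For illustration, the only per-node input under this proposal would be a label on the Node object, something like the sketch below (the node name and driver version are illustrative):

```yaml
# Hypothetical: a node labeled with the driver version it should receive; the
# proposed controller would watch for this label and install that version.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a
  labels:
    nvidia.com/gpu.deploy.driver: "510.47.03"
```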

@shivamerla
Contributor

@khatrig, we currently package a single driver version into each image, hence the need for a separate DaemonSet per version.

@khatrig
Author

khatrig commented Mar 11, 2022

I understand now what you meant earlier. Thanks for the explanation.

@glotzerhotze

@shivamerla

If this feature is developed, could we add a sanity check for deploying the latest driver compatible with the hardware found on a node?

For example:
The latest upgrade from 1.9.1 to 1.10.0 brought driver version 510.x.y, which dropped support for the Kepler architecture. Thus, when running a K80 device on a node, building the 510.x.y driver fails with a "no hardware found" error.

It would be nice if the operator sanity-checked the devices present on a node and selected the latest supported driver auto-magically, preventing the breakage we currently see in such a use case.

Thanks

@takyon77

@shivamerla I would also be interested in having different drivers. We are currently trying to include different GPUs (A100 and K40m) in the same cluster, and I can't find a single driver that works for both (the latest 515 works for the A100 but not the K40m, and the latest 470 works for the K40m but doesn't see the A100 device).
Any guesstimate on when this feature could be available?

@bryantbiggs

Wouldn't this be possible today with different nodegroups and the appropriate labels/taints? It would require multiple DaemonSets to be deployed, but it seems like it could work, no? For example (see the sketch after this list):

  1. A100 nodegroup - tainted with gpuVer=A100:NoSchedule, daemonset deployed with toleration for said taint
  2. K40m nodegroup - tainted with gpuVer=K40m:NoSchedule, daemonset deployed with toleration for said taint
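A rough sketch of the A100 half of that setup, assuming a manually managed driver DaemonSet; the names, the gpuVer taint/label, and the driver image tag are all illustrative, and the K40m nodegroup would get a mirror-image DaemonSet with its own taint and a 470-series image:

```yaml
# Sketch: a driver DaemonSet pinned to the A100 nodegroup via a node label plus
# a toleration for the gpuVer=A100:NoSchedule taint described above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-a100
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-driver-a100
  template:
    metadata:
      labels:
        app: nvidia-driver-a100
    spec:
      nodeSelector:
        gpuVer: A100              # label applied to the A100 nodegroup
      tolerations:
        - key: gpuVer
          operator: Equal
          value: A100
          effect: NoSchedule      # matches the gpuVer=A100:NoSchedule taint
      containers:
        - name: nvidia-driver
          image: nvcr.io/nvidia/driver:515.86.01-ubuntu20.04   # illustrative tag
          securityContext:
            privileged: true
```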

@khatrig
Author

khatrig commented Nov 1, 2023

Glad to see that gpu-operator now has a feature (tech preview) that makes it possible to run multiple driver versions in the same cluster.
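If I'm reading the tech preview correctly, it works by creating one NVIDIADriver custom resource per driver version and assigning it to a subset of nodes via a nodeSelector, roughly like the sketch below (the name, version, and label are illustrative; double-check the field names against the official docs):

```yaml
# Rough sketch based on the tech preview: one NVIDIADriver CR per driver version,
# targeted at the nodes carrying a matching label.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: driver-535
spec:
  driverType: gpu
  repository: nvcr.io/nvidia
  image: driver
  version: "535.104.05"
  nodeSelector:
    driver.version: "535"   # label you apply to the nodes that should run this version
```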

Closing.

@khatrig khatrig closed this as completed Nov 1, 2023
@changhyuni

@khatrig
I've been looking at this feature, but doesn't this mean that we have to partition the worker nodes by driver version?
