Incompatible strategy detected auto No devices found. Waiting indefinitely #1134

Open
smartlocus opened this issue Jan 23, 2025 · 1 comment

Comments

@smartlocus

Hello everyone, I have the k8s-device-plugin running on my second Kubernetes master node, which has a GeForce RTX 2060 GPU. The GPU works fine, and I can also run my machine learning training jobs on it using Docker. I don't understand why the k8s-device-plugin container in my Kubernetes cluster cannot see the GPU. I am using CRI-O as the default runtime and have also set it as the default using --set-as-default. I would appreciate it if someone could assist me. Below are screenshots of the device plugin container issue, the output of nvidia-smi showing that my GPU is working, and the status of my CRI-O service showing that it is running.

Image

Image

Image

```
I0123 16:54:07.204118 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
    d475b2c
    commit: d475b2c

I0123 16:54:07.204992 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0123 16:54:07.205170 1 main.go:245] Starting OS watcher.
I0123 16:54:07.205480 1 main.go:260] Starting Plugins.
I0123 16:54:07.205513 1 main.go:317] Loading configuration.
I0123 16:54:07.206134 1 main.go:342] Updating config with default resource matching patterns.
I0123 16:54:07.206267 1 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0123 16:54:07.206276 1 main.go:356] Retrieving plugins.
E0123 16:54:07.206553 1 factory.go:112] Incompatible strategy detected auto
E0123 16:54:07.206570 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0123 16:54:07.206574 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0123 16:54:07.206577 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0123 16:54:07.206580 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0123 16:54:07.206583 1 main.go:381] No devices found. Waiting indefinitely.
```
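
In a long plugin log it can help to pull out just the error-severity lines. Below is a minimal Python sketch that does this for klog-style lines like the ones above. The function name `klog_errors` and the `SAMPLE` text are illustrative only and not part of the device plugin; the regular expression assumes the single-line header layout seen in this log (severity letter, `mmdd`, time, thread id, `file:line]`).

```python
import re

# klog header: severity letter, mmdd, time, thread id, file:line, then the message.
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)


def klog_errors(log_text):
    """Return (source, message) pairs for every error-severity klog line."""
    errors = []
    for line in log_text.splitlines():
        match = KLOG_LINE.match(line.strip())
        if match and match.group("severity") == "E":
            errors.append((match.group("source"), match.group("message")))
    return errors


# A few lines copied from the log in this issue, used as sample input.
SAMPLE = """\
I0123 16:54:07.206276 1 main.go:356] Retrieving plugins.
E0123 16:54:07.206553 1 factory.go:112] Incompatible strategy detected auto
E0123 16:54:07.206570 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
I0123 16:54:07.206583 1 main.go:381] No devices found. Waiting indefinitely.
"""

if __name__ == "__main__":
    for source, message in klog_errors(SAMPLE):
        print(f"{source}: {message}")
```

Here the first error reported is `factory.go:112`, the "Incompatible strategy detected auto" line, which is followed by the hint about the NVIDIA Container Toolkit.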

@elezar
Member

elezar commented Feb 3, 2025

@smartlocus could you exec into the device plugin container and confirm that you can run nvidia-smi there? If this works, then the device plugin should be detecting the available devices. If not, then the injection of the driver and devices from the host into the container is not working as expected.
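
One way to script the check described above is sketched here in Python. It only builds and runs a standard `kubectl exec ... -- nvidia-smi` command; the helper names, the `kube-system` default namespace, and the pod name in the example are assumptions for illustration and need to be replaced with the values from the actual cluster (for example from `kubectl get pods -A`).

```python
import subprocess


def nvidia_smi_exec_command(pod, namespace="kube-system", container=None):
    """Build the kubectl command that runs nvidia-smi inside the given pod."""
    cmd = ["kubectl", "exec", "-n", namespace, pod]
    if container:
        # Only needed when the pod has more than one container.
        cmd += ["-c", container]
    cmd += ["--", "nvidia-smi"]
    return cmd


def gpu_visible_in_pod(pod, namespace="kube-system", container=None):
    """Run nvidia-smi in the pod; return (succeeded, combined output)."""
    result = subprocess.run(
        nvidia_smi_exec_command(pod, namespace, container),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout or result.stderr


if __name__ == "__main__":
    # Hypothetical pod name; this only prints the command and does not run it.
    print(" ".join(nvidia_smi_exec_command("nvidia-device-plugin-daemonset-xxxxx")))
```

If `gpu_visible_in_pod` reports a failure (for instance because `nvidia-smi` is not found in the container), that would point at the driver and devices not being injected into the container, as described above.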

What is your current crio config?
How is the NVIDIA Container Toolkit installed?
