
nvidia-k8s-device-plugin: change service dependency #3141

Merged: 1 commit into bottlerocket-os:develop from fix-device-plugin, May 24, 2023

Conversation

arnaldo2792 (Contributor)

Issue number:

Should close #3132

Description of changes:
The NVIDIA device plugin knows how to reconnect to the kubelet in case it isn't ready or it restarts. Thus, it is sufficient to declare the dependency on the kubelet with "Wants" instead of a stronger dependency directive.
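
A minimal sketch of what this looks like in the plugin's unit file, assuming the previous directive was a hard dependency such as Requires= (the exact prior directive and the After= ordering shown here are illustrative, not taken from this PR):

# Before (assumed): a hard dependency stops or fails the plugin whenever the kubelet
# is stopped, restarted, or not yet started
[Unit]
Requires=kubelet.service
After=kubelet.service

# After: a weak dependency; the plugin keeps running across kubelet restarts
# and relies on its own reconnect logic
[Unit]
Wants=kubelet.service
After=kubelet.service

Wants= still pulls kubelet.service into the same transaction at boot, but the plugin is no longer stopped or marked failed when the kubelet restarts or comes up late.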

Testing done:

  • I restarted the kubelet, and confirmed the NVIDIA k8s device plugin was up
  • I stopped and started the kubelet, and confirmed the NVIDIA device plugin reconnected to the kubelet
  • I confirmed that the pods were accessible and that I could execute nvidia-smi

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

The NVIDIA device plugin knows how to re-connect to the kubelet in case
it isn't ready or it re-starts. Thus, it is sufficient to declare the
dependency on the kubelet with "Wants" instead of a stronger dependency
directive.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792 arnaldo2792 changed the title nvidia-k8s-device-plugin: change service depency nvidia-k8s-device-plugin: change service dependency May 24, 2023
@arnaldo2792 (Contributor, Author)

(forced push to fix typo in commit message)

@bcressey (Contributor)

I confirmed that the pods were accessible and that I could execute nvidia-smi

This might be what you covered, but can you confirm that this sequence works?

  • system comes up
  • kubelet restarted manually
  • (device plugin reconnects to kubelet socket)
  • new pod launched and still has access to GPU device

Essentially, validate that the device plugin survives a kubelet restart and continues to function afterwards.

@arnaldo2792 (Contributor, Author)

Essentially, validate that the device plugin survives a kubelet restart and continues to function afterwards.

Yes, I confirmed this by looking at the logs (I trimmed the timestamps for readability). The first line appeared after I ran systemctl restart kubelet.service.

nvidia-device-plugin[1833]: I0524 16:04:50.652300    1833 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
nvidia-device-plugin[1833]: I0524 16:04:50.652333    1833 main.go:294] Stopping plugins.
nvidia-device-plugin[1833]: I0524 16:04:50.652343    1833 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
nvidia-device-plugin[1833]: I0524 16:04:50.652429    1833 main.go:176] Starting Plugins.
nvidia-device-plugin[1833]: I0524 16:04:50.652435    1833 main.go:234] Loading configuration.

And the new pod running:

default       gpu-tests-g5xrz            1/1     Running   0          22s
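
For completeness, a rough sketch of that validation sequence as commands; the unit name nvidia-k8s-device-plugin.service, the pod manifest, and the container image are assumptions for illustration, not taken from this PR:

# Restart the kubelet on the node
systemctl restart kubelet.service

# Confirm the device plugin stayed up and logged the reconnect (unit name assumed)
systemctl is-active nvidia-k8s-device-plugin.service
journalctl -u nvidia-k8s-device-plugin.service | grep 'kubelet.sock created, restarting'

# Launch a new pod that requests a GPU and run nvidia-smi inside it
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: nvidia/cuda:11.8.0-base-ubuntu22.04   # example image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl wait --for=condition=Ready pod/gpu-test --timeout=120s
kubectl exec gpu-test -- nvidia-smi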

@arnaldo2792 arnaldo2792 merged commit 0888cb2 into bottlerocket-os:develop May 24, 2023
etungsten added a commit that referenced this pull request on May 24, 2023
@arnaldo2792 arnaldo2792 deleted the fix-device-plugin branch June 19, 2023 18:35

Successfully merging this pull request may close these issues:

  • nvidia-variant random nvidia-device-plugin failures on boot-time (#3132)