We've seen a couple of instances where the `nvidia-device-plugin` pod fails when the apiserver is briefly unavailable.

In this situation, any Pods that request the GPU as a resource cannot be scheduled, since the plugin no longer advertises it to the kubelet.

We see the following in the logs a couple of times, but nothing interesting before or after these messages:
```
/build/cmd/config-manager/main.go:287: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.28.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dlab-mtv-qa03&resourceVersion=13505357": net/http: TLS handshake timeout
/build/cmd/config-manager/main.go:287: failed to list *v1.Node: Get "https://172.28.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dlab-mtv-qa03&resourceVersion=13505357": net/http: TLS handshake timeout
Trace[1419135208]: "Reflector ListAndWatch" name:/build/cmd/config-manager/main.go:287 (05-Jan-2025 07:08:27.960) (total time: 32816ms):
Trace[1419135208]: ---"Objects listed" error:Get "https://172.28.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dlab-mtv-qa03&resourceVersion=13505357": net/http: TLS handshake timeout 32513ms (07:09:00.473)
Trace[1419135208]: [32.816731696s] [32.816731696s] END
/build/cmd/config-manager/main.go:287: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.28.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dlab-mtv-qa03&resourceVersion=13505357": net/http: TLS handshake timeout
```
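For context, these messages come from a client-go reflector listing/watching the local Node object with a `fieldSelector` on the node name. The following is only a minimal sketch of what we assume the config-manager is doing based on the log (not the actual plugin code); the node name and resync period are illustrative. A reflector in this setup normally logs the "Failed to watch *v1.Node" error and retries with backoff when the apiserver is unreachable, rather than exiting the process.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; requests go to the service IP seen in the log (https://172.28.0.1:443).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "lab-mtv-qa03" // in the real plugin this would be injected, e.g. via the downward API

	// Watch only this node, mirroring fieldSelector=metadata.name%3D<node> from the failing List request.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
		}),
	)

	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			node := newObj.(*corev1.Node)
			fmt.Printf("node %s updated\n", node.Name)
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// While the apiserver is unreachable, the reflector logs list/watch failures
	// (like the TLS handshake timeouts above) and keeps retrying with backoff.
	cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced)
	<-stopCh
}
```

Given that the list/watch failures are transient and normally retried, the open question for us is why the plugin stops advertising the GPU resource to the kubelet after this short outage.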