nvidia-driver-installer crash loop during GKE scale ups #132

Open
brannondorsey opened this issue Jan 13, 2020 · 8 comments

@brannondorsey

brannondorsey commented Jan 13, 2020

We've been using the nvidia-driver-installer on Ubuntu node groups via GKE v1.15 per the official How-to GPU instructions specified here.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

The daemonset deployed via daemonset-preloaded.yaml appeared to work correctly for some time; however, we started noticing issues last Friday when new nodes were added to the node group by the cluster autoscaler. The nvidia-driver-installer daemonset pods scheduled to those new nodes began to crash loop, as their initContainers were exiting with non-zero exit codes.

Examining the pod logs shows that the failed pods all emit the following lines as their last output before exiting.

Verifying Nvidia installation... DONE. 
ln: /root/home/kubernetes/bin/nvidia: cannot overwrite directory
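
For context on that second line: this is the generic failure you get when the destination of a forced symlink already exists as a real directory, since ln refuses to remove a directory even with -f. A minimal sketch of the same failure mode, independent of the installer (the exact command the installer runs is not shown in this log excerpt):

# /tmp/demo/nvidia stands in for /root/home/kubernetes/bin/nvidia
mkdir -p /tmp/demo/nvidia
ln -sfn /tmp/some-driver-dir /tmp/demo/nvidia
# ln: /tmp/demo/nvidia: cannot overwrite directory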

See here for full log output from one of the failed pods.

I've logged into one of the nodes and manually removed the /root/home/kubernetes/bin/nvidia folder (which is presumably created by the very first nvidia-driver-installer pod scheduled to a node when it comes up), but the folder reappears and the daemonset pods continue to crash loop. Nodes whose daemonset pods are in this state don't have the drivers correctly installed, and jobs that require them fail to import CUDA due to driver issues.

We've been experiencing this issue for 4 days now with nodes that receive live production traffic. Not every node that scales up hits this problem, but most do. If a node comes up and its nvidia-driver-installer pod begins to crash, we've had no luck bringing it out of that state. Instead we've manually marked the node as unschedulable and brought it down, hoping the next one to come up won't have the same problem.
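
For reference, that manual mitigation amounts to something like the following (the node name is hypothetical; these are not the exact commands we ran):

NODE=gke-prod-gpu-pool-1234   # hypothetical node name
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-local-data
kubectl delete node "$NODE"   # let the managed instance group replace it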

From our perspective, nothing has changed in our cluster configuration, node group configuration, or K8s manifests that would cause this issue to start occurring. We did experience something similar in mid-December, but that issue resolved itself within a few hours and we didn't think much of it. I'm happy to provide more logs or detailed information about the errors on request!

Any thoughts about what could be causing this?

@brannondorsey brannondorsey changed the title nvidia-driver-install crash loop during GKE scale ups nvidia-driver-installer crash loop during GKE scale ups Jan 13, 2020
@karan
Contributor

karan commented Jan 29, 2020

I can't repro this with 1.15 on GKE.

gcloud container clusters create gpu-test --accelerator type=nvidia-tesla-k80,count=1 --zone=us-central1-c --num-nodes=1 --cluster-version=1.15

Then:

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
daemonset.apps/nvidia-driver-installer created

Then I scaled instances from 1 to 2, and saw that the driver is running fine.

$ ka get po | grep nvidia
kube-system   nvidia-driver-installer-c8jcm                               1/1     Running   0          3m34s
kube-system   nvidia-driver-installer-hrghh                               1/1     Running   0          12m
kube-system   nvidia-gpu-device-plugin-6jql6                              1/1     Running   0          3m34s
kube-system   nvidia-gpu-device-plugin-fdzf7                              1/1     Running   0          13m
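
The exact scale-up command isn't shown above; it was presumably something along the lines of:

gcloud container clusters resize gpu-test --num-nodes=2 --zone=us-central1-c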

Are you still seeing this? If so, can you please provide repro steps?

@tangenti
Contributor

We are unable to locate the exact root cause since the repro steps are missing here.
A potential fix on gRPC has been submitted in #135.

@adityapatadia

This can be reproduced under Ubuntu but not on COS. I just rolled a new cluster with Ubuntu and got this error.

@ruiwen-zhao
Contributor

> This can be reproduced under Ubuntu but not on COS. I just rolled a new cluster with Ubuntu and got this error.

Can you provide the GKE version and the OS where you reproduced this error?

@adityapatadia

adityapatadia commented Mar 25, 2021 via email

@ClementGautier

I encountered the same issue on:

  • Kernel version 5.4.0-1044-gke
  • OS image Ubuntu 20.04.2 LTS
  • Container runtime version docker://19.3.8
  • kubelet version v1.20.8-gke.900
  • kube-proxy version v1.20.8-gke.900
  • 2x Tesla P100

What's weird is that /root/home doesn't even exist on the node, so I have no idea why the pod fails to create the link. I tried updating to the latest version of the daemonset and it didn't help.
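
One note on paths: assuming the upstream daemonset-preloaded.yaml, which mounts the node's root filesystem (hostPath /) at /root inside the init container, /root/home/kubernetes/bin/nvidia inside the pod should correspond to /home/kubernetes/bin/nvidia on the node itself. A quick check from the node, under that assumption:

ls -ld /home/kubernetes/bin/nvidia        # real directory or symlink?
stat -c '%F' /home/kubernetes/bin/nvidia  # prints "directory" or "symbolic link"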

@rarestg

rarestg commented Dec 15, 2021

Getting the same issue here on:

  • Kernel version 5.4.0-1051-gke
  • OS image Ubuntu 18.04.5 LTS
  • Container runtime version docker://19.3.2
  • kubelet version v1.19.14-gke.1900
  • kube-proxy version v1.19.14-gke.1900

I can't get any logs out of the pod but describing it shows the error:

Controlled By:  DaemonSet/nvidia-driver-installer-ubuntu
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://3f92ca08c6a68900de40a0fc98b236240722191b06b1eaf00fa8ba67be04ffbe
    Image:          gke-nvidia-installer:fixed
    Image ID:       docker://sha256:50944645cd9975d5b2c904353e1ab5b2cdd41f4e959aefbe7b2624d0b8c43652
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 15 Dec 2021 15:48:43 -0500
      Finished:     Wed, 15 Dec 2021 15:48:55 -0500
    Ready:          False
    Restart Count:  1950
    Requests:
      cpu:        150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m9gck (ro)
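
A note on the missing logs: init container logs have to be requested by container name, and with --previous while the container is crash looping, e.g. (pod name here is just an example):

kubectl -n kube-system logs nvidia-driver-installer-ubuntu-xxxxx -c nvidia-driver-installer --previous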

@omer-dayan

Just for your information: if you want to fix it, there is a workaround.
SSH into the node and run:
[image: screenshot of the workaround commands, not preserved here]

After that, restart the installer pod and it will run successfully.
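
The screenshot didn't survive, so the exact commands are unknown. Based on the "cannot overwrite directory" error earlier in the thread, a plausible shape for such a workaround (an assumption, not necessarily what the screenshot showed) is to move the conflicting directory aside on the node and then let the DaemonSet recreate the pod:

sudo mv /home/kubernetes/bin/nvidia /home/kubernetes/bin/nvidia.bak   # verify the path first
kubectl -n kube-system delete pod <installer-pod-on-that-node>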
