Instructions for GKE ubuntu #57
GPUs on the Ubuntu image are not officially supported by GKE yet: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#limitations. However, they do work. :) After starting the cluster, just run:
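(The exact command isn't quoted above; presumably it is a one-line apply of the Ubuntu driver-installer daemonset referenced later in this thread, along these lines:)

```
# Presumed command, not quoted in the original comment: deploys the
# NVIDIA driver installer DaemonSet for Ubuntu nodes.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset.yaml
```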
No need to install anything manually, and no need to replace any paths.
Can you describe why you'd like to use Ubuntu instead of COS?
The installer gets stuck on "Updating container's ld cache...", even if I wait for hours. Repro steps for a cluster started from scratch a moment ago: https://gist.github.com/kozikow/e3998c3b8a87840aa5c0aa684a21245d
For development. Using the ubuntu node type allows me to ssh to the GCE instance created by GCP, check out code, and build images directly on the node. Using COS for that is not possible, as our Docker build uses things like ansible or ansible-vault. Other dev options for GPUs with Kubernetes are worse.
I spoke too soon. Try with 1.9.3-gke.0, which is rolling out this week.
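For reference, a sketch of upgrading an existing cluster to that release (cluster name and zone here are placeholders, not from this thread):

```
# Placeholder name/zone; upgrades the cluster's nodes to 1.9.3-gke.0
gcloud container clusters upgrade my-cluster \
  --cluster-version=1.9.3-gke.0 --zone=us-central1-a
```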
I get the same error, although I'm deploying to bare metal. The ldconfig step hangs indefinitely for me when using the installer code on a fresh 1.9.3 cluster. OS: Ubuntu 16.04.4
It works now on the GKE 1.9.3 cluster (@kozikow can you confirm?). The reason it was failing on GKE 1.9.2 is that GKE 1.9.2's Ubuntu image ships a Docker that uses the aufs storage driver, while GKE 1.9.3's Ubuntu image ships a Docker that uses overlay2. It doesn't really have anything to do with the Kubernetes version.
Confirmed, switching to overlay2 got me past the ldconfig hang too.
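For anyone hitting this on self-managed nodes, a minimal sketch of that switch, assuming Docker reads its config from /etc/docker/daemon.json (images and containers created under aufs become invisible after the change):

```
# Check which storage driver Docker is currently using
docker info --format '{{.Driver}}'

# Switch to overlay2, then restart Docker. Anything built or pulled
# under aufs will no longer be visible afterwards.
echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
```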
Hello, I've been trying to make Kubernetes with CUDA work for some days now.
@minterciso Have you tried applying the following daemonset from https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers?
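(The manifest itself isn't shown above; presumably it is the preloaded installer from that docs page, i.e. something like the following, with the Ubuntu variant shown and a cos/ counterpart at the analogous path:)

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
```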
Yeah I did, same issue. I found the solution last night though: it seems that nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml on master (and its COS counterpart) is older than what I was using for developing. Fiddling around a little on GitHub, I found that there's another daemonset.yaml, at https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml (COS only), that is actually updated with the latest NVIDIA driver. If someone wants to hear my tale... http://portfolio.geekvault.org/blog/5/
@minterciso I see you're using CUDA 10.1. The NVIDIA 418 driver, which CUDA 10.1 requires, is not yet supported for Ubuntu on GKE. The COS image has support for 418 starting from 1.13.6-gke.6.
Yeah, that was what I found out as well, but honestly the documentation is very vague on that, especially because even when using COS (as stated in the document) you only get CUDA 10.0.
Yeah, agreed. The document will be updated soon to reflect the changes, since 1.13.6-gke.6 was pretty recent. The preloaded daemonset will install the 418 driver for 1.13.6-gke.6 and above.
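Once the installer has run, one way to confirm which driver actually landed is to run nvidia-smi in a throwaway GPU pod. A minimal sketch, assuming the nvidia/cuda:10.1-base image is available and a kubectl old enough to still accept --limits on kubectl run:

```
# One-off pod on a GPU node; the "Driver Version" line in the output
# shows whether the node got the 410 or the 418 driver.
kubectl run nvidia-smi --rm -it --restart=Never \
  --image=nvidia/cuda:10.1-base \
  --limits=nvidia.com/gpu=1 \
  -- nvidia-smi
```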
I am trying to use this project on a GKE node with --image-type=ubuntu, but it doesn't work out of the box. Any pointers on how to make it work?
What I tried so far:

1. Start a GKE cluster and ssh to the instance (see the sketch after this list).
2. Install GPU libraries on the instance. I followed the installation steps from https://cloud.google.com/compute/docs/gpus/add-gpus. Libraries seemed to be installed in /usr/lib/nvidia-390.
3. Start the driver installer daemonset. I replaced all references to /home/kubernetes/bin/nvidia with /usr/lib/nvidia-390 in https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/daemonset.yaml.
4. Installer gets stuck. It gets stuck for hours on the line "Updating container's ld cache...". FWIW it also gets stuck on this line if the NVIDIA driver is not installed on the host.

```
kubectl logs -f nvidia-driver-installer-88lqp --namespace=kube-system -c nvidia-driver-installer
```
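A sketch of the commands behind steps 1 and 3 above; the cluster name, zone, and accelerator type are placeholders (only --image-type=ubuntu and the paths come from the report itself):

```
# Create a GKE cluster whose nodes run the Ubuntu image, one GPU per node
gcloud container clusters create gpu-test \
  --zone=us-central1-a \
  --image-type=ubuntu \
  --accelerator=type=nvidia-tesla-k80,count=1

# SSH to one of the node VMs (names via `gcloud compute instances list`)
gcloud compute ssh <node-instance-name> --zone=us-central1-a

# Step 3's path rewrite, applied to a local copy of the manifest
sed -i 's|/home/kubernetes/bin/nvidia|/usr/lib/nvidia-390|g' daemonset.yaml
```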