Instructions for GKE ubuntu #57

Closed
kozikow opened this issue Feb 28, 2018 · 13 comments

kozikow commented Feb 28, 2018

I am trying to use this project with a GKE node created with --image-type=ubuntu, but it doesn't work out of the box. Any pointers on how to make it work?

What I tried so far:

Start GKE cluster and ssh to the instance

gcloud beta container clusters create \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --zone=$CLUSTER_ZONE \
  --num-nodes=1 \
  --cluster-version=1.9.2-gke.1 \
  --machine-type=n1-standard-8 \
  --image-type=ubuntu \
  --scopes=https://www.googleapis.com/auth/devstorage.read_write \
  $CLUSTER_NAME

Install GPU libraries on the instance

I followed the installation steps from https://cloud.google.com/compute/docs/gpus/add-gpus

curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
apt-get update
apt-get install cuda-8-0 -y

Libraries seemed to be installed in /usr/lib/nvidia-390:

root@dev-kozikow-instance:/usr/lib/nvidia-390# ls /usr/lib/nvidia-390
alt_ld.so.conf             libGL.so                       libGLX_nvidia.so.0            libnvidia-egl-wayland.so.1           libnvidia-ifr.so.390.30
alternate-install-present  libGL.so.1                     libGLX_nvidia.so.390.30       libnvidia-egl-wayland.so.1.0.2       libnvidia-ml.so
bin                        libGL.so.390.30                libGLdispatch.so.0            libnvidia-eglcore.so.390.30          libnvidia-ml.so.1
bin-workdir                libGLESv1_CM.so                libOpenGL.so                  libnvidia-encode.so                  libnvidia-ml.so.390.30
drivers                    libGLESv1_CM.so.1              libOpenGL.so.0                libnvidia-encode.so.1                libnvidia-ptxjitcompiler.so
drivers-workdir            libGLESv1_CM_nvidia.so.1       libnvcuvid.so                 libnvidia-encode.so.390.30           libnvidia-ptxjitcompiler.so.1
ld.so.conf                 libGLESv1_CM_nvidia.so.390.30  libnvcuvid.so.1               libnvidia-fatbinaryloader.so.390.30  libnvidia-ptxjitcompiler.so.390.30
lib64                      libGLESv2.so                   libnvcuvid.so.390.30          libnvidia-fbc.so                     libnvidia-tls.so.390.30
lib64-workdir              libGLESv2.so.2                 libnvidia-cfg.so              libnvidia-fbc.so.1                   libnvidia-wfb.so.1
libEGL.so                  libGLESv2_nvidia.so.2          libnvidia-cfg.so.1            libnvidia-fbc.so.390.30              libnvidia-wfb.so.390.30
libEGL.so.1                libGLESv2_nvidia.so.390.30     libnvidia-cfg.so.390.30       libnvidia-glcore.so.390.30           tls
libEGL.so.390.30           libGLX.so                      libnvidia-compiler.so         libnvidia-glsi.so.390.30             vdpau
libEGL_nvidia.so.0         libGLX.so.0                    libnvidia-compiler.so.1       libnvidia-ifr.so                     xorg
libEGL_nvidia.so.390.30    libGLX_indirect.so.0           libnvidia-compiler.so.390.30  libnvidia-ifr.so.1

Start driver installer daemonset

I replaced all references to /home/kubernetes/bin/nvidia with /usr/lib/nvidia-390 in https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/daemonset.yaml:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"
      hostNetwork: true
      hostPID: true
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-install-dir-host
        hostPath:
          path: /usr/lib/nvidia-390
      - name: root-mount
        hostPath:
          path: /
      initContainers:
      - image: gcr.io/google-containers/ubuntu-nvidia-driver-installer@sha256:7ffaf40fcf6bcc5bc87501b6be295a47ce74e1f7aac914a9f3e6c6fb8dd780a4
        name: nvidia-driver-installer
        resources:
          requests:
            cpu: 0.5
            memory: 512Mi
        securityContext:
          privileged: true
        env:
          - name: NVIDIA_INSTALL_DIR_HOST
            value: /usr/lib/nvidia-390
          - name: NVIDIA_INSTALL_DIR_CONTAINER
            value: /usr/local/nvidia
          - name: ROOT_MOUNT_DIR
            value: /root
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause

Installer gets stuck

The installer gets stuck for hours on the line "Updating container's ld cache...". FWIW, it also gets stuck on that line even if the NVIDIA driver is not installed on the host.

kubectl logs -f nvidia-driver-installer-88lqp --namespace=kube-system -c nvidia-driver-installer

+ NVIDIA_DRIVER_VERSION=384.111
+ NVIDIA_DRIVER_DOWNLOAD_URL_DEFAULT=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_DRIVER_DOWNLOAD_URL=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALL_DIR_HOST=/usr/lib/nvidia-390
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-384.111.run
+ ROOT_MOUNT_DIR=/root
+ set +x
Downloading kernel sources...
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/universe Sources [72.8 kB]
Get:4 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [584 kB]
Get:5 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.7 kB]
Get:6 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [403 kB]
Get:7 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [3486 B]
Get:8 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
Get:9 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:10 http://archive.ubuntu.com/ubuntu xenial/universe Sources [9802 kB]
Get:11 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe Sources [240 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [951 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.1 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [760 kB]
Get:19 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [18.5 kB]
Get:20 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [5153 B]
Get:21 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7168 B]
Fetched 25.0 MB in 2s (9368 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  linux-gcp-headers-4.13.0-1007
The following NEW packages will be installed:
  linux-gcp-headers-4.13.0-1007 linux-headers-4.13.0-1007-gcp
0 upgraded, 2 newly installed, 0 to remove and 46 not upgraded.
Need to get 11.5 MB of archives.
After this operation, 84.5 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 linux-gcp-headers-4.13.0-1007 all 4.13.0-1007.10 [10.7 MB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 linux-headers-4.13.0-1007-gcp amd64 4.13.0-1007.10 [726 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 11.5 MB in 1s (6909 kB/s)
Selecting previously unselected package linux-gcp-headers-4.13.0-1007.
(Reading database ... 9482 files and directories currently installed.)
Preparing to unpack .../linux-gcp-headers-4.13.0-1007_4.13.0-1007.10_all.deb ...
Unpacking linux-gcp-headers-4.13.0-1007 (4.13.0-1007.10) ...
Selecting previously unselected package linux-headers-4.13.0-1007-gcp.
Preparing to unpack .../linux-headers-4.13.0-1007-gcp_4.13.0-1007.10_amd64.deb ...
Unpacking linux-headers-4.13.0-1007-gcp (4.13.0-1007.10) ...
Setting up linux-gcp-headers-4.13.0-1007 (4.13.0-1007.10) ...
Setting up linux-headers-4.13.0-1007-gcp (4.13.0-1007.10) ...
Downloading kernel sources... DONE.
Configuring installation directories...
/usr/local/nvidia /
Updating container's ld cache...

@rohitagarwal003 (Contributor)

GPUs on the Ubuntu image are not officially supported by GKE yet: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#limitations

However, they do work. :)

After starting the cluster, just run:

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset.yaml

No need to install anything manually, and no need to replace any paths.
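
If you want to sanity-check it, watching the installer pods and then looking for the GPU in the node's allocatable resources is usually enough (the label and resource name below match the daemonset shown in this thread):

kubectl get pods -n kube-system -l name=nvidia-driver-installer
kubectl describe nodes | grep -i "nvidia.com/gpu"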

vishh (Collaborator) commented Feb 28, 2018

Can you describe why you'd like to use Ubuntu instead of COS?

kozikow (Author) commented Feb 28, 2018

However, they do work. :)

The installer gets stuck on "Updating container's ld cache...", even if I wait for hours. Repro steps for a cluster started from scratch a moment ago: https://gist.github.com/kozikow/e3998c3b8a87840aa5c0aa684a21245d

Can you describe why you'd like to use Ubuntu instead of COS?

For development. Using the Ubuntu node type allows me to SSH to the GCE instance created by GKE, check out code, and build images directly on the node. Using COS for that is not possible, as our Docker build uses things like Ansible and ansible-vault.

Other dev options for GPUs with Kubernetes are worse:

  • Minikube does not support GPUs (GPU support kubernetes/minikube#2115).
  • Pushing images every time I make a change locally is very slow.
  • Using nvidia-docker-compose for development does not let me use some Kubernetes features.

@rohitagarwal003 (Contributor)

The installer gets stuck on "Updating container's ld cache...", even if I wait for hours. Repro steps for a cluster started from scratch a moment ago: https://gist.github.com/kozikow/e3998c3b8a87840aa5c0aa684a21245d

I spoke too soon. Try with 1.9.3-gke.0, which is rolling out this week.
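
For example, recreating the cluster from the first comment against the newer version would just mean changing --cluster-version (everything else left as you had it):

gcloud beta container clusters create \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --zone=$CLUSTER_ZONE \
  --num-nodes=1 \
  --cluster-version=1.9.3-gke.0 \
  --machine-type=n1-standard-8 \
  --image-type=ubuntu \
  --scopes=https://www.googleapis.com/auth/devstorage.read_write \
  $CLUSTER_NAME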

@cmluciano

I get the same error, albeit I'm deploying to bare metal. ldconfig hangs indefinitely for me when using the installer code on a fresh 1.9.3 cluster.

OS: Ubuntu 16.04.4
Kernel: 4.4.0-116-generic

rohitagarwal003 (Contributor) commented Mar 2, 2018

It works now on a GKE 1.9.3 cluster (@kozikow, can you confirm?).

The reason it was failing on GKE 1.9.2 is that GKE 1.9.2's Ubuntu image ships a Docker configured with the aufs storage driver, while GKE 1.9.3's Ubuntu image ships a Docker configured with overlay2.

It doesn't really have anything to do with the Kubernetes version.
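
For anyone hitting the same hang outside GKE, checking and switching Docker's storage driver looks roughly like this (a sketch assuming a systemd-managed Docker and a kernel with overlay2 support; the daemon.json write overwrites any existing config):

docker info --format '{{.Driver}}'                                  # prints the current storage driver, e.g. aufs or overlay2
echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json   # clobbers any existing daemon.json
sudo systemctl restart docker                                       # restarting Docker restarts running containers

Note that switching drivers hides anything previously stored under aufs, so existing images would need to be re-pulled.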

@cmluciano

Confirmed: switching to overlay2 got me past the ldconfig hang too.

@minterciso

GPUs on the Ubuntu image are not officially supported by GKE yet: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#limitations

However, they do work. :)

After starting the cluster, just run:

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset.yaml

No need to install anything manually, and no need to replace any paths.

Hello, I've been trying to get Kubernetes with CUDA working for some days now.
I'm following the instructions at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus, which state just what you said: use 'kubectl create -f daemonset.yaml'. That goes smoothly, but even so the driver is not installed on the node, and consequently I'm unable to run any Docker image.

chardch (Contributor) commented Jun 29, 2019

@minterciso Have you tried applying the following daemonset from https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers?

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
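
If it still doesn't come up after that, the installer's progress shows up in the init container's logs, the same way it was pulled earlier in this thread (the pod name is whatever the daemonset generated for your node):

kubectl get pods -n kube-system | grep nvidia-driver-installer
kubectl logs -n kube-system <installer-pod-name> -c nvidia-driver-installer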

@minterciso

@minterciso Have you tried applying the following daemonset from https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers?

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

Yeah, I did; same issue. I found the solution last night, though: it seems that nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml (and its COS counterpart) is older than what I was using for development. Fiddling around a little on GitHub, I found that there's another daemonset.yaml, namely https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml (COS only), which is actually updated with the latest NVIDIA driver. If someone wants to hear my tale: http://portfolio.geekvault.org/blog/5/
But basically:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml

chardch (Contributor) commented Jul 1, 2019

@minterciso I see you're using CUDA 10.1. The NVIDIA 418 driver required for CUDA 10.1 is not yet supported for Ubuntu on GKE. The COS image has support for 418 starting from 1.13.6-gke.6.

@minterciso

@minterciso I see you're using CUDA 10.1. The NVIDIA 418 driver required for CUDA 10.1 is not yet supported for Ubuntu on GKE. The COS image has support for 418 starting from 1.13.6-gke.6.

Yeah, that's what I found out as well, but honestly the documentation is very blurry on that, especially because even when using COS (as stated in the document) you only get CUDA 10.0.

chardch (Contributor) commented Jul 2, 2019

Yeah, agreed. The document will be updated soon to reflect the changes, since 1.13.6-gke.6 is pretty recent. The preloaded daemonset will install the 418 driver for 1.13.6-gke.6 and above.
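
A quick way to confirm which version a node pool is actually running: kubectl reports the kubelet version per node (e.g. v1.13.6-gke.6), and gcloud shows the cluster's node version (the grep below just filters the default YAML output):

kubectl get nodes
gcloud container clusters describe $CLUSTER_NAME --zone=$CLUSTER_ZONE | grep -i nodeversion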
