Instructions for GKE ubuntu #57

Closed
kozikow opened this issue Feb 28, 2018 · 13 comments

kozikow commented Feb 28, 2018

I am trying to use this project with a GKE node created with --image-type=ubuntu, but it doesn't work out of the box. Any pointers on how to make it work?

What I tried so far:

Start GKE cluster and ssh to the instance

gcloud beta container clusters create \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --zone=$CLUSTER_ZONE \
  --num-nodes=1 \
  --cluster-version=1.9.2-gke.1 \
  --machine-type=n1-standard-8 \
  --image-type=ubuntu \
  --scopes=https://www.googleapis.com/auth/devstorage.read_write \
  $CLUSTER_NAME

Install GPU libraries on the instance

I followed the installation steps from https://cloud.google.com/compute/docs/gpus/add-gpus

curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
apt-get update
apt-get install cuda-8-0 -y

Libraries seemed to be installed in /usr/lib/nvidia-390:

root@dev-kozikow-instance:/usr/lib/nvidia-390# ls /usr/lib/nvidia-390
alt_ld.so.conf             libGL.so                       libGLX_nvidia.so.0            libnvidia-egl-wayland.so.1           libnvidia-ifr.so.390.30
alternate-install-present  libGL.so.1                     libGLX_nvidia.so.390.30       libnvidia-egl-wayland.so.1.0.2       libnvidia-ml.so
bin                        libGL.so.390.30                libGLdispatch.so.0            libnvidia-eglcore.so.390.30          libnvidia-ml.so.1
bin-workdir                libGLESv1_CM.so                libOpenGL.so                  libnvidia-encode.so                  libnvidia-ml.so.390.30
drivers                    libGLESv1_CM.so.1              libOpenGL.so.0                libnvidia-encode.so.1                libnvidia-ptxjitcompiler.so
drivers-workdir            libGLESv1_CM_nvidia.so.1       libnvcuvid.so                 libnvidia-encode.so.390.30           libnvidia-ptxjitcompiler.so.1
ld.so.conf                 libGLESv1_CM_nvidia.so.390.30  libnvcuvid.so.1               libnvidia-fatbinaryloader.so.390.30  libnvidia-ptxjitcompiler.so.390.30
lib64                      libGLESv2.so                   libnvcuvid.so.390.30          libnvidia-fbc.so                     libnvidia-tls.so.390.30
lib64-workdir              libGLESv2.so.2                 libnvidia-cfg.so              libnvidia-fbc.so.1                   libnvidia-wfb.so.1
libEGL.so                  libGLESv2_nvidia.so.2          libnvidia-cfg.so.1            libnvidia-fbc.so.390.30              libnvidia-wfb.so.390.30
libEGL.so.1                libGLESv2_nvidia.so.390.30     libnvidia-cfg.so.390.30       libnvidia-glcore.so.390.30           tls
libEGL.so.390.30           libGLX.so                      libnvidia-compiler.so         libnvidia-glsi.so.390.30             vdpau
libEGL_nvidia.so.0         libGLX.so.0                    libnvidia-compiler.so.1       libnvidia-ifr.so                     xorg
libEGL_nvidia.so.390.30    libGLX_indirect.so.0           libnvidia-compiler.so.390.30  libnvidia-ifr.so.1

Start driver installer daemonset

I replaced all references to /home/kubernetes/bin/nvidia with /usr/lib/nvidia-390 in https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/daemonset.yaml:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"
      hostNetwork: true
      hostPID: true
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-install-dir-host
        hostPath:
          path: /usr/lib/nvidia-390
      - name: root-mount
        hostPath:
          path: /
      initContainers:
      - image: gcr.io/google-containers/ubuntu-nvidia-driver-installer@sha256:7ffaf40fcf6bcc5bc87501b6be295a47ce74e1f7aac914a9f3e6c6fb8dd780a4
        name: nvidia-driver-installer
        resources:
          requests:
            cpu: 0.5
            memory: 512Mi
        securityContext:
          privileged: true
        env:
          - name: NVIDIA_INSTALL_DIR_HOST
            value: /usr/lib/nvidia-390
          - name: NVIDIA_INSTALL_DIR_CONTAINER
            value: /usr/local/nvidia
          - name: ROOT_MOUNT_DIR
            value: /root
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause

Installer gets stuck

The installer gets stuck for hours on the line "Updating container's ld cache...". FWIW, it also gets stuck on that line even if the NVIDIA driver is not installed on the host.

kubectl logs -f nvidia-driver-installer-88lqp --namespace=kube-system -c nvidia-driver-installer

+ NVIDIA_DRIVER_VERSION=384.111
+ NVIDIA_DRIVER_DOWNLOAD_URL_DEFAULT=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_DRIVER_DOWNLOAD_URL=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALL_DIR_HOST=/usr/lib/nvidia-390
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-384.111.run
+ ROOT_MOUNT_DIR=/root
+ set +x
Downloading kernel sources...
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/universe Sources [72.8 kB]
Get:4 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [584 kB]
Get:5 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.7 kB]
Get:6 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [403 kB]
Get:7 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [3486 B]
Get:8 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
Get:9 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:10 http://archive.ubuntu.com/ubuntu xenial/universe Sources [9802 kB]
Get:11 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe Sources [240 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [951 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.1 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [760 kB]
Get:19 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [18.5 kB]
Get:20 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [5153 B]
Get:21 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7168 B]
Fetched 25.0 MB in 2s (9368 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  linux-gcp-headers-4.13.0-1007
The following NEW packages will be installed:
  linux-gcp-headers-4.13.0-1007 linux-headers-4.13.0-1007-gcp
0 upgraded, 2 newly installed, 0 to remove and 46 not upgraded.
Need to get 11.5 MB of archives.
After this operation, 84.5 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 linux-gcp-headers-4.13.0-1007 all 4.13.0-1007.10 [10.7 MB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 linux-headers-4.13.0-1007-gcp amd64 4.13.0-1007.10 [726 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 11.5 MB in 1s (6909 kB/s)
Selecting previously unselected package linux-gcp-headers-4.13.0-1007.
(Reading database ... 9482 files and directories currently installed.)
Preparing to unpack .../linux-gcp-headers-4.13.0-1007_4.13.0-1007.10_all.deb ...
Unpacking linux-gcp-headers-4.13.0-1007 (4.13.0-1007.10) ...
Selecting previously unselected package linux-headers-4.13.0-1007-gcp.
Preparing to unpack .../linux-headers-4.13.0-1007-gcp_4.13.0-1007.10_amd64.deb ...
Unpacking linux-headers-4.13.0-1007-gcp (4.13.0-1007.10) ...
Setting up linux-gcp-headers-4.13.0-1007 (4.13.0-1007.10) ...
Setting up linux-headers-4.13.0-1007-gcp (4.13.0-1007.10) ...
Downloading kernel sources... DONE.
Configuring installation directories...
/usr/local/nvidia /
Updating container's ld cache...

@rohitagarwal003 (Contributor)

GPUs on the Ubuntu image are not officially supported by GKE yet: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#limitations

However, they do work. :)

After starting the cluster, just run:

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset.yaml

No need to install anything manually, and no need to replace any paths.
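
If you want to sanity-check it, watching the installer pods and then looking for the GPU in the node's allocatable resources is usually enough (the label and resource name below match the daemonset shown in this thread):

kubectl get pods -n kube-system -l name=nvidia-driver-installer
kubectl describe nodes | grep -i "nvidia.com/gpu"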

vishh (Collaborator) commented Feb 28, 2018

Can you describe why you'd like to use Ubuntu instead of COS?

kozikow (Author) commented Feb 28, 2018

However, they do work. :)

The installer gets stuck on "Updating container's ld cache...", even if I wait for hours. Repro steps for a cluster started from scratch a moment ago: https://gist.github.com/kozikow/e3998c3b8a87840aa5c0aa684a21245d

Can you describe why you'd like to use Ubuntu instead of COS?

For development. Using the Ubuntu node type allows me to SSH to the GCE instance created by GKE, check out code, and build images directly on the node. Using COS for that is not possible, as our Docker build uses things like Ansible and ansible-vault.

Other dev options for GPUs with Kubernetes are worse:

  • Minikube does not support GPUs (GPU support kubernetes/minikube#2115).
  • Pushing images every time I make a change locally is very slow.
  • Using nvidia-docker-compose for development does not let me use some Kubernetes features.

@rohitagarwal003 (Contributor)

The installer gets stuck on "Updating container's ld cache...", even if I wait for hours. Repro steps for a cluster started from scratch a moment ago: https://gist.github.com/kozikow/e3998c3b8a87840aa5c0aa684a21245d

I spoke too soon. Try with 1.9.3-gke.0, which is rolling out this week.
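
For example, recreating the cluster from the first comment against the newer version would just mean changing --cluster-version (everything else left as you had it):

gcloud beta container clusters create \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --zone=$CLUSTER_ZONE \
  --num-nodes=1 \
  --cluster-version=1.9.3-gke.0 \
  --machine-type=n1-standard-8 \
  --image-type=ubuntu \
  --scopes=https://www.googleapis.com/auth/devstorage.read_write \
  $CLUSTER_NAME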

@cmluciano

I get the same error, albeit I'm deploying to bare metal. ldconfig hangs indefinitely for me when using the installer code on a fresh 1.9.3 cluster.

OS: Ubuntu 16.04.4
Kernel: 4.4.0-116-generic

rohitagarwal003 (Contributor) commented Mar 2, 2018

It works now on a GKE 1.9.3 cluster (@kozikow, can you confirm?).

The reason it was failing on GKE 1.9.2 is that GKE 1.9.2's Ubuntu image ships a Docker configured with the aufs storage driver, while GKE 1.9.3's Ubuntu image ships a Docker configured with overlay2.

It doesn't really have anything to do with the Kubernetes version.
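
For anyone hitting the same hang outside GKE, checking and switching Docker's storage driver looks roughly like this (a sketch assuming a systemd-managed Docker and a kernel with overlay2 support; the daemon.json write overwrites any existing config):

docker info --format '{{.Driver}}'                                  # prints the current storage driver, e.g. aufs or overlay2
echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json   # clobbers any existing daemon.json
sudo systemctl restart docker                                       # restarting Docker restarts running containers

Note that switching drivers hides anything previously stored under aufs, so existing images would need to be re-pulled.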

@cmluciano

Confirmed: switching to overlay2 got me past the ldconfig hang too.

@minterciso

GPUs on the Ubuntu image are not officially supported by GKE yet: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#limitations

However, they do work. :)

After starting the cluster, just run:

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset.yaml

No need to install anything manually, and no need to replace any paths.

Hello, I've been trying to get Kubernetes with CUDA working for some days now.
I'm following the instructions at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus, which state just what you said: use 'kubectl create -f daemonset.yaml'. That goes smoothly, but even so the driver is not installed on the node, and consequently I'm unable to run any Docker image.

chardch (Contributor) commented Jun 29, 2019

@minterciso Have you tried applying the following daemonset from https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers?

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
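
If it still doesn't come up after that, the installer's progress shows up in the init container's logs, the same way it was pulled earlier in this thread (the pod name is whatever the daemonset generated for your node):

kubectl get pods -n kube-system | grep nvidia-driver-installer
kubectl logs -n kube-system <installer-pod-name> -c nvidia-driver-installer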

@minterciso

@minterciso Have you tried applying the following daemonset from https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers?

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

Yeah, I did; same issue. I found the solution last night, though: it seems that nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml (and its COS counterpart) is older than what I was using for development. Fiddling around a little on GitHub, I found that there's another daemonset.yaml, namely https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml (COS only), which is actually updated with the latest NVIDIA driver. If someone wants to hear my tale: http://portfolio.geekvault.org/blog/5/
But basically:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml

chardch (Contributor) commented Jul 1, 2019

@minterciso I see you're using CUDA 10.1. The NVIDIA 418 driver required for CUDA 10.1 is not yet supported for Ubuntu on GKE. The COS image has support for 418 starting from 1.13.6-gke.6.

@minterciso

@minterciso I see you're using CUDA 10.1. The NVIDIA 418 driver required for CUDA 10.1 is not yet supported for Ubuntu on GKE. The COS image has support for 418 starting from 1.13.6-gke.6.

Yeah, that's what I found out as well, but honestly the documentation is very blurry on that, especially because even when using COS (as stated in the document) you only get CUDA 10.0.

chardch (Contributor) commented Jul 2, 2019

Yeah, agreed. The document will be updated soon to reflect the changes, since 1.13.6-gke.6 is pretty recent. The preloaded daemonset will install the 418 driver for 1.13.6-gke.6 and above.
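
A quick way to confirm which version a node pool is actually running: kubectl reports the kubelet version per node (e.g. v1.13.6-gke.6), and gcloud shows the cluster's node version (the grep below just filters the default YAML output):

kubectl get nodes
gcloud container clusters describe $CLUSTER_NAME --zone=$CLUSTER_ZONE | grep -i nodeversion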
