
GPU support #2115

Closed
kozikow opened this issue Oct 24, 2017 · 15 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kozikow
Contributor

kozikow commented Oct 24, 2017

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

FEATURE REQUEST

**Description**:

It would be really great to run GPU workloads on minikube.

I successfully ran a GPU workload on GKE using the instructions from https://docs.google.com/document/d/1hYOqaOVSu68ZaUsmCKwyP6kf6UtlTMiE_hxoJ2uUqvs/edit# . I was looking to replicate this in minikube.

Example pod that successfully runs GPU workload on GKE:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-container
spec:
  volumes:
    - name: nvidia-libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia/lib
  containers:
  - name: gpu-container
    image: mxnet/python:gpu
    args:
      - python
      - -c
      - "import mxnet as mx; a = mx.nd.ones((2, 3), mx.gpu()); b = a * 2 + 1; print b.asnumpy()"
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-libraries
      mountPath: /usr/local/nvidia/lib64
```

Expected output: `[[ 3. 3. 3.] [ 3. 3. 3.]]`

I was looking to replicate this workflow within minikube. I have a correct local GPU setup that runs the image under nvidia-docker.

I installed and started local minikube with:

```shell
wget https://storage.googleapis.com/minikube-builds/2050/minikube-linux-amd64 && mv minikube-linux-amd64 /usr/bin/minikube && chmod +x /usr/bin/minikube
curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/v1.8.0/bin/linux/amd64/kubectl && chmod +x kubectl
sudo gsutil cp gs://minikube/k8sReleases/v1.8.0/localkube-linux-amd64 /usr/local/bin/localkube && chmod +x localkube
```

```shell
export MINIKUBE_WANTUPDATENOTIFICATION=false
export MINIKUBE_WANTREPORTERRORPROMPT=false
export MINIKUBE_HOME=$HOME
export CHANGE_MINIKUBE_NONE_USER=true
mkdir $HOME/.kube || true
touch $HOME/.kube/config
export KUBECONFIG=$HOME/.kube/config
sudo -E minikube start --vm-driver=none
```

I copied all the required CUDA and NVIDIA libraries into the local host directory /home/kubernetes/bin/nvidia/lib.

I added GPU node capacity:

```shell
kubectl proxy
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/alpha.kubernetes.io~1nvidia-gpu", "value": "1"}]' \
  http://127.0.0.1:8001/api/v1/nodes/kozikowpc/status
```
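(Aside: the `~1` in the patch path is JSON Pointer escaping per RFC 6901, which is how the `/` inside the resource name `alpha.kubernetes.io/nvidia-gpu` is encoded. A small sketch of building such a patch body, with a hypothetical helper name:)

```python
# JSON Patch paths use JSON Pointer (RFC 6901) escaping: "~" becomes "~0"
# and "/" becomes "~1", which is why the resource name above appears as
# alpha.kubernetes.io~1nvidia-gpu in the patch path.
def escape_json_pointer_token(token):
    # Order matters: escape "~" first so a literal "/" never turns into "~1"
    # before the "~" pass could double-escape it.
    return token.replace("~", "~0").replace("/", "~1")

resource = "alpha.kubernetes.io/nvidia-gpu"
patch = [{
    "op": "add",
    "path": "/status/capacity/" + escape_json_pointer_token(resource),
    "value": "1",
}]
print(patch[0]["path"])  # /status/capacity/alpha.kubernetes.io~1nvidia-gpu
```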

Yet when I start the same pod as on GKE, I get pod status "CreateContainerConfigError" and the event `kubelet, kozikowpc Error: GPUs are not supported`. I've seen some code for GPU support in minikube: https://github.com/kubernetes/minikube/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/gpu/nvidia/nvidia_gpu_manager.go . Is there anything I am doing wrong?

@r2d4
Contributor

r2d4 commented Oct 24, 2017

@kozikow Have you enabled the feature gate Accelerators=true? Not sure if that's still required, but a Google search turned it up.

@kozikow
Contributor Author

kozikow commented Oct 24, 2017

After adding `--feature-gates=Accelerators=true` to minikube, the container starts, but I get CUDA library errors: https://gist.github.com/kozikow/be44083d4812c554d84271edf01853aa . The same workflow succeeds on GKE or in nvidia-docker.

For reference, another GPU workflow succeeds with a similar pod configuration:

Image: gcr.io/tensorflow/tensorflow:latest-gpu
Code:

```python
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))
```

I suspected a CUDA library mismatch between my host machine and the container. However, the same image starts successfully in nvidia-docker. Is there some magic that nvidia-docker is doing that I am missing?
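(For context, my rough understanding is that nvidia-docker's "magic" is injecting the GPU device nodes and the host's driver libraries into the container at run time. A simplified sketch of the extra `docker run` flags it roughly adds; the function name and the sample device/volume values below are illustrative, not nvidia-docker's actual API:)

```python
def nvidia_docker_extra_args(device_nodes, driver_volume):
    """Approximate the extra flags nvidia-docker (v1) passes to `docker run`:
    one --device flag per GPU/control device node, plus a read-only volume
    carrying the host's driver libraries (matching the driver version)."""
    args = []
    for dev in device_nodes:
        args.append("--device=" + dev)
    args.append("--volume={}:/usr/local/nvidia:ro".format(driver_volume))
    return args

# Illustrative values; the real tool discovers these from the host driver.
args = nvidia_docker_extra_args(
    ["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"],
    "nvidia_driver_384.90",
)
print(" ".join(args))
```

This would explain why the same image works under nvidia-docker but fails in a plain pod: without those mounts, the container only sees whatever libraries were copied into the hostPath volume by hand.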

@vishh
Contributor

vishh commented Oct 25, 2017

We do not support this use case yet because it wasn't clear whether minikube would be used for spinning up k8s clusters on Linux hosts. On hosts where minikube spins up a VM, it is harder to consume GPUs, since that requires isolating extra GPUs on the host and attaching them to the minikube VM.
@dlorenc is vm-driver=none supported officially by minikube? Would it make sense to use kubeadm instead?

@r2d4
Contributor

r2d4 commented Oct 25, 2017

@vishh We already have an option to use kubeadm. The "none" driver is officially supported for localkube, and possibly soon for the kubeadm bootstrapper.

The "none" driver runs the cluster directly on the host without a VM.

@kozikow
Contributor Author

kozikow commented Oct 26, 2017

Is there a recommended way to test GPU workloads locally? minikube with `--vm-driver=none --feature-gates=Accelerators=true` gets pretty close to achieving this: some GPU containers run successfully.

I suspect the only missing link is some CUDA library trickery that GKE or nvidia-docker is doing. I have been reading the code of nvidia-docker ( https://github.com/NVIDIA/nvidia-docker ) and the GKE GPU installer ( https://github.com/ContainerEngine/accelerators/tree/master/cos-nvidia-gpu-installer ), but I haven't found anything yet.

@sebastianlach

Same problem here. It would be great if you could advise how to solve it.

@r2d4 r2d4 added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 9, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 7, 2018
@sebastianlach

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2018
@kozikow
Contributor Author

kozikow commented Feb 23, 2018

FWIW, I described our current setup for developing GPU containers on Kubernetes in https://tensorflight.blog/2018/02/23/dev-environment-for-gke/ . Please let me know if minikube gets GPU support, or if there is any other way to do this.

@Nick-Harvey

I too would like to see this happen 😀

@vishh
Contributor

vishh commented Apr 24, 2018

FYI: This feature is being tackled via the ML Working Group.

@Nick-Harvey

Nick-Harvey commented Apr 25, 2018

Here's my setup, in case it's valuable for the continuation of this RFE:

minikube version: v0.26.1
Kubernetes version being created: 1.10

Starting minikube:
```shell
minikube start --feature-gates=DevicePlugins=true --vm-driver none --feature-gates=Accelerators=true
```

Installing the device plugin:

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
```

Querying the node to see if it sees the GPU:

```shell
$ kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME       GPUs
minikube
```
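(The empty `GPUs` column means `.status.capacity` on the node has no `nvidia.com/gpu` entry yet, i.e. no device plugin has registered the resource. A hypothetical helper, not part of kubectl, mirroring the lookup the custom-columns query performs against the node object:)

```python
# Read .status.capacity["nvidia.com/gpu"] from a node object, defaulting
# to "0" when the device plugin has not registered the resource.
def gpu_capacity(node):
    return node.get("status", {}).get("capacity", {}).get("nvidia.com/gpu", "0")

# Sample node objects (illustrative, shaped like the Kubernetes API response):
node_without_gpu = {"status": {"capacity": {"cpu": "8", "memory": "16Gi"}}}
node_with_gpu = {"status": {"capacity": {"cpu": "8", "nvidia.com/gpu": "1"}}}

print(gpu_capacity(node_without_gpu))  # 0  -> the empty column above
print(gpu_capacity(node_with_gpu))     # 1  -> what a working setup reports
```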

If I understand things correctly, `--vm-driver none` leverages the existing Docker runtime on the host, which I have set to nvidia-docker.

However, no matter what I do, I can't seem to get the node to recognize the GPU as an available resource. I know this isn't officially supported yet :) but I thought I'd contribute my env to help with the progression.

**Edit:** Figured it out. I was using the 1.9 NVIDIA device plugin rather than the 1.10 one. Once I swapped those out, the node recognized the GPU.

@rohitagarwal003
Member

/assign

@aclowkey

@Nick-Harvey This didn't work for me on minikube v0.28.0.

@rohitagarwal003
Member

Hello,
I have a PR that adds GPU support to minikube: #2936. It would be really helpful if people on this thread tried it out. The instructions are in the PR.
Thank you!
