All builds on Cuda 11.2.0 fail because Cog references a non-existing Docker image #1162

Closed
andreemic opened this issue Jul 1, 2023 · 9 comments

Comments

@andreemic

Hey! Since yesterday I've been unable to build any image with Cog, and therefore unable to deploy any models to Replicate.

As far as I understand, an external Nvidia Docker image that Cog references was removed from Docker Hub, which causes builds on certain CUDA versions to fail.

If so, this prevents all Lambda Labs users, and probably many others, from using Cog and deploying to Replicate.

I'm not entirely confident in my explanation, but maybe we can figure it out together.

The Error:

[screenshot of the build error]

@andreemic
Author

@mattt

@jabedbd

jabedbd commented Jul 1, 2023

Facing the same issue today. I cannot push any model to Replicate because of it.

@zba

zba commented Jul 1, 2023

I added this CUDA version to the build section of cog.yaml and it seems to have fixed it, at least temporarily.

build:
  cuda: "11.2.2"

@zsxkib

zsxkib commented Jul 2, 2023

I was dealing with this too! It was weird because everything was working fine and then cog build randomly stopped working. Upon further inspection, I think Nvidia have edited/removed this image.

It even fails on Cog's Hello World example:

$ mkdir hello-world-cog
$ cd hello-world-cog/
$ cog init

Setting up the current directory for use with Cog...

✅ Created /home/sakib/hello-world-cog/cog.yaml
✅ Created /home/sakib/hello-world-cog/predict.py

Done! For next steps, check out the docs at https://cog.run/docs/getting-started
$ sed -i 's/gpu: false/gpu: true/g' cog.yaml
$ cat cog.yaml
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md

build:
  # set to true if your model requires a GPU
  gpu: true

  # a list of ubuntu apt packages to install
  # system_packages:
    # - "libgl1-mesa-glx"
    # - "libglib2.0-0"

  # python version in the form '3.8' or '3.8.12'
  python_version: "3.8"

  # a list of packages in the format <package-name>==<version>
  # python_packages:
    # - "numpy==1.19.4"
    # - "torch==1.8.0"
    # - "torchvision==0.9.0"

  # commands run after the environment is setup
  # run:
    # - "echo env is ready!"
    # - "echo another command if needed"

# predict.py defines how predictions are run on your model
predict: "predict.py:Predictor"
$ sudo cog build --no-cache --debug
Setting CUDA to version 11.2
Setting CuDNN to version 11.2
Building Docker image from environment in cog.yaml as cog-hello-world-cog...
$ docker build --no-cache --file - --build-arg BUILDKIT_INLINE_CACHE=1 --tag cog-hello-world-cog --progress auto .
[+] Building 0.8s (7/7) FINISHED
 => [internal] load build definition from Dockerfile                                                    0.1s
 => => transferring dockerfile: 1.74kB                                                                  0.0s
 => [internal] load .dockerignore                                                                       0.1s
 => => transferring context: 2B                                                                         0.0s
 => resolve image config for docker.io/docker/dockerfile:1.2                                            0.2s
 => CACHED docker-image://docker.io/docker/dockerfile:1.2@sha256:e2a8561e419ab1ba6b2fe6cbdf49fd92b9591  0.0s
 => [internal] load .dockerignore                                                                       0.0s
 => [internal] load build definition from Dockerfile                                                    0.0s
 => ERROR [internal] load metadata for docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04            0.1s
------
 > [internal] load metadata for docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04:
------
Dockerfile:1
--------------------
   1 | >>> # syntax = docker/dockerfile:1.2
   2 |     FROM nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04
   3 |     ENV DEBIAN_FRONTEND=noninteractive
--------------------
ERROR: failed to solve: docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04: not found
ⅹ Failed to build Docker image: exit status 1

$ sudo docker pull docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04
Error response from daemon: manifest for nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04 not found: manifest unknown: manifest unknown
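
A full pull is not strictly needed to confirm that a tag is gone. As a rough sketch, `docker manifest inspect` asks the registry for the manifest without downloading any layers (on older Docker CLIs this subcommand may need experimental features enabled). The helper names `cuda_image_tag`, `image_exists`, and `first_available_cuda` are made up for illustration, and the tag pattern is only inferred from the failing image name in the log above, not taken from Cog's source.

```shell
#!/usr/bin/env bash

# Build an nvidia/cuda tag in the same shape as the one in the failing build.
# Usage: cuda_image_tag <cuda-version> <cudnn-major> <ubuntu-version>
cuda_image_tag() {
  printf 'nvidia/cuda:%s-cudnn%s-devel-ubuntu%s\n' "$1" "$2" "$3"
}

# Succeeds (exit 0) if the registry still serves a manifest for the image.
image_exists() {
  docker manifest inspect "$1" > /dev/null 2>&1
}

# Print the first CUDA version from the arguments whose image still exists.
# cuDNN 8 and Ubuntu 20.04 are assumed here, matching the log above.
first_available_cuda() {
  local version
  for version in "$@"; do
    if image_exists "$(cuda_image_tag "$version" 8 20.04)"; then
      echo "$version"
      return 0
    fi
  done
  return 1
}

# Example (needs Docker and network access):
#   first_available_cuda 11.2.0 11.2.2 11.1.1
```

Whatever version this prints could then be tried as the `cuda:` value in cog.yaml. Whether Cog accepts that version is a separate question that this check does not answer.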

I agree with @zba: adding a specific CUDA version fixed the problem for me. I used:

build:
  gpu: true
  cuda: "11.1.1"
...
  python_version: "3.7"
...
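
For anyone applying this workaround to several projects, here is a small sketch in the spirit of the `sed` one-liner used earlier in this comment. It assumes GNU sed and a cog.yaml that contains a plain `gpu: true` line. The function name `pin_cuda` is hypothetical, and the version you pass should be one whose image you have confirmed still exists.

```shell
#!/usr/bin/env bash

# pin_cuda FILE VERSION
# Adds `cuda: "VERSION"` directly below the `gpu: true` line, reusing its
# indentation. Leaves the file alone if it already has a cuda: key.
pin_cuda() {
  local file="$1" version="$2"
  if grep -Eq '^[[:space:]]*cuda:' "$file"; then
    echo "cuda is already set in $file, leaving it unchanged" >&2
    return 0
  fi
  # GNU sed: \n in the replacement inserts a newline; \1 is the captured indent.
  sed -i -E "s/^([[:space:]]*)gpu: true[[:space:]]*\$/\\1gpu: true\\n\\1cuda: \"${version}\"/" "$file"
}

# Example:
#   pin_cuda cog.yaml 11.2.2
```

Commented-out lines such as `# set to true if your model requires a GPU` are not touched, because the pattern only matches a line consisting of `gpu: true` and whitespace.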

@mattt
Contributor

mattt commented Jul 3, 2023

Hi @andreemic. Thanks for reporting this. I'm working on a fix now with #1163, and hope to get that shipped later today. In the meantime, I recommend using the workaround suggested by @zba and @zsxkib.

garfieldnate added a commit to garfieldnate/whisper-ts-cog that referenced this issue Jul 4, 2023
@mattt
Contributor

mattt commented Jul 4, 2023

@andreemic @zba @zsxkib @jabedbd @garfieldnate I just released Cog v0.8.0-beta9, which should resolve the problems caused by these missing CUDA images.

I tested with a fresh cog init project with gpu: true and was able to build successfully. Please give this release a try and let me know how it's working for you. Thanks! 🙇

mattt closed this as completed Jul 4, 2023
@mattt
Contributor

mattt commented Jul 4, 2023

Update: I found some problems with the changes from #1170 that were released in Cog v0.8.0-beta9. I have a new PR open, #1175, that should resolve them. It is now available to test in v0.8.0-beta10.

Thanks, everyone, for your patience.

mattt reopened this Jul 4, 2023
@IYAIB

IYAIB commented Jul 4, 2023

I was also getting the CUDA error. I tried Cog v0.8.0-beta9 with the "getting started" example, but this is the output:

/usr/local/bin/cog: line 8: syntax error near unexpected token `newline'
/usr/local/bin/cog: line 8: `<!DOCTYPE html>'
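
That message suggests the file at /usr/local/bin/cog is an HTML page rather than the Cog binary, which is what you would expect if the download URL returned an error page and the shell then tried to run it as a script. A minimal sketch of a sanity check follows. The function name `looks_like_html` is made up for illustration, and the 2048-byte window is an arbitrary choice.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) if the start of the file contains an HTML doctype or
# an <html tag, which a real executable should not.
looks_like_html() {
  head -c 2048 "$1" | grep -aqiE '<!DOCTYPE html|<html'
}

# Example:
#   if looks_like_html /usr/local/bin/cog; then
#     echo "/usr/local/bin/cog is an HTML page, not a binary" >&2
#   fi
```

If the check succeeds, the download itself went wrong, so rerunning `cog build` will not help until the binary is replaced.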

@mattt
Contributor

mattt commented Jul 5, 2023

I've confirmed with a few users that v0.8.0-beta11 (see footnote 1) fixes this issue. If you're still seeing this behavior with Beta 11, please let me know. Thank you!

Footnotes

  1. The only change between Betas 10 and 11 is a fix to goreleaser to get built artifacts for our GitHub releases working again.
