This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] CD cu102 110 test stage [Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU] #19948

Closed
Zha0q1 opened this issue Feb 24, 2021 · 4 comments


Zha0q1 commented Feb 24, 2021

This issue started to happen after we switched to the new AMI for restricted-mxnetlinux-gpu, which has the newer NVIDIA driver 460.32.03.

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1553/pipeline

This happened to cu102 and cu110, but not cu100 or cu101. I was able to reproduce it by building essentially the same image as in the CD pipeline, on the same g3 instance with the same AMI:

docker build -f docker/Dockerfile.build.ubuntu_gpu_cu102 --build-arg USER_ID=1001 --build-arg GROUP_ID=1001 --cache-from 021742426385.dkr.ecr.us-west-2.amazonaws.com/mxnet-ci:build.ubuntu_gpu_cu102-81dcd5660530 -t 021742426385.dkr.ecr.us-west-2.amazonaws.com/mxnet-ci:build.ubuntu_gpu_cu102-81dcd5660530 docker

After entering the Docker container I ran

pip3 install mxnet-cu102

I was able to reproduce the exact error by running:

>>> import mxnet
>>> import mxnet as mx
>>> ctx = mx.gpu(0)
>>> a = mx.nd.ones((100), ctx=ctx)
[02:55:03] src/base.cc:49: GPU context requested, but no GPUs found.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py", line 3295, in ones
    return _internal._ones(shape=shape, ctx=ctx, dtype=dtype, **kwargs)
  File "<string>", line 39, in _ones
  File "/usr/local/lib/python3.7/dist-packages/mxnet/_ctypes/ndarray.py", line 91, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/engine/threaded_engine.cc", line 331
MXNetError: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU
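The `-1` in the failed check (`device_count_ > 0 (-1 vs. 0)`) suggests the driver API call itself failed inside the container, rather than it reporting zero devices. Here is a minimal sketch of the same probe — not MXNet's code, just a hypothetical illustration that calls the CUDA driver API directly via `ctypes` and returns the same `-1` sentinel on failure:

```python
import ctypes

def cuda_device_count():
    """Return the number of CUDA devices visible to this process,
    or -1 if the driver library cannot be loaded or initialized
    (mirroring the sentinel value in MXNet's failed check)."""
    try:
        # The NVIDIA driver API library; missing/unmapped in a
        # container started without GPU support.
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return -1
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return -1
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return -1
    return count.value

print(cuda_device_count())
```

On a host with a working driver this prints the device count; inside a broken container it prints `-1`, matching the symptom above.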


Zha0q1 commented Feb 24, 2021

CC @josephevans @leezu @ptrendx


Zha0q1 commented Feb 25, 2021

I think this is an NVIDIA driver issue. I tried:

  1. switching the instance type from g3 (M60 GPU) to g4 (T4 GPU)
  2. creating the same image on my p3 (V100 GPU) instance
  3. uninstalling the 460 drivers and reinstalling the 450/460 drivers

and none of these attempts worked.

I also tried PyTorch with CUDA 10.2 inside the same Docker container, and it also could not find a GPU.
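Since both frameworks fail the same way, one quick framework-agnostic check is whether the container runtime mapped the GPU devices in at all. A small sketch (a generic diagnostic, not specific to MXNet or PyTorch) that lists the NVIDIA device nodes:

```python
import glob

def nvidia_device_nodes():
    """List /dev/nvidia* device nodes. Inside a container, an empty
    list usually means the NVIDIA runtime (or --gpus all) did not map
    the GPU devices in, so no framework will see a GPU."""
    return sorted(glob.glob("/dev/nvidia*"))

print(nvidia_device_nodes())
```

If this prints an empty list inside the container while `/dev/nvidia0` etc. exist on the host, the problem is in the container runtime setup rather than in any one framework.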


Zha0q1 commented Feb 25, 2021

On the 1.x CD we use <cuda version>-devel-ubuntu16.04 as our base images, and I enter the containers with --gpus all.


Zha0q1 commented Mar 4, 2021

fixed

@Zha0q1 Zha0q1 closed this as completed Mar 4, 2021