This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] CD cu102 110 test stage [Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU] #19948

Closed
Zha0q1 opened this issue Feb 24, 2021 · 4 comments


Zha0q1 commented Feb 24, 2021

This issue started to happen after we switched to the new AMI for restricted-mxnetlinux-gpu, which has the newer NVIDIA driver 460.32.03.

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1553/pipeline

This happened to cu102 and cu110, but not cu100 or cu101. I was able to reproduce it by building essentially the same image as in the CD pipeline, on the same g3 instance with the same AMI:

docker build -f docker/Dockerfile.build.ubuntu_gpu_cu102 --build-arg USER_ID=1001 --build-arg GROUP_ID=1001 --cache-from 021742426385.dkr.ecr.us-west-2.amazonaws.com/mxnet-ci:build.ubuntu_gpu_cu102-81dcd5660530 -t 021742426385.dkr.ecr.us-west-2.amazonaws.com/mxnet-ci:build.ubuntu_gpu_cu102-81dcd5660530 docker

After entering the Docker container I ran

pip3 install mxnet-cu102

I was able to reproduce the exact error by running:

>>> import mxnet
>>> import mxnet as mx
>>> ctx = mx.gpu(0)
>>> a = mx.nd.ones((100), ctx=ctx)
[02:55:03] src/base.cc:49: GPU context requested, but no GPUs found.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py", line 3295, in ones
    return _internal._ones(shape=shape, ctx=ctx, dtype=dtype, **kwargs)
  File "<string>", line 39, in _ones
  File "/usr/local/lib/python3.7/dist-packages/mxnet/_ctypes/ndarray.py", line 91, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/engine/threaded_engine.cc", line 331
MXNetError: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU
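The `-1` in the failed check (`device_count_ > 0 (-1 vs. 0)`) suggests the driver API call itself failed inside the container, rather than it reporting zero devices. Here is a minimal sketch of the same probe — not MXNet's code, just a hypothetical illustration that calls the CUDA driver API directly via `ctypes` and returns the same `-1` sentinel on failure:

```python
import ctypes

def cuda_device_count():
    """Return the number of CUDA devices visible to this process,
    or -1 if the driver library cannot be loaded or initialized
    (mirroring the sentinel value in MXNet's failed check)."""
    try:
        # The NVIDIA driver API library; missing/unmapped in a
        # container started without GPU support.
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return -1
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return -1
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return -1
    return count.value

print(cuda_device_count())
```

On a host with a working driver this prints the device count; inside a broken container it prints `-1`, matching the symptom above.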


Zha0q1 commented Feb 24, 2021

CC @josephevans @leezu @ptrendx


Zha0q1 commented Feb 25, 2021

I think this is an NVIDIA driver issue. I tried:

  1. switching the instance type from g3 (M60 GPU) to g4 (T4 GPU)
  2. creating the same image on my p3 (V100 GPU) instance
  3. uninstalling the 460 drivers and reinstalling the 450/460 drivers

and none of these attempts worked.

I also tried PyTorch with CUDA 10.2 inside the same Docker container, and it also could not find a GPU.
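Since both frameworks fail the same way, one quick framework-agnostic check is whether the container runtime mapped the GPU devices in at all. A small sketch (a generic diagnostic, not specific to MXNet or PyTorch) that lists the NVIDIA device nodes:

```python
import glob

def nvidia_device_nodes():
    """List /dev/nvidia* device nodes. Inside a container, an empty
    list usually means the NVIDIA runtime (or --gpus all) did not map
    the GPU devices in, so no framework will see a GPU."""
    return sorted(glob.glob("/dev/nvidia*"))

print(nvidia_device_nodes())
```

If this prints an empty list inside the container while `/dev/nvidia0` etc. exist on the host, the problem is in the container runtime setup rather than in any one framework.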


Zha0q1 commented Feb 25, 2021

On the 1.x CD we use <cuda version>-devel-ubuntu16.04 as our base images, and I enter the containers with --gpus all.


Zha0q1 commented Mar 4, 2021

fixed

@Zha0q1 Zha0q1 closed this as completed Mar 4, 2021