No available image for pytorch on NVIDIA driver 418 #95

ligonzheng · 2019-09-04T12:29:35Z

My NVIDIA driver version is 418. I tried to use BytePS with Docker, but every PyTorch setup failed.
The configurations I tried and the resulting errors are:
1, CUDA 10 + PyTorch 1.0.1 (image: bytepsimage/worker_pytorch_rdma:latest)
error : RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

2, CUDA 10 + PyTorch 1.2 (image: bytepsimage/worker_pytorch_rdma:latest, with PyTorch updated inside the container)
error : ImportError: /usr/local/lib/python2.7/dist-packages/byteps-0.1.0-py2.7-linux-x86_64.egg/byteps/torch/c_lib.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE

3, CUDA 9 + PyTorch 1.0.1 (image: bytepsimage/worker_pytorch:latest)
error : RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

4, CUDA 10 + PyTorch 1.0.1 (built from Dockerfile.worker.pytorch.cu100; the only change is "apt-get install libcudnn7" without a pinned version)
error : bps.init() never finishes (deadlock)

5, CUDA 10 + PyTorch 1.2 (built from Dockerfile.worker.pytorch.cu100, with the libcudnn7 and Python versions changed)
error : bps.init() never finishes (deadlock)

So it seems that the PyTorch version does not work, while the TensorFlow version works well.

ligonzheng · 2019-09-04T12:35:13Z

I am following the "Single Machine Training" section of the "Step-by-Step Tutorial".

bobzhuyb · 2019-09-04T17:33:11Z

Which GPU are you using? Can you paste the exact docker command you used to start the container?

Once you install a new PyTorch version, you must uninstall byteps and install again. So it's expected that your 2nd case does not work.
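To make the ImportError in case 2 easier to read, the missing symbol can be demangled by hand. The sketch below is only a minimal illustration: `demangle_nested` is a hypothetical helper, not part of BytePS or PyTorch, and it handles just the simple `_ZN...E` nested-name form of the Itanium C++ ABI (a real tool such as `c++filt` covers the full grammar). It shows that the name refers to a symbol in PyTorch's `at` namespace, which suggests that c_lib.so was built against a different PyTorch than the one now installed.

```python
def demangle_nested(symbol):
    """Demangle a simple Itanium C++ ABI nested name of the form _ZN<len><id>...E.

    Hypothetical helper for illustration only; it does not handle templates,
    function signatures, or substitutions.
    """
    if not symbol.startswith("_ZN") or not symbol.endswith("E"):
        raise ValueError("not a simple nested name: %r" % symbol)
    body = symbol[3:-1]  # drop the "_ZN" prefix and the closing "E"
    parts = []
    i = 0
    while i < len(body):
        # Each component is a decimal length followed by that many characters.
        j = i
        while j < len(body) and body[j].isdigit():
            j += 1
        if j == i:
            raise ValueError("expected a length prefix at offset %d" % i)
        length = int(body[i:j])
        if j + length > len(body):
            raise ValueError("component runs past the end of the symbol")
        parts.append(body[j:j + length])
        i = j + length
    return "::".join(parts)


if __name__ == "__main__":
    print(demangle_nested("_ZN2at19UndefinedTensorImpl10_singletonE"))
```

For the symbol in case 2 this prints `at::UndefinedTensorImpl::_singleton`.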

Can you verify whether, in your 1st and 3rd cases, PyTorch fails even without BytePS? I have seen people hit the same issue with the RTX 2080 GPU: https://discuss.pytorch.org/t/a-error-when-using-gpu/32761/14
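One way to run that check is a small script that touches the GPU through PyTorch alone, with no BytePS import. This is a rough sketch, not an official diagnostic: `check_cuda` and its status strings are made up for illustration, and the tensor size is arbitrary. If it reports a CUDA error inside the same container, the problem is likely in the PyTorch/CUDA/driver combination rather than in BytePS.

```python
def check_cuda():
    """Return a short status string describing whether plain PyTorch can use the GPU.

    Hypothetical helper; the status strings are illustrative only.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"

    if not torch.cuda.is_available():
        return "cuda-unavailable"

    try:
        # Allocate on the GPU and run a tiny computation to force CUDA initialization.
        x = torch.ones(1024, device="cuda")
        total = float((x * 2).sum().item())
        torch.cuda.synchronize()
    except RuntimeError as exc:
        # Errors such as "cuda runtime error (11)" would surface here.
        return "cuda-error: " + str(exc)

    return "ok" if total == 2048.0 else "wrong-result"


if __name__ == "__main__":
    print(check_cuda())
```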

We also had a similar issue before, which was later resolved:
#20 (comment)

For your 4th and 5th cases, did you try starting with just 1 GPU? Does it still deadlock?

ligonzheng · 2019-09-05T02:22:41Z

Thank you very much for the detailed explanation.
I am using a 2080 Ti, and I start the container with the docker command from the documentation, including the specified shared memory size.
I think there are two kinds of problems.
One is caused by PyTorch. In cases 1 and 3, after I commented out cudnn.benchmark = False, the error did not disappear, but the program kept running.
The other is caused by BytePS. In cases 4 and 5, bps.init() deadlocks. However, when I set NVIDIA_VISIBLE_DEVICES=0, the deadlock went away. Why does BytePS deadlock in init() when using the image built from the Dockerfile? The only thing I changed was the cuDNN version, because the original cuDNN version could not be downloaded.

DeruiLiu mentioned this issue Aug 6, 2020

some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory #282

Closed
