No available image for pytorch on NVIDIA driver 418 #95

Open
ligonzheng opened this issue Sep 4, 2019 · 3 comments

Comments

@ligonzheng

My NVIDIA driver version is 418. I tried to use BytePS with Docker, but it failed.
The behavior in each setup is as follows:
1. CUDA 10 + PyTorch 1.0.1 (image: bytepsimage/worker_pytorch_rdma:latest)
error: RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

2. CUDA 10 + PyTorch 1.2 (image: bytepsimage/worker_pytorch_rdma:latest, with PyTorch updated inside the container)
error: ImportError: /usr/local/lib/python2.7/dist-packages/byteps-0.1.0-py2.7-linux-x86_64.egg/byteps/torch/c_lib.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE

3. CUDA 9 + PyTorch 1.0.1 (image: bytepsimage/worker_pytorch:latest)
error: RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

4. CUDA 10 + PyTorch 1.0.1 (built from Dockerfile.worker.pytorch.cu100; the only change is "apt-get install libcudnn7" without a pinned version)
error: bps.init() never finishes (deadlock)

5. CUDA 10 + PyTorch 1.2 (built from Dockerfile.worker.pytorch.cu100, with the libcudnn7 and Python versions changed)
error: bps.init() never finishes (deadlock)

So it seems that the PyTorch version does not work, while the TensorFlow version works well.

@ligonzheng
Author

I followed the "Step-by-Step Tutorial", Single Machine Training section.

@bobzhuyb
Member

bobzhuyb commented Sep 4, 2019

Which GPU are you using? Can you paste the exact docker command you used to start the container?

Once you install a new PyTorch version, you must uninstall byteps and install it again. So it's expected that your 2nd case does not work.
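The undefined symbol in the 2nd case names something from PyTorch's C++ library rather than from BytePS, which fits the explanation that c_lib.so no longer matches the installed PyTorch. Below is a minimal Python sketch of decoding that symbol by hand. The helper name `demangle_nested_name` is hypothetical, and it only handles plain nested names (no templates, parameters, or substitutions); `c++filt` is the usual tool for the general case.

```python
def demangle_nested_name(symbol):
    """Decode a simple Itanium-ABI nested name of the form _ZN<len><name>...E.

    Illustrative only: templates, function parameters and substitutions
    are not supported.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a nested mangled name: %r" % symbol)
    pos = len(prefix)
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        # Each component is a decimal length followed by that many characters.
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported mangling at offset %d" % pos)
        length = int(symbol[start:pos])
        name = symbol[pos:pos + length]
        if len(name) != length:
            raise ValueError("truncated component at offset %d" % pos)
        parts.append(name)
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    return "::".join(parts)


print(demangle_nested_name("_ZN2at19UndefinedTensorImpl10_singletonE"))
# prints: at::UndefinedTensorImpl::_singleton
```

The `at::` namespace belongs to PyTorch's ATen tensor library, so the import fails while resolving a PyTorch symbol, not a BytePS one.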

Can you verify that, in your 1st and 3rd cases, PyTorch does not work even without BytePS? I have seen people hit the same issue with the RTX 2080 GPU. https://discuss.pytorch.org/t/a-error-when-using-gpu/32761/14
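One way to run that check is a tiny PyTorch-only script inside the same container, with no BytePS import at all. The following is a rough sketch, not an official diagnostic: the function name `check_cuda_without_byteps` and the report keys are made up for illustration, and it only uses standard `torch` and `torch.cuda` calls.

```python
def check_cuda_without_byteps():
    """Run a tiny GPU computation with PyTorch only and report what happened."""
    report = {"torch_installed": False, "cuda_available": False, "error": None}
    try:
        import torch
    except ImportError as exc:
        report["error"] = str(exc)
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    report["built_for_cuda"] = torch.version.cuda
    try:
        report["cuda_available"] = bool(torch.cuda.is_available())
        if report["cuda_available"]:
            report["device_name"] = torch.cuda.get_device_name(0)
            # A small computation on the GPU; a broken setup typically fails here.
            x = torch.ones(4, device="cuda")
            report["sum"] = float((x * 2).sum().item())
    except RuntimeError as exc:
        # For example "cuda runtime error (11) : invalid argument".
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(check_cuda_without_byteps())
```

If the same RuntimeError shows up in `error` here, the problem lies in the PyTorch/CUDA/GPU combination rather than in BytePS.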

We also had a similar issue before, but it was later resolved.
#20 (comment)

For your 4th and 5th cases, did you try starting with just 1 GPU? Does it still deadlock?

@ligonzheng
Author

Thank you very much for your detailed explanation.
I use a 2080 Ti and start the container with the docker command from the documentation, including the specified shared memory size.
I think there are two kinds of problems.
One is caused by PyTorch. In cases 1 and 3, after I commented out the cudnn.benchmark = False line, the error did not disappear, but the program kept running.
The other is caused by BytePS. In cases 4 and 5, bps.init() deadlocks. However, when I set NVIDIA_VISIBLE_DEVICES=0, the deadlock went away. Why does BytePS deadlock in init() when using the image built from the Dockerfile? The only thing I changed was the cuDNN version, and only because the original cuDNN version could not be downloaded.
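For reproducing the single-GPU run, one option is to build the environment explicitly and pass it to whatever command starts training. This is a minimal sketch: the helper `single_gpu_env` is hypothetical, the variable name follows the nvidia-docker convention `NVIDIA_VISIBLE_DEVICES`, and the launch command is left to the caller because the exact command used here is not shown in this thread.

```python
import os


def single_gpu_env(base_env=None, device_index=0):
    """Return a copy of the environment that exposes only one GPU index."""
    env = dict(os.environ if base_env is None else base_env)
    # Restrict the visible devices to a single index, e.g. "0".
    env["NVIDIA_VISIBLE_DEVICES"] = str(device_index)
    return env


env = single_gpu_env({"PATH": "/usr/bin"}, device_index=0)
print(env["NVIDIA_VISIBLE_DEVICES"])  # prints: 0
```

The returned dictionary can be passed as `env=` to `subprocess.call` or `subprocess.Popen`, so the single-GPU and multi-GPU runs differ only in that one variable.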
