CUDA runtime error when running with pytorch benchmark_byteps.py #20
With only 1 GPU, BytePS is not involved in the training at all. That said, we'll double check. Would you provide more information about your OS and CUDA version outside docker?
@bobzhuyb Only 1 GPU is the reason for … Here is some information about my host device:
Here is the information about the docker image:
Maybe the problem is that the CUDA version is too low compared to the host driver? I will update CUDA to a 10.x version and have a try.
Yes, it seems to be a pytorch/cuda issue. I'd say try installing the cuda10 pytorch build as well, since you are using a cutting-edge NVIDIA driver.
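For anyone hitting the same mismatch, a quick sanity check is to print the CUDA toolkit version PyTorch was built against and the compute capability of each visible GPU. This is a minimal sketch, assuming PyTorch is installed inside the container:

```python
# Minimal CUDA stack sanity check inside the worker container.
# Prints the CUDA toolkit version PyTorch was compiled against and the
# compute capability of every visible GPU.
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name} (compute capability {major}.{minor})")
```

A Turing card such as the 2080 Ti reports compute capability 7.5, which the CUDA 9 toolchain in the default image does not support, so that combination alone can produce runtime errors like the one above.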
Which GPU model are you using? I searched for the error output a bit and found some similar cases. It's possible that you are using GPUs that can only run with cuda10, while we provide cuda9 in the docker image.
I built a new docker image with cuda10.0 and also pinned gcc to 4.9, but when I run the pytorch benchmark script I get a core dump error:
Image environment:
Thanks for the detailed info. Hmmm... it looks like the 2080 Ti is causing some trouble. First, it requires CUDA 10. Second, it seems to have some problems when calling cudaHostRegister. The problem is that we don't have this card on hand. Would you do us a favor -- comment out this line https://github.com/bytedance/byteps/blob/master/byteps/common/shared_memory.cc#L39 and try again?
Can you show us the output of ipcs?
@bobzhuyb Of course, here is the output of ipcs:
@un-knight Okay, you have enough shared memory. Did you add --shm-size=32768m to your docker run command, as shown in https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md? If so, you don't have a problem with shared memory. Then the only possible problem is the cudaHostRegister() call with the 2080 Ti.
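One way to confirm that --shm-size actually took effect inside the container is to check how much space /dev/shm reports. A minimal sketch, assuming a Linux container with Python 3:

```python
# Report the size of /dev/shm inside the container; with --shm-size=32768m
# the total should be roughly 32 GiB instead of Docker's 64 MiB default.
import shutil

usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {usage.total / 2**20:.0f} MiB, "
      f"free: {usage.free / 2**20:.0f} MiB")
```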
Great! Now I can run the benchmark on a single node after expanding the docker shared memory. So in conclusion, with a 2080 Ti the user needs to install cuda>=10.0 and then set a large enough shared memory size for the processes to communicate. Thanks for your help @bobzhuyb
Good to know. We'll build a cuda10 image/package soon, so that future users don't have this problem.
Closing this issue. Feel free to reopen it if anything comes up.
Another problem is that the benchmark processes can't stop automatically after finishing a task, and the GPU memory isn't released either. So in this case I have to kill the processes manually.
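Until the exit handling is fixed, a small cleanup script can terminate the stuck workers so their GPU memory is released. This is only a hypothetical helper, assuming Python 3.7+; the benchmark_byteps.py pattern below is an example and may need to match your actual launch command:

```python
# Hypothetical cleanup helper: find worker processes whose command line
# matches the given pattern and send them SIGTERM.
import os
import signal
import subprocess

def terminate_workers(pattern="benchmark_byteps.py"):
    result = subprocess.run(["pgrep", "-f", pattern],
                            capture_output=True, text=True)
    for pid in map(int, result.stdout.split()):
        print(f"terminating pid {pid}")
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process already exited

if __name__ == "__main__":
    terminate_workers()
```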
@un-knight Thank you for the feedback. We will take a look at the exit problem. |
@bobzhuyb I got an illegal memory access problem:
@un-knight In what scenario? Single machine, or distributed?
It's very strange: the error happens after some iterations on a single machine with multiple GPUs, and I haven't tested it with multiple nodes yet. The synthetic pytorch benchmark works normally, while the mnist pytorch example hits the illegal memory access error mentioned above after some iterations.
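For localizing this kind of error, one generic debugging trick (not specific to BytePS) is to make CUDA kernel launches synchronous, so the illegal access is reported at the call that actually triggered it rather than at a later, unrelated operation. A minimal sketch:

```python
# Force synchronous CUDA kernel launches so asynchronous errors such as
# "illegal memory access" surface at their real call site. The variable
# must be set before CUDA is initialized, i.e. before importing torch.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the environment variable)

# ... run the failing mnist example / training loop from here ...
```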
How many iterations can it run before it fails?
@un-knight The pytorch mnist example runs 10 epochs by default. Did your problem happen after 10 epochs? If so, it could be because BytePS does not handle the exit properly.
@ymjiang @bobzhuyb In fact, it happens after 1 epoch every time when I run the mnist example.
So would this problem be a BytePS core error?
Describe the bug
Got a CUDA runtime error when running the pytorch benchmark_byteps.py script.
Error info:
To Reproduce
Steps to reproduce the behavior:
Following the step-by-step tutorial, I use the official bytepsimage/worker_pytorch image.
Environment (please complete the following information):
Same as the byteps official pytorch worker image.