CUDA runtime error when running with pytorch benchmark_byteps.py #20

Closed
un-knight opened this issue Jun 28, 2019 · 22 comments

@un-knight
Contributor

un-knight commented Jun 28, 2019

Describe the bug
Got a CUDA runtime error when running the PyTorch benchmark_byteps.py script.

Error info:

BytePS launching worker
running benchmark...
Model: resnet50
Batch size: 32
Number of GPUs: 1
Running warmup...
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 109, in <module>
    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
    return Timer(stmt, setup, timer).timeit(number)
  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
    timing = self.inner(it, self.timer)
  File "/usr/lib/python2.7/timeit.py", line 100, in inner
    _func()
  File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 90, in benchmark_step
    output = model(data)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
    x = self.conv1(x)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

To Reproduce
Steps to reproduce the behavior:
Follow the step-by-step tutorial; I used the official bytepsimage/worker_pytorch image.

Environment (please complete the following information):
Same as the official BytePS PyTorch worker image.


@bobzhuyb
Member

With only 1 GPU, BytePS is not involved in the training at all.

That said, we'll double-check. Could you provide more information about your OS and CUDA version outside Docker?
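
For reference, the kind of environment report requested here can be generated with PyTorch's built-in collector; a minimal sketch, assuming PyTorch is importable in the environment being inspected:

from torch.utils import collect_env  # standard PyTorch helper, not BytePS-specific

# Prints OS, CUDA/cuDNN versions, driver version and GPU models in one report.
print(collect_env.get_pretty_env_info())

The same report can be produced from the shell with python -m torch.utils.collect_env.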

@un-knight
Contributor Author

un-knight commented Jun 28, 2019

@bobzhuyb The single GPU is just due to the NVIDIA_VISIBLE_DEVICES setting; with 4 GPUs the same error occurs.

Here is some information about my host machine:

OS: CentOS Linux release 7.6.1810 (Core)
CUDA: 10.0.130
nvidia driver: 418.43

And here is information about the Docker image:

OS: Ubuntu 16.04
CUDA: 9.0.176

Maybe the problem is that the CUDA version inside the image is too old relative to the host driver? I will update CUDA to 10.x and try again.
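
Before rebuilding, a quick sanity check from inside the container is to compare the CUDA version PyTorch was built with against what the driver actually exposes; a minimal sketch, assuming PyTorch is importable:

import torch

print("built with CUDA:", torch.version.cuda)      # e.g. 9.0.176 in the stock image
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))

If is_available() returns False, or the first CUDA call fails, the runtime/driver/GPU combination is the likely culprit rather than BytePS itself.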

@changlan
Contributor

Yes, it seems to be a PyTorch/CUDA issue. I'd try installing the CUDA 10 build of PyTorch as well, since you are using a cutting-edge NVIDIA driver.

@bobzhuyb
Member

Which GPU model are you using? I searched for the error output a bit and found some similar cases. It's possible that you are using GPUs that can only run with CUDA 10, while we provide CUDA 9 in the Docker image.
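
One way to confirm this is to check the card's compute capability; a minimal sketch, assuming the device can be queried at all:

import torch

major, minor = torch.cuda.get_device_capability(0)
print("compute capability: %d.%d" % (major, minor))
# Turing cards (RTX 20xx) report 7.5; CUDA 9 only generates code up to sm_70 (Volta),
# so sm_75 devices need the CUDA 10 toolchain.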

@un-knight
Contributor Author

I built a new Docker image with CUDA 10.0 and pinned gcc to 4.9, but when I run the PyTorch benchmark script I get a core dump:

[2019-06-28 10:33:49.774628: F byteps/common/shared_memory.cc:39] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid argument
(the same check failure is printed by each of the four worker processes, with their output interleaved)

/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:    35 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/anaconda/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 19, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/opt/anaconda/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134.

Image environment:

PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 418.43
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.16.2
[pip] numpydoc==0.8.0
[pip] torch==1.0.1.post2
[pip] torchvision==0.2.2
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.3                      199  
[conda] mkl-service               1.1.2            py37he904b0f_5  
[conda] mkl_fft                   1.0.10           py37ha843d7b_0  
[conda] mkl_random                1.0.2            py37hd81dba3_0  
[conda] pytorch                   1.0.1           py3.7_cuda10.0.130_cudnn7.4.2_2    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision               0.2.2                      py_3    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

@bobzhuyb
Member

bobzhuyb commented Jun 28, 2019

Thanks for the detailed info. Hmmm... looks like the 2080 Ti is causing some trouble. First, it requires CUDA 10. Second, it seems to be having problems with cudaHostRegister.

The problem is we don't have this card on hand. Would you do us a favor and comment out this line https://github.com/bytedance/byteps/blob/master/byteps/common/shared_memory.cc#L39, then try again?
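
As a rough, indirect test of the pinned-host-memory path (not the BytePS code itself), one can try pinning a buffer from PyTorch; a sketch under that assumption:

import torch

# Rough proxy only: pin_memory() allocates page-locked host memory via the CUDA
# runtime (cudaHostAlloc), whereas BytePS registers existing shared memory with
# cudaHostRegister, but a failure here points at the same pinned-memory machinery.
x = torch.empty(64 * 1024 * 1024, dtype=torch.uint8).pin_memory()
print("pinned OK:", x.is_pinned())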

@bobzhuyb
Member

Can you show us the output of ipcs -lm ?

@un-knight
Contributor Author

@bobzhuyb Of course, here is the output of ipcs -lm:

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

@bobzhuyb
Member

@un-knight Okay, you have enough shared memory. Did you add --shm-size=32768m to your docker run command, as shown in https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md? If so, shared memory is not the problem.

Then the only remaining suspect is the cudaHostRegister() call with the 2080 Ti.
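
As a side note, ipcs -lm reports the System V limits, while Docker's --shm-size controls the tmpfs mounted at /dev/shm, where POSIX shared memory objects live; a quick sketch to check that mount from inside the container:

import os

st = os.statvfs("/dev/shm")
total_mb = st.f_blocks * st.f_frsize // (1024 * 1024)
free_mb = st.f_bavail * st.f_frsize // (1024 * 1024)
# Docker's default /dev/shm is only 64 MB; with --shm-size=32768m this should
# report roughly 32768 MB total.
print("/dev/shm: %d MB total, %d MB free" % (total_mb, free_mb))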

@un-knight
Contributor Author

un-knight commented Jun 30, 2019

Great! Now I can run the benchmark on a single node after expanding the Docker shared memory. To conclude: with a 2080 Ti, the user needs to install CUDA >= 10.0 and then set enough shared memory for the processes to communicate. Thanks for your help @bobzhuyb

@bobzhuyb
Member

Good to know. We'll build cuda10 image/package soon, so that future users don't have this problem.

@bobzhuyb
Member

Closing this issue. Feel free to reopen it if anything comes up.

@un-knight
Contributor Author

Another problem: the benchmark processes don't stop automatically after finishing a task, and the GPU memory isn't released either, so I have to kill the processes manually.
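
Until the exit handling is fixed, a small cleanup sketch like the one below can save some manual killing; the process pattern is hypothetical and must match whatever script was actually launched:

import subprocess
import time

# Hypothetical pattern; adjust it to the actual launch command.
pattern = "benchmark_byteps.py"

subprocess.run(["pkill", "-f", pattern])        # ask nicely first (SIGTERM)
time.sleep(5)
subprocess.run(["pkill", "-9", "-f", pattern])  # force-kill anything left over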

@ymjiang
Member

ymjiang commented Jun 30, 2019

@un-knight Thank you for the feedback. We will take a look at the exit problem.

@un-knight
Contributor Author

@bobzhuyb I got an illegal memory access error:

[2019-07-08 11:29:21.757838: F byteps/common/nccl_manager.cc:35] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
Aborted (core dumped)
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/anaconda/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 19, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/opt/anaconda/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python synthetic_benchmark.py byteps' returned non-zero exit status 134.

@bobzhuyb
Member

bobzhuyb commented Jul 8, 2019

@un-knight In what scenario? Single machine, or distributed mode? Does it happen immediately after starting, or does it run for some iterations first?

@un-knight
Contributor Author

un-knight commented Jul 8, 2019

@un-knight In what scenario? Single machine, or distributed mode? Does it happen immediately after starting, or does it run for some iterations first?

It's strange: the error happens after some iterations on a single machine with multiple GPUs. I haven't tested multi-node yet.

The synthetic PyTorch benchmark runs normally, while the MNIST PyTorch example hits the illegal memory access error above after some iterations.

@bobzhuyb
Member

bobzhuyb commented Jul 8, 2019

@un-knight In what scenario? Single machine, or distributed mode? Does it happen immediately after starting, or does it run for some iterations first?

It's strange: the error happens after some iterations on a single machine with multiple GPUs. I haven't tested multi-node yet.

The synthetic PyTorch benchmark runs normally, while the MNIST PyTorch example hits the illegal memory access error above after some iterations.

"Some iterations": does it always fail at the same iteration? If so, I tend to think the problem is in the example script. Otherwise, it may be something in BytePS's core logic.

How many iterations can it run before it fails?
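
One way to narrow this down is to make CUDA kernel launches synchronous, so the failure surfaces at the offending kernel rather than at a later check; a common debugging sketch, not specific to BytePS:

import os

# Must be set before the first CUDA call in the process (e.g. exported in the
# launch script), so that launches run synchronously and the traceback points
# at the kernel that actually performed the illegal access.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"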

@ymjiang
Member

ymjiang commented Jul 9, 2019

@un-knight The PyTorch MNIST example runs 10 epochs by default. Did your problem happen after 10 epochs? If so, it could be because BytePS does not handle the exit properly.

@un-knight
Contributor Author

un-knight commented Jul 9, 2019

@un-knight The PyTorch MNIST example runs 10 epochs by default. Did your problem happen after 10 epochs? If so, it could be because BytePS does not handle the exit properly.

@ymjiang @bobzhuyb In fact it happens after 1 epoch every time when I run the MNIST example.

Train Epoch: 1 [14720/15000 (98%)]      Loss: 0.416857
Train Epoch: 1 [14720/15000 (98%)]      Loss: 0.351872
Train Epoch: 1 [14720/15000 (98%)]      Loss: 0.524274
/opt/anaconda/lib/python3.7/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/opt/anaconda/lib/python3.7/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/opt/anaconda/lib/python3.7/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/opt/anaconda/lib/python3.7/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
[2019-07-09 02:17:22. 81728: F byteps/common/nccl_manager.cc:35] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   440 Aborted                 (core dumped) python $path/train_mnist_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/anaconda/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 19, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/opt/anaconda/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh' returned non-zero exit status 134.

@un-knight
Contributor Author

So could this problem be an error in the BytePS core?

pleasantrabbit pushed a commit that referenced this issue Jul 13, 2020
* hotfix: update script

* hotfix: comment out....
pleasantrabbit pushed a commit that referenced this issue Nov 3, 2020
* add testcase for mixed mode

* add server load

* fix log