
mxnet gets stuck on cudaMemGetInfo #6281

Closed
conopt opened this issue May 16, 2017 · 7 comments

@conopt
Contributor

conopt commented May 16, 2017

Environment info

Operating System: CentOS with cuda V8.0.61

Compiler: g++ 5.3.1

MXNet commit hash (git rev-parse HEAD): 3d545d7

Steps to reproduce

  1. cd cpp-package/example
  2. ./get_mnist.sh
  3. make mlp_gpu && ./mlp_gpu

Part of the gdb backtrace:
#0 0x00007fff5d718990 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#1 0x00007fff5d718ac6 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#2 0x00007fff5d778e8a in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#3 0x00007fff5d71fecb in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#4 0x00007fff5d99becf in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#5 0x00007fff5d99bf39 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#6 0x00007fff5d5eed6d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#7 0x00007fff5d5f64f8 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#8 0x00007fff5dbf140d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#9 0x00007fff5d5f9b94 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#10 0x00007fff5d5fb2e9 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#11 0x00007fff5d5f1abc in _cuda_CallJitEntryPoint ()
from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#12 0x00007fffc4bff582 in fatBinaryCtl_Compile ()
from /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.375.26
#13 0x00007fffd3625e42 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#14 0x00007fffd36269c3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#15 0x00007fffd357f35e in ?? () from /usr/lib64/nvidia/libcuda.so.1
#16 0x00007fffd357f640 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#17 0x00007fffe30dfa5d in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#18 0x00007fffe30d3e60 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#19 0x00007fffe30decc6 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#20 0x00007fffe30e3401 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#21 0x00007fffe30d672e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#22 0x00007fffe30c3e8e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#23 0x00007fffe30f417c in cudaMemGetInfo () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#24 0x00007fffe652aea5 in mxnet::storage::GPUPooledStorageManager::Alloc (this=0xa5fe80,
raw_size=401408) at src/storage/./pooled_storage_manager.h:77
#25 0x00007fffe652b3f9 in mxnet::StorageImpl::Alloc (this=0x7fff6c0052d0, size=401408, ctx=...)
at src/storage/storage.cc:86
#26 0x00007fffe6010bfa in mxnet::NDArray::Chunk::CheckAndAlloc (this=0xa6c790)
at include/mxnet/./ndarray.h:391
#27 0x00007fffe6010bb5 in mxnet::NDArray::Chunk::Chunk (this=0xa6c790, size=100352, ctx=...,
delay_alloc=false, dtype=0) at include/mxnet/./ndarray.h:386

It only gets stuck on CUDA 8.0.61. I tried another machine with CUDA 8.0.44 and it worked fine.

@sifmelcara
Contributor

Cannot reproduce; I have CUDA 8.0.61 and NVIDIA driver 378.13.
Maybe the issue is only relevant to the NVIDIA driver version?
Sorry, I cannot test 375.26 since it does not support my GPU.

@conopt
Contributor Author

conopt commented May 18, 2017

Yes, my driver version is 375.26...

@conopt
Contributor Author

conopt commented May 18, 2017

@sifmelcara just tried 375.51. It takes 6 minutes to allocate memory...

@sifmelcara
Contributor

Do you mean the mlp_gpu example? That is pretty strange (it only takes 2~3 seconds on my machine).
Maybe run some GPU/CUDA stress-test program to help identify the problem?
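
In that spirit, a minimal standalone probe (the file name and structure here are my own sketch, not anything shipped with MXNet or CUDA) can show whether the very first cudaMemGetInfo call is slow on its own, or only when it has to load a large fat binary like MXNet's:

// memgetinfo_probe.cu -- time the first CUDA runtime call in an otherwise empty program.
// Build and run:
//   nvcc memgetinfo_probe.cu -o memgetinfo_probe && ./memgetinfo_probe
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  auto t0 = std::chrono::steady_clock::now();
  // The first runtime API call initializes the CUDA context; in the MXNet
  // backtrace above this is also the point where the fat binary gets JIT-compiled.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  auto t1 = std::chrono::steady_clock::now();
  if (err != cudaSuccess) {
    std::printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("free %zu MB / total %zu MB, first call took %.3f s\n",
              free_bytes >> 20, total_bytes >> 20,
              std::chrono::duration<double>(t1 - t0).count());
  return 0;
}

If this finishes quickly, the driver itself is fine and the delay is specific to loading MXNet's binary, which is what the later comments confirm.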

@conopt
Contributor Author

conopt commented May 18, 2017

I know what happened. This issue is caused by the CUDA JIT. I checked the default configuration of the Makefile: the -gencode options for nvcc don't include the compute 61 arch, so every kernel is JIT-compiled for compute 61 at runtime because my GPU arch is 61. As mentioned in https://groups.google.com/d/msg/arrayfire-users/D3RORyrvn4s/N7AoKueSCAAJ, the conversion takes on the order of minutes, which matches my observation. It works fine after adding sm_61 to the -gencode options.
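
A minimal sketch of that distinction, assuming an sm_61 GPU (the file name and kernel are hypothetical, not MXNet code); the nvcc commands in the comments show the two -gencode variants:

// jit_probe.cu -- toy illustration of when the PTX JIT kicks in.
// Build with PTX only for an older arch; on an sm_61 GPU the driver must
// JIT-compile it at load time (the same path as in the backtrace above):
//   nvcc -gencode arch=compute_52,code=compute_52 jit_probe.cu -o jit_probe_ptx
// Build with native SASS for sm_61 instead; no runtime JIT is needed:
//   nvcc -gencode arch=compute_61,code=sm_61 jit_probe.cu -o jit_probe_sass
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
  auto t0 = std::chrono::steady_clock::now();
  cudaFree(0);                      // first runtime call: creates the context and loads the fat binary
  noop<<<1, 1>>>();                 // make sure the kernel image is actually needed
  cudaDeviceSynchronize();
  auto t1 = std::chrono::steady_clock::now();
  std::printf("startup + first launch: %.3f s\n",
              std::chrono::duration<double>(t1 - t0).count());
  return 0;
}

For a one-line kernel the JIT is nearly instant either way; the minutes-long hang here comes from JIT-compiling MXNet's entire, much larger fat binary, which is why adding the native arch to the -gencode list removes it.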

@piiswrong cmake handles different CUDA archs correctly (https://github.com/dmlc/mshadow/blob/master/cmake/Cuda.cmake). Do you think it's a good idea to port that part to the Makefile? I don't know how the pip release handles this problem, but since C++ users need to compile it themselves, it would make life easier to detect GPU archs automatically.
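
For reference, the detection step itself is small. Something along these lines (a sketch of the idea only; not necessarily how Cuda.cmake implements it) could be compiled and run by the Makefile to pick -gencode flags automatically:

// detect_arch.cu -- hypothetical helper a build script could compile and run
// to choose -gencode flags for the GPUs actually present.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 1;
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
    // Prints e.g. "61" for a compute-capability-6.1 card; the build can map
    // this to "-gencode arch=compute_61,code=sm_61".
    std::printf("%d%d\n", prop.major, prop.minor);
  }
  return 0;
}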

@szha
Member

szha commented May 19, 2017

The pip packages rely on NVRTC, and building for all archs is turned off.

@szha
Member

szha commented Jun 7, 2017

@lx75249 is this still on-going?

@conopt conopt closed this as completed Jun 7, 2017