
mxnet gets stuck on cudaMemGetInfo #6281

Closed
conopt opened this issue May 16, 2017 · 7 comments

@conopt
Contributor

conopt commented May 16, 2017

Environment info

Operating System: CentOS with cuda V8.0.61

Compiler: g++ 5.3.1

MXNet commit hash (git rev-parse HEAD): 3d545d7

Steps to reproduce

  1. cd cpp-package/example
  2. ./get_mnist.sh
  3. make mlp_gpu && ./mlp_gpu

Part of the gdb backtrace:
#0 0x00007fff5d718990 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#1 0x00007fff5d718ac6 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#2 0x00007fff5d778e8a in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#3 0x00007fff5d71fecb in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#4 0x00007fff5d99becf in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#5 0x00007fff5d99bf39 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#6 0x00007fff5d5eed6d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#7 0x00007fff5d5f64f8 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#8 0x00007fff5dbf140d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#9 0x00007fff5d5f9b94 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#10 0x00007fff5d5fb2e9 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#11 0x00007fff5d5f1abc in _cuda_CallJitEntryPoint ()
from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#12 0x00007fffc4bff582 in fatBinaryCtl_Compile ()
from /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.375.26
#13 0x00007fffd3625e42 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#14 0x00007fffd36269c3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#15 0x00007fffd357f35e in ?? () from /usr/lib64/nvidia/libcuda.so.1
#16 0x00007fffd357f640 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#17 0x00007fffe30dfa5d in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#18 0x00007fffe30d3e60 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#19 0x00007fffe30decc6 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#20 0x00007fffe30e3401 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#21 0x00007fffe30d672e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#22 0x00007fffe30c3e8e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#23 0x00007fffe30f417c in cudaMemGetInfo () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#24 0x00007fffe652aea5 in mxnet::storage::GPUPooledStorageManager::Alloc (this=0xa5fe80,
raw_size=401408) at src/storage/./pooled_storage_manager.h:77
#25 0x00007fffe652b3f9 in mxnet::StorageImpl::Alloc (this=0x7fff6c0052d0, size=401408, ctx=...)
at src/storage/storage.cc:86
#26 0x00007fffe6010bfa in mxnet::NDArray::Chunk::CheckAndAlloc (this=0xa6c790)
at include/mxnet/./ndarray.h:391
#27 0x00007fffe6010bb5 in mxnet::NDArray::Chunk::Chunk (this=0xa6c790, size=100352, ctx=...,
delay_alloc=false, dtype=0) at include/mxnet/./ndarray.h:386

It only gets stuck on CUDA 8.0.61. I tried another machine with CUDA 8.0.44 and it worked fine.

@sifmelcara
Contributor

Cannot reproduce; I have CUDA 8.0.61 and NVIDIA driver 378.13.
Maybe the issue is only relevant to the NVIDIA driver version?
Sorry, I cannot test 375.26 since it does not support my GPU.

@conopt
Contributor Author

conopt commented May 18, 2017

Yes, my driver version is 375.26...

@conopt
Contributor Author

conopt commented May 18, 2017

@sifmelcara just tried 375.51. It takes 6 minutes to allocate memory...

@sifmelcara
Contributor

Do you mean the mlp_gpu example? That is pretty strange (it only takes 2~3 seconds on my machine).
Maybe run some GPU/CUDA stress-test program to help identify the problem?
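
In that spirit, a minimal standalone probe (the file name and structure here are my own sketch, not anything shipped with MXNet or CUDA) can show whether the very first cudaMemGetInfo call is slow on its own, or only when it has to load a large fat binary like MXNet's:

// memgetinfo_probe.cu -- time the first CUDA runtime call in an otherwise empty program.
// Build and run:
//   nvcc memgetinfo_probe.cu -o memgetinfo_probe && ./memgetinfo_probe
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  auto t0 = std::chrono::steady_clock::now();
  // The first runtime API call initializes the CUDA context; in the MXNet
  // backtrace above this is also the point where the fat binary gets JIT-compiled.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  auto t1 = std::chrono::steady_clock::now();
  if (err != cudaSuccess) {
    std::printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("free %zu MB / total %zu MB, first call took %.3f s\n",
              free_bytes >> 20, total_bytes >> 20,
              std::chrono::duration<double>(t1 - t0).count());
  return 0;
}

If this finishes quickly, the driver itself is fine and the delay is specific to loading MXNet's binary, which is what the later comments confirm.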

@conopt
Contributor Author

conopt commented May 18, 2017

I know what happened. This issue is caused by the CUDA JIT. I checked the default configuration of the Makefile: the -gencode options for nvcc don't include the compute 61 arch, so every kernel is JIT-compiled for compute 61 at runtime because my GPU arch is 61. As mentioned in https://groups.google.com/d/msg/arrayfire-users/D3RORyrvn4s/N7AoKueSCAAJ, the conversion takes on the order of minutes, which matches my observation. It works fine after adding sm_61 to the -gencode options.
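
A minimal sketch of that distinction, assuming an sm_61 GPU (the file name and kernel are hypothetical, not MXNet code); the nvcc commands in the comments show the two -gencode variants:

// jit_probe.cu -- toy illustration of when the PTX JIT kicks in.
// Build with PTX only for an older arch; on an sm_61 GPU the driver must
// JIT-compile it at load time (the same path as in the backtrace above):
//   nvcc -gencode arch=compute_52,code=compute_52 jit_probe.cu -o jit_probe_ptx
// Build with native SASS for sm_61 instead; no runtime JIT is needed:
//   nvcc -gencode arch=compute_61,code=sm_61 jit_probe.cu -o jit_probe_sass
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
  auto t0 = std::chrono::steady_clock::now();
  cudaFree(0);                      // first runtime call: creates the context and loads the fat binary
  noop<<<1, 1>>>();                 // make sure the kernel image is actually needed
  cudaDeviceSynchronize();
  auto t1 = std::chrono::steady_clock::now();
  std::printf("startup + first launch: %.3f s\n",
              std::chrono::duration<double>(t1 - t0).count());
  return 0;
}

For a one-line kernel the JIT is nearly instant either way; the minutes-long hang here comes from JIT-compiling MXNet's entire, much larger fat binary, which is why adding the native arch to the -gencode list removes it.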

@piiswrong cmake handles different CUDA archs correctly (https://github.com/dmlc/mshadow/blob/master/cmake/Cuda.cmake). Do you think it's a good idea to port that part to the Makefile? I don't know how the pip release handles this problem, but since C++ users need to compile it themselves, it would make life easier to detect GPU archs automatically.
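
For reference, the detection step itself is small. Something along these lines (a sketch of the idea only; not necessarily how Cuda.cmake implements it) could be compiled and run by the Makefile to pick -gencode flags automatically:

// detect_arch.cu -- hypothetical helper a build script could compile and run
// to choose -gencode flags for the GPUs actually present.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 1;
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
    // Prints e.g. "61" for a compute-capability-6.1 card; the build can map
    // this to "-gencode arch=compute_61,code=sm_61".
    std::printf("%d%d\n", prop.major, prop.minor);
  }
  return 0;
}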

@szha
Member

szha commented May 19, 2017

The pip packages rely on NVRTC, and building for all archs is turned off.

@szha
Member

szha commented Jun 7, 2017

@lx75249 is this still on-going?

@conopt conopt closed this as completed Jun 7, 2017