
Hangs training on P100 #8695

Closed
amithr1 opened this issue Nov 17, 2017 · 4 comments

Comments

amithr1 commented Nov 17, 2017

I am trying to train ImageNet using the default ResNet on a single node with up to 4 P100s. When I use the master branch, I see hangs. When I attach gdb, I see the following stack trace. If there are useful inputs, I can debug the problem further. The problem only happens with more than 2 GPUs: with 2 GPUs I can run several epochs, but with 4 GPUs it hangs within the first epoch.

(gdb) bt
#0 0x00003fffac2cdd60 in pthread_cond_wait@@GLIBC_2.17 () at /lib64/libpthread.so.0
#1 0x00003fff4777608c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () at /lib64/libstdc++.so.6
#2 0x00003fff6a3e236c in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::__lambda18>(std::unique_lock<std::mutex> &, mxnet::engine::ThreadedEngine::__lambda18) (this=0x3fff2c001198, __lock=..., __p=...) at /usr/include/c++/4.8.2/condition_variable:93
#3 0x00003fff6a3e1d10 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) (this=0x3fff2c001150, var=0x3bff50a6a900) at src/engine/threaded_engine.cc:358
#4 0x00003fff699b6cc8 in mxnet::NDArray::WaitToWrite() const (this=0x3bff49fa0cf0) at include/mxnet/./ndarray.h:330
#5 0x00003fff69be4c88 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const (this=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/ndarray/ndarray.cc:1210
#6 0x00003fff6a44d190 in MXNDArraySyncCopyToCPU(NDArrayHandle, void*, size_t) (handle=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/c_api/c_api.cc:253
#7 0x00003fffabed7254 in () at /lib64/libffi.so.6
#8 0x00003fffabed5f50 in ffi_call () at /lib64/libffi.so.6
#9 0x00003fffa5247b24 in _ctypes_callproc () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00003fffa523a6ac in PyCFuncPtr_call () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
#11 0x00003fffac361444 in PyObject_Call () at /lib64/libpython2.7.so.1.0
#12 0x00003fffac4669f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#13 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#14 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#15 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#16 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#17 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#18 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#19 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#20 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#21 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#22 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#23 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#24 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#25 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#26 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#27 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#28 0x00003fffac46cc64 in PyEval_EvalCode () at /lib64/libpython2.7.so.1.0
#29 0x00003fffac4a0528 in PyRun_FileExFlags () at /lib64/libpython2.7.so.1.0
#30 0x00003fffac4a274c in PyRun_SimpleFileExFlags () at /lib64/libpython2.7.so.1.0
#31 0x00003fffac4a2e9c in PyRun_AnyFileExFlags () at /lib64/libpython2.7.so.1.0
#32 0x00003fffac4beb7c in Py_Main () at /lib64/libpython2.7.so.1.0
#33 0x0000000010000738 in main ()
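For context, frames #0-#6 of this trace are the path taken by any synchronous NDArray read from Python: a ctypes call into MXNDArraySyncCopyToCPU ends up in NDArray::SyncCopyToCPU, which waits on the array's engine variable (WaitToWrite / WaitForVar in this build) until all pending operations on it complete. A hang here means some engine worker never finished, or never scheduled, an operation writing to that array. Below is a minimal sketch of the kind of data-parallel run that exercises this path; the symbol and record-file names, batch size, and epoch count are illustrative assumptions, not the exact command from the report, which uses the stock image-classification example script.

```python
import mxnet as mx

# Data-parallel contexts: the reporter sees the hang once more than two
# devices are in use (fine with 2 GPUs, hangs within the first epoch on 4).
ctx = [mx.gpu(i) for i in range(4)]

# Hypothetical symbol and record file names; the actual report relies on the
# stock ImageNet/ResNet example script rather than this hand-rolled setup.
sym = mx.sym.load('resnet-symbol.json')
train_iter = mx.io.ImageRecordIter(
    path_imgrec='imagenet_train.rec',
    data_shape=(3, 224, 224),
    batch_size=256,
)

mod = mx.mod.Module(symbol=sym, context=ctx)
mod.fit(train_iter,
        num_epoch=90,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        eval_metric='acc')

# Any blocking read of an NDArray during or after training, e.g.
#     arr.asnumpy()
# goes through MXNDArraySyncCopyToCPU -> NDArray::SyncCopyToCPU -> WaitForVar,
# which is exactly where the thread in the trace above is parked on a
# condition variable.
```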

shesung (Contributor) commented Nov 21, 2017

I met the same problem when training with multiple machines.

eric-haibin-lin (Member) commented Dec 2, 2017 via email

starimpact (Contributor) commented:

I encountered the deadlock as well.
MXNet version is v0.8.0.

#0  pthread_cond_wait@@GLIBC_2.3.2 ()
#1  0x00007f39f58198dc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f39f7688559 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) ()
#3  0x00007f39f7701708 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const ()
#4  0x00007f39f765d4ba in MXNDArraySyncCopyToCPU ()

@eric-haibin-lin When was the deadlock fixed? Can you give the hash of the commit, or a link to it?

nswamy (Member) commented Mar 21, 2018

Please try the latest version of MXNet and create a new issue if you encounter the problem again.
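For anyone re-testing, a quick sanity check of the installed build and of each device can be run before kicking off a full training job. This is a small sketch, not part of the original thread; the device count of 4 is assumed from the original report.

```python
import mxnet as mx

# Report exactly which MXNet build is being tested.
print(mx.__version__)

# Touch each GPU with a tiny allocation and a blocking copy back to CPU;
# asnumpy() exercises the same SyncCopyToCPU/WaitForVar path as the hang above.
for i in range(4):
    a = mx.nd.ones((2, 2), ctx=mx.gpu(i))
    print('gpu(%d) ok, sum=%.1f' % (i, a.asnumpy().sum()))
```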

nswamy closed this as completed Mar 21, 2018