This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet gets stuck when running example/image-classification/train_mnist.py #1468

Closed
Azure-Vani opened this issue Feb 14, 2016 · 2 comments

Comments

@Azure-Vani
Contributor

The process gets stuck when I run train_mnist.py on Ubuntu 15.04. The only change I made was editing config.mk to use OpenBLAS as the BLAS backend before building MXNet.

The last few lines of the output of strace python train_mnist.py are:

clone(child_stack=0x7f71d3ffeff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71d3fff9d0, tls=0x7f71d3fff700, child_tidptr=0x7f71d3fff9d0) = 2337
futex(0x24533cc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24533c8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
brk(0x292f000) = 0x292f000
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=414, ...}) = 0
write(2, "2016-02-14 18:32:45,836 Node[0] "..., 612016-02-14 18:32:45,836 Node[0] Start training with [cpu(0)]
) = 61
brk(0x2992000) = 0x2992000
brk(0x29f4000) = 0x29f4000
brk(0x2a5e000) = 0x2a5e000
brk(0x2acd000) = 0x2acd000
futex(0x23e0aa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x23e0aa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x23e0a70, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x23e0ad0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x23e0ad4, FUTEX_WAIT_PRIVATE, 3, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x23e0a70, FUTEX_WAKE_PRIVATE, 1) = 0
brk(0x2aee000) = 0x2aee000
futex(0x7f71dc001260, FUTEX_WAKE_PRIVATE, 1) = 1
brk(0x2b50000) = 0x2b50000
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x23e0aa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x23e0aa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 3, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x23e0aa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x23e0aa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 5, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 7, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 9, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x23e0aa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x23e0aa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 11, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 13, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 15, NULL) = 0
futex(0x7f71dc000e40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x23e0aa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x23e0aa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f71dc000e6c, FUTEX_WAIT_PRIVATE, 17, NULL

It seems to be stuck on a mutex. Does anyone know what is happening?

@Azure-Vani
Contributor Author

Here is the gdb backtrace at the point where the process hangs:

(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fffd5890d1c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007fffd7048373 in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::<lambda()> >(std::unique_lock<std::mutex> &, mxnet::engine::ThreadedEngine::<lambda()>) (this=0x7fffb4000e68, __lock=..., __p=...) at /usr/include/c++/4.9/condition_variable:98
#3 0x00007fffd7047dab in mxnet::engine::ThreadedEngine::WaitForVar (this=0x7fffb4000e30, var=0x1569750) at src/engine/threaded_engine.cc:321
#4 0x00007fffd6d97c60 in mxnet::NDArray::WaitToRead (this=0x1573190) at include/mxnet/./ndarray.h:96
#5 0x00007fffd6ecf489 in mxnet::NDArray::SyncCopyToCPU (this=0x1573190, data=0x15744c0, size=128) at src/ndarray/ndarray.cc:635
#6 0x00007fffd70c886d in MXNDArraySyncCopyToCPU (handle=0x1573190, data=0x15744c0, size=128) at src/c_api/c_api.cc:190
#7 0x00007ffff63bfd90 in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#8 0x00007ffff63bf7f8 in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#9 0x00007ffff65cf0a5 in _ctypes_callproc () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#10 0x00007ffff65d3a42 in ?? () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#11 0x00000000004cd9ab in PyEval_EvalFrameEx ()
#12 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#13 0x00000000004ce7d3 in PyEval_EvalFrameEx ()
#14 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#15 0x00000000004ce7d3 in PyEval_EvalFrameEx ()
#16 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#17 0x00000000004ce7d3 in PyEval_EvalFrameEx ()
#18 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#19 0x00000000004ce7d3 in PyEval_EvalFrameEx ()
#20 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#21 0x00000000004cd217 in PyEval_EvalFrameEx ()
#22 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#23 0x00000000004cd217 in PyEval_EvalFrameEx ()
#24 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#25 0x00000000004cd217 in PyEval_EvalFrameEx ()
#26 0x00000000004cb6b1 in PyEval_EvalCodeEx ()
#27 0x000000000050481f in ?? ()
#28 0x00000000004fc182 in PyRun_FileExFlags ()
#29 0x00000000004fb247 in PyRun_SimpleFileExFlags ()
#30 0x000000000049aa6e in Py_Main ()
#31 0x00007ffff7811a40 in __libc_start_main (main=0x49a500 <main>, argc=2, argv=0x7fffffffe408, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe3f8) at libc-start.c:289
#32 0x000000000049a429 in _start ()
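
Frames #2 to #4 show the main thread blocked in a predicate-based condition-variable wait inside ThreadedEngine::WaitForVar, reached from NDArray::WaitToRead. The sketch below is only an illustration of that general pattern in Python's threading module, not MXNet's implementation: the names Var, mark_done, and wait_for are hypothetical, and the timeout exists only so the demo terminates. It shows that such a wait returns once a worker signals completion, and blocks indefinitely (here, until the timeout) if no worker ever does.

```python
import threading


class Var:
    """Hypothetical stand-in for a variable with one pending operation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False

    def mark_done(self):
        # Called by a worker thread when the pending operation finishes.
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def wait_for(self, timeout=None):
        # Blocks until mark_done() has run. With timeout=None this waits
        # forever if no worker ever signals, which looks like a hang.
        with self._cond:
            return self._cond.wait_for(lambda: self._done, timeout=timeout)


def demo():
    # Case 1: a worker signals completion shortly after we start waiting.
    finished = Var()
    worker = threading.Timer(0.05, finished.mark_done)
    worker.start()
    ok = finished.wait_for(timeout=2.0)
    worker.join()

    # Case 2: nothing ever signals this variable, so the wait times out.
    never = Var()
    stuck = never.wait_for(timeout=0.1)
    return ok, stuck


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, a hang at this frame would be consistent with the operation being waited on never completing on a worker thread, so the backtraces of the other threads (for example via `thread apply all bt` in gdb) are the next place to look.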

@szha
Member

szha commented Sep 28, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

@szha closed this as completed Sep 28, 2017