This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Deadlock happend while calling MXNDArraySyncCopyToCPU() ? #12923

Closed
coconutyao opened this issue Oct 23, 2018 · 8 comments

Comments

@coconutyao

We have been troubled by this problem for a few days and need everyone's help. Thank you!

Environment
GPU: Tesla P4; CPU: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.

Symptoms
The program runs as a server that receives image data. After running for a while, it starts to behave as if deadlocked (possibly triggered by certain requests, but we cannot reproduce it reliably).

We tested MXNet versions 1.0, 1.2, and 1.3, and the program shows the same behavior on all of them.

How the program runs
We embed the Python engine in a multithreaded C++ program that uses the mxnet Python API. As the stack trace shows, MXNDArraySyncCopyToCPU() waits on a condition variable during execution, and the program stays stuck there forever.
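
For context, the call path looks roughly like the sketch below. This is not our actual code, only an illustration: a C++ worker thread acquires the GIL of the embedded interpreter and calls the mxnet Python API, and asnumpy() is what ends up in MXNDArraySyncCopyToCPU().

```cpp
// Illustrative sketch only (not our real application): a C++ worker thread
// driving the embedded CPython interpreter and the mxnet Python API.
#include <Python.h>
#include <thread>

static void run_inference_from_worker_thread() {
    // Any non-main thread must hold the GIL before touching the Python C API.
    PyGILState_STATE gstate = PyGILState_Ensure();

    // Equivalent of:  import mxnet as mx; out = mx.nd.ones((2, 2)).asnumpy()
    // asnumpy() is the call that reaches MXNDArraySyncCopyToCPU() in the stack.
    PyRun_SimpleString(
        "import mxnet as mx\n"
        "out = mx.nd.ones((2, 2)).asnumpy()\n");

    PyGILState_Release(gstate);
}

int main() {
    Py_Initialize();
    PyEval_InitThreads();                             // required on Python 2.7
    PyThreadState* main_state = PyEval_SaveThread();  // release GIL for workers

    std::thread worker(run_inference_from_worker_thread);
    worker.join();

    PyEval_RestoreThread(main_state);
    Py_Finalize();
    return 0;
}
```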

Stack information
Thread 85 (Thread 0x7f3cba52f700 (LWP 41394)):
#0 0x00007f3d582fd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f3d580979bc in __gthread_cond_wait (__mutex=, __cond=) at /data/home/xxx/gcc-build/gcc-4.9.4/build/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00007f3c7bcb86d5 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#4 0x00007f3c7bd94b4d in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#5 0x00007f3c7be7e9c3 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#6 0x00007f3c7bc516db in MXNDArraySyncCopyToCPU () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#7 0x00007f3d53e15adc in ffi_call_unix64 () from my_app/libs/./libffi.so.6
#8 0x00007f3d53e15282 in ffi_call () from my_app/libs/./libffi.so.6
#9 0x00007f3bfdd09376 in _call_function_pointer (argcount=3, resmem=0x7f3b3c1c4040, restype=, atypes=, avalues=0x7f3b3c1c4010, pProc=0x7f3c7bc516b0 , flags=4353) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/callproc.c:841
#10 _ctypes_callproc (pProc=0x7f3c7bc516b0 , argtuple=0x7f3b3c1c4130, flags=4353, argtypes=, restype=0x1616b80, checker=0x0) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/callproc.c:1184
#11 0x00007f3bfdd00db3 in PyCFuncPtr_call (self=, inargs=, kwds=0x0) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/_ctypes.c:3979
#12 0x00007f3d52c42e93 in PyObject_Call (func=0x7f3d2a11a050, arg=, kw=) at Objects/abstract.c:2547
#13 0x00007f3d52cf580d in do_call (nk=, na=, pp_stack=0x7f3b3c1c43b8, func=0x7f3d2a11a050) at Python/ceval.c:4569
#14 call_function (oparg=, pp_stack=0x7f3b3c1c43b8) at Python/ceval.c:4374
#15 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#16 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d3f730030, globals=, locals=, args=, argcount=1, kws=0x7f3d2a186fd0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#17 0x00007f3d52cf71f7 in fast_function (nk=, na=1, n=, pp_stack=0x7f3b3c1c45d8, func=0x7f3d3f6ee5f0) at Python/ceval.c:4447
#18 call_function (oparg=, pp_stack=0x7f3b3c1c45d8) at Python/ceval.c:4372
#19 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#20 0x00007f3d52cf7345 in fast_function (nk=, na=, n=, pp_stack=0x7f3b3c1c4748, func=0x7f3d2aea9c80) at Python/ceval.c:4437
#21 call_function (oparg=, pp_stack=0x7f3b3c1c4748) at Python/ceval.c:4372
#22 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#23 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d528fcc30, globals=, locals=, args=, argcount=2, kws=0x7f3d2a18dc68, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#24 0x00007f3d52cf71f7 in fast_function (nk=, na=2, n=, pp_stack=0x7f3b3c1c4968, func=0x7f3d2a33f0c8) at Python/ceval.c:4447
#25 call_function (oparg=, pp_stack=0x7f3b3c1c4968) at Python/ceval.c:4372
#26 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#27 0x00007f3d52cf7345 in fast_function (nk=, na=, n=, pp_stack=0x7f3b3c1c4ad8, func=0x7f3d2a33f410) at Python/ceval.c:4437
#28 call_function (oparg=, pp_stack=0x7f3b3c1c4ad8) at Python/ceval.c:4372
#29 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#30 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d52963db0, globals=, locals=, args=, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#31 0x00007f3d52c72a61 in function_call (func=0x7f3d2a33f8c0, arg=0x7f3d529377d0, kw=0x0) at Objects/funcobject.c:523
#32 0x00007f3d52c42e93 in PyObject_Call (func=0x7f3d2a33f8c0, arg=, kw=) at Objects/abstract.c:2547
#33 0x00007f3d52ced7b3 in PyEval_CallObjectWithKeywords (func=0x7f3d2a33f8c0, arg=0x7f3d529377d0, kw=) at Python/ceval.c:4221
#34 0x00007f3d52d13468 in PyEval_CallMethod (obj=, methodname=, format=) at Python/modsupport.c:612
#35 0x00007f3d5303141f in ?? ()
#36 0x0000000000000000 in ?? ()


In addition:
Other threads are sometimes blocked at the same time. For example, the stack below comes from an unrelated CPU thread; the strange thing is that libmxnet.so still shows up in it:

Thread 70 (Thread 0x7f3b0bff6700 (LWP 41409)):
#0 0x00007f3d582fd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f3d580979bc in __gthread_cond_wait (__mutex=, __cond=) at /data/home/xxx/gcc-build/gcc-4.9.4/build/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00007f3c7bcb88a3 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#4 0x00007f3c7bcc0339 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#5 0x00007f3d577c4702 in fork () from /lib64/libc.so.6
......

@vtkingdom

Are there any hints as to why a thread can block while calling MXNDArraySyncCopyToCPU(), e.g. under some special situations or usage patterns?

@marcoabreu
Contributor

Hi, MXNet does not support multithreaded interaction with its frontend APIs. Instead, we require a sticky thread for this.

This means that you have to follow the dispatcher model, which dedicates one thread for the entire lifecycle of your application to interact with MXNet. It's important that you don't just use a mutex, since we depend on the thread-local variables that are assigned to the dispatcher thread.
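
In code, the dispatcher model looks roughly like the sketch below. This is only an illustration of the pattern (the class and names are made up, not an MXNet API): one dedicated thread performs every MXNet call, and all other threads hand it work through a queue.

```cpp
// Sketch of the dispatcher model: exactly one thread ever talks to MXNet;
// other threads submit work to it via a queue. Names are illustrative.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class MxnetDispatcher {
public:
    MxnetDispatcher() : worker_([this] { run(); }) {}
    ~MxnetDispatcher() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Called from any thread: enqueue work that touches MXNet.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    // The only thread allowed to call MXNet. Keeping every call here preserves
    // MXNet's thread-local state for the lifetime of the application.
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g. run inference and copy the result back to the caller
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;
};
```

A request-handling thread would then call something like dispatcher.submit([&] { /* forward pass, asnumpy(), ... */ }); and wait for the result on its own synchronization primitive, so MXNet is only ever driven from that single thread.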

@coconutyao
Author

> Hi, MXNet does not support multithreaded interaction with its frontend APIs. Instead, we require a sticky thread for this.
>
> This means that you have to follow the dispatcher model, which dedicates one thread for the entire lifecycle of your application to interact with MXNet. It's important that you don't just use a mutex, since we depend on the thread-local variables that are assigned to the dispatcher thread.


Thanks for the reply; I did not make it clear before. We start 8 processes on one machine, and only 1 thread per process uses MXNet (the other threads handle different work). We call the Python engine from a C++ program that uses the mxnet Python API. Is there a problem with this usage?

@marcoabreu
Contributor

marcoabreu commented Oct 23, 2018

That sounds good to me. Could you maybe show a minimal example that allows us to reproduce the problem?

I'll let somebody else follow up on your issue since we're now getting into Python API territory.

@andrewfayres
Contributor

@mxnet-label-bot [Python, Thread Safety]

@coconutyao
Author

> That sounds good to me. Could you maybe show a minimal example that allows us to reproduce the problem?
>
> I'll let somebody else follow up on your issue since we're now getting into Python API territory.

Hello, sorry for the late reply. The problem no longer occurs after changing the MXNet engine type from the default ThreadedEnginePerDevice to ThreadedEngine. I hope this gives you some clues.
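
For anyone who hits the same thing, this is roughly how we switch engines (an illustrative sketch; the key point is that MXNET_ENGINE_TYPE must be set before libmxnet.so is loaded):

```cpp
// Illustrative sketch: select the MXNet engine via MXNET_ENGINE_TYPE before
// the embedded interpreter imports mxnet (and therefore loads libmxnet.so).
#include <Python.h>
#include <cstdlib>

int main() {
    // Valid values include ThreadedEnginePerDevice (default), ThreadedEngine,
    // and NaiveEngine; we switched to ThreadedEngine.
    setenv("MXNET_ENGINE_TYPE", "ThreadedEngine", /*overwrite=*/1);

    Py_Initialize();
    // mxnet reads the environment variable when its engine initializes.
    PyRun_SimpleString(
        "import mxnet as mx\n"
        "print(mx.nd.ones((2, 2)).asnumpy())\n");
    Py_Finalize();
    return 0;
}
```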

@piyushghai
Contributor

@coconutyao Good to see that your issue was resolved. I'm closing this issue. Please feel free to re-open if closed in error.

@lanking520 Can you please close this issue?

Thanks!

@lanking520
Member

@coconutyao Closing this issue for now. Please feel free to reopen it if you are still facing the problem.
