This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

NaiveEngine asynchronous error in multi-threading #8966

Open
xinghedyc opened this issue Dec 6, 2017 · 5 comments
Labels
Backend Issues related to the backend of MXNet Feature request Thread Safety

Comments

@xinghedyc

xinghedyc commented Dec 6, 2017

When I use NaiveEngine with OpenMP multi-threading, I bind two executors, one on gpu(0) and one on gpu(1), and call exe.forward on them in parallel. However, I get the following error when running the program:

[08:56:51] /data1/yuchendai/lpr_sdk/mxnet/dmlc-core/include/dmlc/logging.h:308: [08:56:51] src/engine/naive_engine.cc:169: Check failed: this->req_completed_ NaiveEngine only support synchronize Push so far

Stack trace returned 10 entries:
[bt] (0) ./bin/test(_ZN4dmlc15LogMessageFatalD1Ev+0x30) [0x45d614]
[bt] (1) /data1/yuchendai/lpr_sdk/mxnet/lib/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x3b4) [0x7f3001f9f8f4]
[bt] (2) /data1/yuchendai/lpr_sdk/mxnet/lib/libmxnet.so(_ZN5mxnet6engine11NaiveEngine4PushEPNS0_3OprENS_7ContextEib+0xad) [0x7f3001f9fd4d]
[bt] (3) /data1/yuchendai/lpr_sdk/mxnet/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor6RunOpsEbmm+0x584) [0x7f3002014dc4]
[bt] (4) /data1/yuchendai/lpr_sdk/mxnet/lib/libmxnet.so(MXExecutorForward+0x15) [0x7f3001fd8935]
[bt] (5) ./bin/test(_ZN5mxnet3cpp8Executor7ForwardEb+0x45) [0x47a0f3]
[bt] (6) ./bin/test(_ZN9PLATE_OCR16CustomizationOcr7forwardERKSsRSt6vectorINS_4WordESaIS4_EE+0x30b) [0x4793d9]
[bt] (7) ./bin/test(_ZNK9PLATE_OCR25CustomizationOcrInterface9RecognizeERKSsRNS_12RecognizeResEi+0x53) [0x45cbff]
[bt] (8) ./bin/test() [0x458c77]
[bt] (9) /lib64/libgomp.so.1(+0xdde5) [0x7f2ffd96ade5]

Why does NaiveEngine have asynchronous operations?
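
One plausible reading of the failed check, sketched as a toy model rather than MXNet's actual code: the message names a member flag, `this->req_completed_`, so if a single engine object is shared by all calling threads, one thread can reset the flag between another thread's completion and its check. `ToyNaiveEngine`, `BeginPush`, and the `before_check` hook are hypothetical names used only for illustration; the hook stands in for a second thread being scheduled at that point, so the interleaving is reproduced deterministically without real threads.

```cpp
#include <functional>

// Hypothetical toy model of an engine that tracks completion in one
// member flag shared by every caller. Not MXNet's actual NaiveEngine.
class ToyNaiveEngine {
 public:
  // Runs `op` synchronously and returns whether the completion flag is
  // still set when it is checked. `before_check` simulates another
  // thread running between completion and the check.
  bool Push(const std::function<void()>& op,
            const std::function<void()>& before_check = nullptr) {
    req_completed_ = false;   // start of this push
    op();                     // the operation itself
    req_completed_ = true;    // completion callback fires
    if (before_check) before_check();
    return req_completed_;    // the "Check failed" condition
  }

  // First step of a push issued by a second (simulated) thread.
  void BeginPush() { req_completed_ = false; }

 private:
  bool req_completed_ = true;
};

// A single caller never observes a cleared flag.
bool SequentialPushSucceeds() {
  ToyNaiveEngine engine;
  return engine.Push([] {});
}

// A second caller starting its push at the wrong moment clears the flag
// before the first caller checks it.
bool InterleavedPushSucceeds() {
  ToyNaiveEngine engine;
  return engine.Push([] {}, [&engine] { engine.BeginPush(); });
}
```

Under this model the operations are still synchronous; the check fails only because the flag is not private to each caller.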

@xinghedyc xinghedyc changed the title NaiveEngine error in multi-threading NaiveEngine asynchronous error in multi-threading Dec 6, 2017
@goswamig
Contributor

Please add labels: "Feature request", "Thread Safety"

@marcoabreu marcoabreu added Backend Issues related to the backend of MXNet and removed C++ Related to C++ labels Jul 17, 2018
@mseeger
Contributor

mseeger commented Aug 22, 2018

@KellenSunderland

@mseeger
Contributor

mseeger commented Aug 29, 2018

I am getting the same error when running some fairly benign code on a normal CPU instance (m4.xlarge, Ubuntu Deep Learning AMI). The code binds several executors in sequence. The error occurs only with NaiveEngine, and it does not occur on my Mac.

@rosenrodt

It seems that, when running with NaiveEngine, the internal memory pool tries to call cudaFree() on the same resource assigned to two or more MXNet instances. The memory pool is supposed to be a thread-local singleton, so that the MXNet instances spawned by different threads do not contend for the same resource, but with NaiveEngine this is apparently not the case. I get all sorts of errors, such as CUDA invalid pointer errors and, eventually, cuBLAS failures.
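
To illustrate the distinction drawn here, the following is a minimal sketch of a thread-local singleton next to a process-wide one. `ToyPool` and the accessor names are hypothetical and do not correspond to MXNet's storage classes; the sketch only shows that a `thread_local` instance gives each thread its own object, while a plain `static` instance is shared by all threads.

```cpp
#include <thread>

// Hypothetical stand-in for a memory pool; not an MXNet type.
struct ToyPool {
  int placeholder = 0;
};

// Thread-local singleton: each thread that calls this gets its own pool.
ToyPool* ThreadLocalPool() {
  static thread_local ToyPool pool;
  return &pool;
}

// Process-wide singleton: every thread gets the same pool.
ToyPool* ProcessWidePool() {
  static ToyPool pool;
  return &pool;
}

// Returns true if a second thread sees a different pool object than the
// calling thread does. Only the addresses are compared.
bool PoolsDifferAcrossThreads(ToyPool* (*getter)()) {
  ToyPool* mine = getter();
  ToyPool* theirs = nullptr;
  std::thread worker([&theirs, getter] { theirs = getter(); });
  worker.join();
  return mine != theirs;
}
```

If the pool behaves like `ProcessWidePool` under NaiveEngine, two threads could end up freeing the same allocation, which would be consistent with the invalid-pointer errors described above.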

@apeforest
Contributor

Why not just use the ThreadedEngine (default one) for multithreaded inference?
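
For reference, the engine is selected through the `MXNET_ENGINE_TYPE` environment variable, whose documented values are `NaiveEngine`, `ThreadedEngine`, and `ThreadedEnginePerDevice`. The sketch below sets it from C++ using POSIX `setenv`. The helper names are hypothetical, and it is an assumption here that the variable must be set before the first MXNet call in the process for the setting to take effect.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: sets MXNET_ENGINE_TYPE for the current process.
// Assumption: call this before any MXNet API is used.
bool SelectEngine(const std::string& engine_type) {
  // setenv is POSIX; the final argument 1 overwrites an existing value.
  return setenv("MXNET_ENGINE_TYPE", engine_type.c_str(), 1) == 0;
}

// Hypothetical helper: reads back the current setting ("" if unset).
std::string CurrentEngineSetting() {
  const char* value = std::getenv("MXNET_ENGINE_TYPE");
  return value ? std::string(value) : std::string();
}
```

The same effect can be had without code by exporting the variable in the shell before launching the program.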
