
MXNetError after first detection and recognition #415

Closed
WIll-Xu35 opened this issue Oct 29, 2018 · 10 comments

Comments

@WIll-Xu35

Hi all,

I am trying to apply this repository to a server-side face registration and recognition service. I tried detecting faces and generating embeddings for all the photos in a directory (containing 300 photos) and it worked fine. However, when I attach it to my local server code, it can only perform a single face detection and embedding generation. Once a second detection is called, it raises an error.

My configuration: CUDA 9.0, MXNet 1.3.0, cuDNN 7, Python 2.7.

Detailed error messages here:

File "server.py", line 118, in login
login_res, message = face_verification(file_path, regis_path, username)
File "server.py", line 14, in face_verification
result, data = server_function.verify(embedding_dir, photo_dir, login_id)
File "/home/wenbin/project/mxnet_faceID/server_function.py", line 88, in verify
img_tmp = model.get_input(image)
File "/home/wenbin/project/mxnet_faceID/face_model.py", line 71, in get_input
ret = self.detector.detect_face(face_img, det_type = self.args.det)
File "/home/wenbin/project/mxnet_faceID/mtcnn_detector.py", line 493, in detect_face
output = self.LNet.predict(input_buf)
File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/model.py", line 717, in predict
o_list.append(o_nd[0:real_size].asnumpy())
File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/base.py", line 210, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:00:54] src/operator/nn/./cudnn/cudnn_convolution-inl.h:156: Check failed: e == CUDNN_STATUS_SUCCESS (7 vs. 0) cuDNN: CUDNN_STATUS_MAPPING_ERROR

Stack trace returned 10 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f1372ee4dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f1372ee5938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::op::CuDNNConvolutionOp::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x389) [0x7f1377346829]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::op::ConvolutionCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xbfc) [0x7f137733bbec]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x59) [0x7f13754883f9]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(+0x317c8d3) [0x7f13754348d3]
[bt] (6) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f1375a92185]
[bt] (7) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f1375aa931b]
[bt] (8) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f1375aa958e]
[bt] (9) /home/wenbin/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f1375a9178a]

[11:00:54] src/resource.cc:262: Ignore CUDA Error [11:00:54] src/storage/./pooled_storage_manager.h:85: CUDA: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f1372ee4dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f1372ee5938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x95) [0x7f1375ab5815]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFree(mxnet::Storage::Handle)+0x3d) [0x7f1375ab81bd]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::StorageImpl::DirectFree(mxnet::Storage::Handle)+0x68) [0x7f1375ab1418]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::resource::ResourceManagerImpl::ResourceTempSpace::~ResourceTempSpace()::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0xff) [0x7f1375b8090f]
[bt] (6) /home/wenbin/mxnet/lib/libmxnet.so(+0x37dfe01) [0x7f1375a97e01]
[bt] (7) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f1375a92185]
[bt] (8) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x65) [0x7f1375aad085]
[bt] (9) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x1b0) [0x7f1375a98400]

Any help would be appreciated!

@WIll-Xu35
Author

WIll-Xu35 commented Oct 30, 2018

Another error that can occur looks like this:

terminate called after throwing an instance of 'dmlc::Error'
what(): [15:44:39] src/engine/./threaded_engine.h:379: array::at: __n (which is 18) >= _Nm (which is 7)
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 9 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f4f5fe2ddcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f4f5fe2e938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xfa9) [0x7f4f629db849]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f4f629f231b]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f4f629f258e]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f4f629da78a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f4f77694c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f4f960066ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4f95d3c41d]
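
For anyone following the debugging hint in that message, here is a minimal sketch of forcing the NaiveEngine. The only assumption is that the variable is set before mxnet is imported, so it takes effect:

import os

# Debug-only: force synchronous execution so the failing operator shows up
# directly in the Python backtrace. Much slower; unset it after debugging.
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

import mxnet as mx  # import after setting the variable so it takes effect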

@diggerdu

diggerdu commented Nov 8, 2018

Same problem, unpredictable occurrence.

@WIll-Xu35
Author

@diggerdu I solved the problem by ensuring that only a single thread calls the initialized MXNet model. If multiple threads call the same model, this kind of error can occur.

To be more specific, my server script uses Flask, which by default enables multithreading to handle incoming requests. After I set the threading parameter to false, everything works perfectly.

You can also find some more information here: apache/mxnet#3946

Hope this will help you solve your problem.
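
For reference, here is a minimal sketch of the pattern. The model is reduced to a trivial MXNet forward pass just to keep the example self-contained; in the real server it would be this repo's face_model.FaceModel with its get_input/get_feature calls:

# Minimal sketch: create the MXNet-backed model once at startup and serve it
# from a single-threaded Flask app, so only one thread ever calls into
# MXNet/cuDNN. The "model" below is a trivial stand-in, not the real wrapper.
from flask import Flask, jsonify
import mxnet as mx

app = Flask(__name__)
ctx = mx.gpu(0)                        # or mx.cpu() if no GPU is available
weight = mx.nd.ones((4, 4), ctx=ctx)   # stand-in for the loaded network

@app.route('/embed', methods=['POST'])
def embed():
    # Stand-in for the detect_face + embedding calls on the uploaded image.
    x = mx.nd.ones((1, 4), ctx=ctx)
    y = mx.nd.dot(x, weight)
    return jsonify(embedding=y.asnumpy().tolist())

if __name__ == '__main__':
    # threaded=False: Flask handles requests one at a time in a single thread,
    # so the model/context is never touched by two threads at once.
    app.run(threaded=False)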

@fanqie03

Thank you very much @WIll-Xu35

@ideaRunner

Hi @WIll-Xu35, thank you.
You are my lifesaver!

@Luvata

Luvata commented Sep 26, 2019

Thank you @WIll-Xu35, I've been struggling all afternoon because of this error :(
To be more specific:

app.run(threaded=False)
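
If the server has to stay multithreaded for other routes, another option (an untested sketch, not something from this thread) is to keep threading enabled but serialize just the model calls with a lock, assuming the face_model.FaceModel wrapper with the get_input/get_feature methods referenced in the traceback above:

import threading

model_lock = threading.Lock()   # shared by all request-handling threads

def get_embedding(model, img):
    # Only one thread at a time may call into the MXNet model.
    with model_lock:
        aligned = model.get_input(img)       # MTCNN detection + alignment
        return model.get_feature(aligned)    # ArcFace embedding

Whether this is sufficient depends on how the asynchronous MXNet engine behaves under the hood, so the single-threaded app.run(threaded=False) route above is the safer option.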

@Zenvi

Zenvi commented Feb 5, 2020

(quoting @WIll-Xu35's solution above)

You're amazing, I bow to you!

@cydawn

cydawn commented Jun 4, 2020

(quoting @WIll-Xu35's solution above)

Nice job!

@yeyupiaoling

(quoting @WIll-Xu35's solution and @Luvata's app.run(threaded=False) comment above)

Thanks a lot, this perfectly solved my problem.

@hhsummerwind

(quoting @WIll-Xu35's solution and @Luvata's app.run(threaded=False) comment above)

Thank you!! It works for me!!
