Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument #309

Closed
pluskid opened this issue Oct 17, 2015 · 15 comments · Fixed by #312
Closed

Comments

@pluskid
Copy link
Contributor

pluskid commented Oct 17, 2015

While testing on a simple MNIST example on the Julia binding, I noticed the following error while the program finish running.

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

signal (6): Abort trap: 6
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)

Before trying to dig out what is happening here, is this kind of error look familiar to anyone? Here are some more information:

  • It only happens under Mac OS X. I also see this error on the unittest in travis OS X node, but locally unittest will not trigger this error. I have never see this error under Linux.
  • It occurs only when the program finished running.
@tqchen
Copy link
Member

tqchen commented Oct 17, 2015

Might due to module unloading and de-allocation of things after module destruction. I am not sure exactly what it could be though.

@pluskid
Copy link
Contributor Author

pluskid commented Oct 17, 2015

Hmm... that could be possible. But even if I disabled all calls to MX***Free destructors, I still get the same error. Is there anything else that should be called other than MXNotifyShutdown to properly shutdown everything?

@tqchen
Copy link
Member

tqchen commented Oct 17, 2015

No, there is no need to call other things, MXNotifyShutdown is only a hint for the engine, which is also not needed, if the thing is reproducible , can you check where the lock exactly is?

@pluskid
Copy link
Contributor Author

pluskid commented Oct 17, 2015

OK, I managed to get some back trace:

* thread #1: tid = 0x38acd5, 0x00007fff8b9e80ae libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff8b9e80ae libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff95517500 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff9081137b libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff8f77ff81 libc++abi.dylib`abort_message + 257
    frame #4: 0x00007fff8f7a5a2f libc++abi.dylib`default_terminate_handler() + 243
    frame #5: 0x00007fff9879e0a2 libobjc.A.dylib`_objc_terminate() + 124
    frame #6: 0x00007fff8f7a319e libc++abi.dylib`std::__terminate(void (*)()) + 8
    frame #7: 0x00007fff8f7a3213 libc++abi.dylib`std::terminate() + 51
    frame #8: 0x0000000314811286 libmxnet.so`__clang_call_terminate + 22
    frame #9: 0x000000031484be67 libmxnet.so`mxnet::resource::ResourceManagerImpl::ResourceRandom<mshadow::cpu>::~ResourceRandom(this=0x000000032d2de7f0) + 263 at resource.cc:112
    frame #10: 0x000000031484bd55 libmxnet.so`mxnet::resource::ResourceManagerImpl::ResourceRandom<mshadow::cpu>::~ResourceRandom(this=0x000000032d2de7f0) + 21 at resource.cc:110
    frame #11: 0x000000031485f368 libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl() [inlined] std::__1::default_delete<mxnet::resource::ResourceManagerImpl::ResourceRandom<mshadow::cpu> >::operator(this=0x0000000314bf8df8, __ptr=0x000000032d2de7f0)(mxnet::resource::ResourceManagerImpl::ResourceRandom<mshadow::cpu>*) const + 39 at memory:2459
    frame #12: 0x000000031485f341 libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl() [inlined] std::__1::unique_ptr<mxnet::resource::ResourceManagerImpl::ResourceRandom<mshadow::cpu>, std::__1::default_delete<mxnet::resource::ResourceManagerImpl::ResourceRandom<mshadow::cpu> > >::reset(this=0x0000000314bf8df8, __p=0x0000000000000000) + 166 at memory:2658
    frame #13: 0x000000031485f29b libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl(this=0x0000000314bf8dd0) + 75 at resource.cc:33
    frame #14: 0x0000000314842a45 libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl(this=0x0000000314bf8dd0) + 21 at resource.cc:31
    frame #15: 0x00007fff908120ff libsystem_c.dylib`__cxa_finalize_ranges + 345
    frame #16: 0x00007fff90812403 libsystem_c.dylib`exit + 55
    frame #17: 0x0000000100000acb julia`start + 59

@pluskid
Copy link
Contributor Author

pluskid commented Oct 17, 2015

The error is always reproducible and always the same. If I comment out the chunk at resource.cc:112, then the error becomes:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
Process 69138 stopped
* thread #1: tid = 0x38e508, 0x00007fff8b9e80ae libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff8b9e80ae libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff8b9e80ae <+10>: jae    0x7fff8b9e80b8            ; <+20>
    0x7fff8b9e80b0 <+12>: movq   %rax, %rdi
    0x7fff8b9e80b3 <+15>: jmp    0x7fff8b9e33ef            ; cerror_nocancel
    0x7fff8b9e80b8 <+20>: retq
(lldb) bt
* thread #1: tid = 0x38e508, 0x00007fff8b9e80ae libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff8b9e80ae libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff95517500 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff9081137b libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff8f77ff81 libc++abi.dylib`abort_message + 257
    frame #4: 0x00007fff8f7a5a2f libc++abi.dylib`default_terminate_handler() + 243
    frame #5: 0x00007fff9879e0a2 libobjc.A.dylib`_objc_terminate() + 124
    frame #6: 0x00007fff8f7a319e libc++abi.dylib`std::__terminate(void (*)()) + 8
    frame #7: 0x00007fff8f7a3213 libc++abi.dylib`std::terminate() + 51
    frame #8: 0x0000000313911676 libmxnet.so`__clang_call_terminate + 22
    frame #9: 0x000000031394cfe4 libmxnet.so`mxnet::resource::ResourceManagerImpl::ResourceTempSpace<mshadow::cpu>::~ResourceTempSpace(this=0x00000003072bd980) + 548 at resource.cc:160
    frame #10: 0x000000031394cdb5 libmxnet.so`mxnet::resource::ResourceManagerImpl::ResourceTempSpace<mshadow::cpu>::~ResourceTempSpace(this=0x00000003072bd980) + 21 at resource.cc:152
    frame #11: 0x000000031395c8a2 libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl() [inlined] std::__1::default_delete<mxnet::resource::ResourceManagerImpl::ResourceTempSpace<mshadow::cpu> >::operator(this=0x0000000313cf5c50, __ptr=0x00000003072bd980)(mxnet::resource::ResourceManagerImpl::ResourceTempSpace<mshadow::cpu>*) const + 39 at memory:2459
    frame #12: 0x000000031395c87b libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl() [inlined] std::__1::unique_ptr<mxnet::resource::ResourceManagerImpl::ResourceTempSpace<mshadow::cpu>, std::__1::default_delete<mxnet::resource::ResourceManagerImpl::ResourceTempSpace<mshadow::cpu> > >::reset(this=0x0000000313cf5c50, __p=0x0000000000000000) + 156 at memory:2658
    frame #13: 0x000000031395c7df libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl(this=0x0000000313cf5c20) + 335 at resource.cc:34
    frame #14: 0x0000000313942e35 libmxnet.so`mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl(this=0x0000000313cf5c20) + 21 at resource.cc:31
    frame #15: 0x00007fff908120ff libsystem_c.dylib`__cxa_finalize_ranges + 345
    frame #16: 0x00007fff90812403 libsystem_c.dylib`exit + 55
    frame #17: 0x0000000100000acb julia`start + 59

@pluskid
Copy link
Contributor Author

pluskid commented Oct 17, 2015

Further commenting out the function body at resource.cc:160, then the error disappeared.

@tqchen
Copy link
Member

tqchen commented Oct 17, 2015

Seems was due to engine singleton get deleted before the resource manager. Which is not suppose to be so due to https://github.com/dmlc/mxnet/blob/master/src/resource.cc#L39

We had a way to get the shared pointer from engine, to ensure it is always deleted after the resource manager.

Can you check if the destructor of engine happens before or after this destructor ?

It is here
https://github.com/dmlc/mxnet/blob/master/src/engine/threaded_engine_perdevice.cc#L47

@tqchen
Copy link
Member

tqchen commented Oct 17, 2015

Here are some more background. The singleton pattern in C++ can have potential problem, especially in terms of destructing order.

For example, in this example, if engine singleton get destructed before resource manager singleton, then this can lead to an undefined behavior of Engine::Get()

To prevent that, the current implementation place the singleton in a shared pointer, and allow resource manager to get reference of that shared ptr, so engine won't get destructed before resource manager did

@pluskid
Copy link
Contributor Author

pluskid commented Oct 17, 2015

Yes, thanks for the backgrounds. But it seems the destructor of the engine is not called before the destructor of the resource manager. I inserted printf at the front of both destructors. The destructor ~ResourceManagerImpl() is being called, but I did not see the destructor ~ThreadedEnginePerDevice() printed yet.

@pluskid
Copy link
Contributor Author

pluskid commented Oct 18, 2015

And I could still print out non-NULL the address of the Engine object in front of the resource manager destructor. So it seems at least the shared_ptr holding the engine has not reset yet.

@pluskid
Copy link
Contributor Author

pluskid commented Oct 18, 2015

Currently I traced it to this line, trying to allocate new operator: https://github.com/dmlc/mxnet/blob/master/src/engine/threaded_engine.cc#L190

@pluskid
Copy link
Contributor Author

pluskid commented Oct 18, 2015

OK, the actually error on mutex happens at this line:
https://github.com/dmlc/mxnet/blob/master/src/common/object_pool.h#L121

@pluskid
Copy link
Contributor Author

pluskid commented Oct 18, 2015

I think I found the bug:

~ObjectPool() this = 0x3163f5990
~ObjectPool() this = 0x3163f5920
~ResourceManagerImpl()

...

* thread #1: tid = 0x398e65, 0x0000000316075047 libmxnet.so`mxnet::engine::ThreadedOpr* mxnet::common::ObjectPool<mxnet::engine::ThreadedOpr>::New<>(this=0x00000003163f5920) + 23 at object_pool.h:123, queue = 'com.apple.main-thread', stop reason = step in
    frame #0: 0x0000000316075047 libmxnet.so`mxnet::engine::ThreadedOpr* mxnet::common::ObjectPool<mxnet::engine::ThreadedOpr>::New<>(this=0x00000003163f5920) + 23 at object_pool.h:123
   120  T* ObjectPool<T>::New(Args&&... args) {
   121    LinkedList* ret;
   122    {
-> 123      std::lock_guard<std::mutex> lock{m_};
   124      if (head_->next == nullptr) {
   125        AllocateChunk();
   126      }
(lldb) p this
(mxnet::common::ObjectPool<mxnet::engine::ThreadedOpr> *) $4 = 0x00000003163f5920

Note the object ObjectPool<mxnet::engine::ThreadedOpr> at 3163f5920 is used after it get destructed.

@tqchen
Copy link
Member

tqchen commented Oct 18, 2015

OK, this was another singleton destructing problem. Thanks to point this out

@tqchen
Copy link
Member

tqchen commented Oct 18, 2015

This can be fixed by the shared pointer trick and let the engine hold the ref to the obj pool, @pluskid do you want to make a PR on this?

Also cc @hotpxl who was original author of this code

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants