uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument #309
Comments
This might be due to module unloading and de-allocation of things after module destruction. I am not sure exactly what it could be, though. |
Hmm... that could be possible. But even if I disabled all calls to |
No, there is no need to call other things. MXNotifyShutdown is only a hint for the engine, and it is not required either. If the error is reproducible, can you check exactly where the lock is? |
OK, I managed to get some back trace:
|
The error is always reproducible and always the same. If I comment out the chunk at
|
Further commenting out the function body at |
It seems this was due to the engine singleton getting deleted before the resource manager, which is not supposed to happen because of https://github.com/dmlc/mxnet/blob/master/src/resource.cc#L39. We have a way to get the shared pointer from the engine, to ensure the engine is always deleted after the resource manager. Can you check whether the destructor of the engine runs before or after this destructor? |
Here is some more background. The singleton pattern in C++ can cause problems, especially with destruction order. In this case, if the engine singleton gets destructed before the resource manager singleton, any later call to Engine::Get() is undefined behavior. To prevent that, the current implementation places the singleton in a shared pointer and lets the resource manager hold a reference to that shared pointer, so the engine won't be destructed before the resource manager is. |
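To make the idea concrete, here is a minimal sketch of that pattern. It is not MXNet's actual code: `Engine`, `ResourceManager`, `DestructionLog`, and `SimulateShutdown` are simplified, hypothetical stand-ins, and the "shutdown" is simulated with local handles instead of real static objects so the order can be observed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Records destructor calls in order. Hypothetical helper, not part of MXNet.
inline std::vector<std::string>& DestructionLog() {
  static std::vector<std::string> log;
  return log;
}

// Simplified stand-in for the engine (assumption, not the real class).
class Engine {
 public:
  ~Engine() { DestructionLog().push_back("Engine"); }
  void Push() { ++pushed_; }

 private:
  int pushed_ = 0;
};

// Simplified stand-in for the resource manager. It holds a shared_ptr to
// the engine, so the engine cannot be destroyed while this object is alive.
class ResourceManager {
 public:
  explicit ResourceManager(std::shared_ptr<Engine> engine)
      : engine_(std::move(engine)) {}

  ~ResourceManager() {
    engine_->Push();  // safe: engine_ still keeps the engine alive here
    DestructionLog().push_back("ResourceManager");
  }

 private:
  std::shared_ptr<Engine> engine_;
};

// Simulates a shutdown in which the engine's own handle is released first.
inline std::vector<std::string> SimulateShutdown() {
  DestructionLog().clear();
  auto engine = std::make_shared<Engine>();
  auto manager = std::make_unique<ResourceManager>(engine);

  engine.reset();                    // the "singleton" handle goes away first
  assert(DestructionLog().empty());  // the manager's reference keeps it alive

  manager.reset();  // manager destructor body runs, then the engine is freed
  return DestructionLog();
}
```

Calling `SimulateShutdown()` returns the log with "ResourceManager" first and "Engine" second, even though the engine's own handle was dropped first. Anything that touches the engine during its own teardown needs to hold such a reference for the guarantee to apply.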
Yes, thanks for the background. But it seems the destructor of the engine is not called before the destructor of the resource manager. I inserted |
And I could still print out the non-NULL address of the |
So far I have traced it to this line, which tries to allocate a new operator: https://github.com/dmlc/mxnet/blob/master/src/engine/threaded_engine.cc#L190 |
OK, the actual mutex error happens at this line: |
I think I found the bug:
Note the object |
OK, this was another singleton destruction problem. Thanks for pointing this out. |
While testing a simple MNIST example with the Julia binding, I noticed the following error when the program finishes running.
Before I try to dig into what is happening here, does this kind of error look familiar to anyone? Here is some more information: