This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

ThreadedOpr::Delete(threaded_opr) being called multiple times #6039

Closed
sifmelcara opened this issue Apr 29, 2017 · 3 comments

Comments

@sifmelcara
Contributor

Environment info

Operating System: Linux
Compiler: gcc 5.4.0
Package used (Python/R/Scala/Julia): C++
MXNet version: 703e8ee (current master)

Error Message:

segmentation fault or free: invalid pointer

Steps to reproduce

  1. Set the CUDA, cuDNN, OpenBLAS, and OpenCV flags, and USE_CPP_PACKAGE = 1, in config.mk
  2. Run make -j4
  3. Run LD_LIBRARY_PATH=$LD_LIBRARY_PATH:lib valgrind ./build/cpp-package/example/lenet
  4. A memory error is detected after training starts.

Without Valgrind, the segmentation fault appears at random, so I used Valgrind to catch the offending code as soon as the memory error occurs.

After running Valgrind, I found that the segmentation fault is caused by ThreadedOpr::Delete(threaded_opr) being called multiple times, which makes the destructor execute twice and (possibly) corrupts the object pool.
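
To illustrate why a repeated Delete can damage a pool, here is a minimal sketch of a free-list pool that tracks live pointers and rejects a second Delete. Opr, CheckedPool, and SecondDeleteIsRejected are hypothetical names made up for this example; this is not MXNet's ObjectPool or ThreadedOpr, and the sketch is single-threaded.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an engine operation; not MXNet's ThreadedOpr.
struct Opr {
  int id = 0;
};

// Toy pool that recycles Opr slots through a free list and remembers which
// slots are currently handed out, so a repeated Delete can be rejected.
class CheckedPool {
 public:
  Opr* New() {
    Opr* p = nullptr;
    if (!free_.empty()) {
      p = free_.back();
      free_.pop_back();
    } else {
      storage_.push_back(std::unique_ptr<Opr>(new Opr()));
      p = storage_.back().get();
    }
    live_.insert(p);
    return p;
  }

  // Returns false instead of recycling the slot when p is not live,
  // i.e. it was already deleted or never came from this pool.
  bool Delete(Opr* p) {
    if (live_.erase(p) == 0) return false;
    free_.push_back(p);
    return true;
  }

  std::size_t free_slots() const { return free_.size(); }

 private:
  std::vector<std::unique_ptr<Opr>> storage_;  // owns every slot
  std::vector<Opr*> free_;                     // slots ready for reuse
  std::unordered_set<Opr*> live_;              // slots currently handed out
};

// Deletes the same pointer twice and reports whether the second call was caught.
bool SecondDeleteIsRejected() {
  CheckedPool pool;
  Opr* op = pool.New();
  bool first = pool.Delete(op);
  bool second = pool.Delete(op);
  return first && !second && pool.free_slots() == 1;
}
```

Without the live_ check, this toy pool would push the same slot onto its free list twice, and two later New() calls would hand out the same pointer to different owners.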

Valgrind log: https://gist.github.com/sifmelcara/cef9f8a4d7e4d7f8de1520419476f4b0#file-log-L127

I also observed that placing a mutex in ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) prevents the segmentation fault: Valgrind reports no errors and MXNet no longer crashes, although I believe this is not the correct solution.
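
The mutex workaround can be sketched roughly as follows. ToyEngine, its members, and RunWorkers are hypothetical and stand in for the real engine; the actual ThreadedEngine::OnComplete does more work than this. The sketch only shows the shape of serializing the whole completion path behind one lock.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical engine stand-in; not MXNet's ThreadedEngine.
class ToyEngine {
 public:
  // The entire completion path runs under one lock, so two worker threads
  // can never execute it at the same time.
  void OnComplete(int opr_id) {
    std::lock_guard<std::mutex> lock(mu_);
    completed_.push_back(opr_id);
  }

  std::size_t completed_count() {
    std::lock_guard<std::mutex> lock(mu_);
    return completed_.size();
  }

 private:
  std::mutex mu_;
  std::vector<int> completed_;  // shared state touched by the callback
};

// Starts num_threads workers that each report per_thread completions,
// then returns how many completions the engine recorded.
std::size_t RunWorkers(int num_threads, int per_thread) {
  ToyEngine engine;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&engine, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        engine.OnComplete(t * per_thread + i);
      }
    });
  }
  for (auto& w : workers) w.join();
  return engine.completed_count();
}
```

A coarse lock like this removes concurrency from the callback, which may hide a race elsewhere rather than fix it, and it serializes every completion.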

I traced the code but still have no idea why the OnComplete callback would be called more than once with the same threaded_opr, so I decided to open this issue in the hope that someone can help resolve the problem. Thank you very much.
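
One possible debugging aid for narrowing down a callback that fires twice for the same object is an atomic "completed" flag per operation. This is only a sketch: TracedOpr, MarkCompletedOnce, and CountFirstCompletions are hypothetical names, not part of MXNet, and in a pooled design the flag would have to be reset whenever the object is reused.

```cpp
#include <atomic>

// Hypothetical operation record carrying a one-shot completion flag.
struct TracedOpr {
  std::atomic<bool> completed{false};
};

// Returns true only for the first caller. exchange() atomically sets the
// flag and returns its previous value, so at most one thread sees false.
bool MarkCompletedOnce(TracedOpr* op) {
  return !op->completed.exchange(true);
}

// Calls MarkCompletedOnce `calls` times on one operation and counts how
// many of those calls were treated as the first completion.
int CountFirstCompletions(int calls) {
  TracedOpr op;
  int firsts = 0;
  for (int i = 0; i < calls; ++i) {
    if (MarkCompletedOnce(&op)) ++firsts;
  }
  return firsts;
}
```

A completion handler could log or abort when MarkCompletedOnce returns false, which pinpoints the second call at the moment it happens instead of at a later crash.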

@conopt
Contributor

conopt commented May 2, 2017

I tried GCC 4.8.5, 5.3.1 and 6.2.1. Cannot reproduce...

@sifmelcara
Contributor Author

@lx75249 Could you try stressing the machine with something like stress -i 10000 while training the network?
I have tested this issue on two machines. One of them does not hit the segmentation fault under normal conditions, but MXNet crashes immediately when the system is under heavy load.
Thank you for testing this out.

@sifmelcara
Contributor Author

@lx75249 Thank you. There is no need to test it again, since I found the bug and opened PR #6084.
