ThreadedOpr::Delete(threaded_opr) being called multiple times #6039

sifmelcara · 2017-04-29T14:59:47Z

Environment info

Operating System: Linux
Compiler: gcc 5.4.0
Package used (Python/R/Scala/Julia): C++
MXNet version: 703e8ee (current master)

Error Message:

segmentation fault or free: invalid pointer

Steps to reproduce

Set the CUDA, cuDNN, OpenBLAS, and OpenCV flags and USE_CPP_PACKAGE = 1 in config.mk.
Run make -j4.
Run LD_LIBRARY_PATH=$LD_LIBRARY_PATH:lib valgrind ./build/cpp-package/example/lenet
A memory error is detected after training starts.

Without Valgrind, the segmentation fault appears randomly, so I used Valgrind to catch the code that causes the segmentation fault as soon as it happens.

After running Valgrind, I found that the segmentation fault is caused by
ThreadedOpr::Delete(threaded_opr)
being called multiple times, which makes the destructor execute twice and (possibly) corrupts the object pool.
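
To make the failure mode concrete, here is a minimal standalone sketch of a pool that notices when the same object is released twice. It is not MXNet code: PooledOpr, CheckedPool, and CountRepeatedDeletes are hypothetical names invented for illustration, and the real object pool may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a pooled operator record; not MXNet's ThreadedOpr.
struct PooledOpr {
  int id;
};

// Hypothetical pool that owns its storage and tracks which objects are live,
// so a repeated Delete is reported instead of being processed a second time.
class CheckedPool {
 public:
  PooledOpr* New(int id) {
    storage_.push_back(std::unique_ptr<PooledOpr>(new PooledOpr{id}));
    PooledOpr* p = storage_.back().get();
    live_.insert(p);
    return p;
  }

  // Returns true if the object was live and is now released,
  // false if it had already been released (a double delete).
  bool Delete(PooledOpr* p) {
    if (live_.erase(p) == 0) {
      ++double_deletes_;
      return false;
    }
    return true;
  }

  std::size_t double_deletes() const { return double_deletes_; }

 private:
  std::vector<std::unique_ptr<PooledOpr>> storage_;  // memory stays valid until the pool dies
  std::unordered_set<PooledOpr*> live_;              // objects currently handed out
  std::size_t double_deletes_ = 0;
};

// Releases the same object twice and returns how many repeats the pool caught.
std::size_t CountRepeatedDeletes() {
  CheckedPool pool;
  PooledOpr* opr = pool.New(1);
  bool first = pool.Delete(opr);
  bool second = pool.Delete(opr);  // the repeated call described above
  assert(first && !second);
  return pool.double_deletes();
}
```

A pool without this bookkeeping would typically push the same slot onto its free list twice, which is one plausible way to end up with a later "free: invalid pointer" or segmentation fault.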

Valgrind log: https://gist.github.com/sifmelcara/cef9f8a4d7e4d7f8de1520419476f4b0#file-log-L127

I also observed that placing a mutex in ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) prevents the segmentation fault: Valgrind reports no errors and MXNet no longer crashes,
although I do not believe this is the correct solution.
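
As a rough illustration of why a lock can mask a repeated release, the sketch below serializes a completion callback so that "check, then release" happens as one step. It assumes the repeat comes from two threads racing on that step, which this issue does not establish. ToyRecord, ToyEngine, and ReleasesUnderContention are hypothetical names, not MXNet APIs.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an engine operator record; not MXNet code.
struct ToyRecord {
  bool released = false;  // set once the record has been handed back
};

class ToyEngine {
 public:
  // Hypothetical completion callback. Holding the lock makes the check and
  // the release atomic, so only the first caller releases the record.
  void OnComplete(ToyRecord* rec) {
    std::lock_guard<std::mutex> lock(mu_);
    if (rec->released) {
      ++ignored_;  // a repeated completion for the same record
      return;
    }
    rec->released = true;
    ++releases_;
  }

  int releases() const { return releases_; }
  int ignored() const { return ignored_; }

 private:
  std::mutex mu_;
  int releases_ = 0;
  int ignored_ = 0;
};

// Fires the callback for one record from several threads at once and
// returns how many times the record was actually released.
int ReleasesUnderContention(int num_threads) {
  ToyEngine engine;
  ToyRecord rec;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i) {
    workers.emplace_back([&engine, &rec] { engine.OnComplete(&rec); });
  }
  for (auto& t : workers) t.join();
  return engine.releases();
}
```

A guard like this only hides the symptom. It does not explain why the callback fires more than once for the same record, which is why a lock alone is unlikely to be the right fix.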

I traced the code and still have no idea why the OnComplete callback is called more than once with the same threaded_opr, so I decided to open this issue in the hope that someone can help resolve the problem. Thank you very much.


conopt · 2017-05-02T08:22:08Z

I tried GCC 4.8.5, 5.3.1 and 6.2.1. Cannot reproduce...

sifmelcara · 2017-05-02T16:06:56Z

@lx75249 could you try stressing the machine with something like stress -i 10000 while training the network?
I have tested this issue on two machines. One of them does not segfault under normal conditions, but MXNet crashes immediately when the system is under heavy load.
Thank you for testing this out.

sifmelcara · 2017-05-03T14:57:01Z

@lx75249 Thank you. There is no need to test it again, since I found the bug and opened PR #6084.

sifmelcara mentioned this issue Apr 29, 2017

Errors related to malloc and free #5728

Closed

sifmelcara closed this as completed May 3, 2017
