MXNet C++Interface reasoning leads to CPU memory leak #13265

chenyujing1234 · 2018-11-14T16:26:24Z

I am using the C++ interface of MXNet to run inference for the MTCNN face detection algorithm.
Environment: Ubuntu 16.04.1 + GPU (CUDA 9.0) + MXNet 1.3.0
I use these interfaces: MXPredCreate, MXPredReshape, MXPredForward, MXPredFree...
When I stress-tested with a large number of pictures (running for a long time), I found that my process occupied more and more CPU memory until it eventually used all of it.
The process was then killed by the system.
Later, I wrote a test program:

If I only call MXPredCreate and then MXPredFree, giving MXPredCreate a different width and height each time,
memory increases by about 1 GB after 100 thousand cycles.
If I call MXPredCreate and then MXPredReshape to change the width and height in a continuous loop, memory leaks very quickly: 4 GB in less than half an hour.
Could MXNet's main developers help track this problem down? It has been bothering me for almost a week.

zachgk · 2018-11-14T19:13:49Z

@mxnet-label-bot add [C++, Memory, Bug, Pending Requester Info]

Thanks for raising the issue @chenyujing1234. Can you provide a bit more info so we can better help?

Environment info (Required)

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

Build info (If built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)

Build config:
(Paste the content of config.mk, or the build command.)

Minimum reproducible example

Can you share the code for the program that you used to test the memory leak?

KellenSunderland · 2018-11-14T20:24:04Z

@chenyujing1234 Would you be able to post some sample code, that when run in a loop will leak? If we can run this for a fixed number of iterations (say 1234 loops) we should then be able to look with an ASAN build and see what's leaking.

leleamol · 2018-11-15T01:43:09Z

@mxnet-label-bot update [Bug, C API, Memory, Pending Requester Info]

chenyujing1234 · 2018-11-15T11:57:55Z

You can get my test code from this address. TKS
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

chenyujing1234 · 2018-11-15T11:58:18Z

> @mxnet-label-bot update [Bug, C API, Memory, Pending Requester Info]

You can get my test code from this address. TKS
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

chenyujing1234 · 2018-11-15T11:58:38Z

> @chenyujing1234 Would you be able to post some sample code, that when run in a loop will leak? If we can run this for a fixed number of iterations (say 1234 loops) we should then be able to look with an ASAN build and see what's leaking.

You can get my test code from this address. TKS
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

chenyujing1234 · 2018-11-15T11:59:02Z

> @mxnet-label-bot add [C++, Memory, Bug, Pending Requester Info]
>
> Thanks for raising the issue @chenyujing1234. Can you provide a bit more info so we can better help?
>
> Environment info (Required)
>
> What to do:
> 1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
> 2. Run the script using `python diagnose.py` and paste its output here.
>
> Build info (If built from source)
>
> Compiler (gcc/clang/mingw/visual studio):
>
> MXNet commit hash:
> (Paste the output of git rev-parse HEAD here.)
>
> Build config:
> (Paste the content of config.mk, or the build command.)
>
> Minimum reproducible example
>
> Can you share the code for the program that you used to test the memory leak?

You can get my test code from this address. TKS
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

KellenSunderland · 2018-11-15T18:17:49Z

Great bug report @chenyujing1234. Really appreciate the details. I'll take a look at this on a best-effort basis (might take a while to verify the leaks on my end).

If anyone else from the community has time to jump on this I think it'd be very useful. I've seen a lot of reports of leaks from users.

chinakook · 2018-11-16T06:30:19Z

Are you using an MXNet version that contains MKL-DNN?

chenyujing1234 · 2018-11-16T07:22:41Z

> Are you using an MXNet version that contains MKL-DNN?

No

chenyujing1234 · 2018-11-19T02:13:06Z

Great

Now my project is stuck here, waiting for your reply. Thank you

KellenSunderland · 2018-11-19T02:18:56Z

Any chance you could paste the reproducing code on gist.github.com?

chenyujing1234 · 2018-11-19T02:28:52Z

> Any chance you could paste the reproducing code on gist.github.com?

I am sorry, I can't visit gist.github.com. Can you visit https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA instead?

KellenSunderland · 2018-11-19T02:36:16Z

Looks like I need to install an app to access the files, but no worries, I can likely download the code with a colleague tomorrow.

AvenSun · 2018-11-19T04:57:55Z

@KellenSunderland
I uploaded the test code attached by @chenyujing1234 to Google Drive; you can get it by this link

chenyujing1234 · 2018-11-19T13:04:10Z

> Looks like I need to install an app to access the files, but no worries, I can likely download the code with a colleague tomorrow.

I built an ASAN version and tested with it, and it found memory leaks.
The log is here:
https://gist.github.com/chenyujing1234/0449ecf6f502e5c3538e4f2f018a04e1

KellenSunderland · 2018-11-19T19:05:29Z

Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.

This guide may also help you:
https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet
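
As a rough illustration of the "multiple of the iteration count" check described above, the sketch below filters leak-summary entries by their object count. The `LeakEntry` struct and the `per_iteration_candidates` function are hypothetical names made up for this example; they are not part of ASAN or MXNet, and the counts would have to be copied from the report by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry from a leak summary: how many objects leaked and their total size.
// Hypothetical stand-in for a line of the report, not an ASAN data type.
struct LeakEntry {
    std::size_t object_count;  // number of leaked objects in this entry
    std::size_t total_bytes;   // bytes leaked across those objects
};

// Returns the entries whose object count is a non-zero multiple of
// `iterations`, i.e. candidates for "leaked once (or k times) per loop pass".
std::vector<LeakEntry> per_iteration_candidates(
        const std::vector<LeakEntry>& entries, std::size_t iterations) {
    assert(iterations > 0);
    std::vector<LeakEntry> out;
    for (const LeakEntry& e : entries) {
        if (e.object_count != 0 && e.object_count % iterations == 0) {
            out.push_back(e);
        }
    }
    return out;
}
```

With 123 iterations, entries reporting 123 or 246 leaked objects would be kept, while an entry with a single object would be dropped as a likely one-time allocation. Picking an unusual iteration count (123, 1234) makes accidental matches less likely.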

chenyujing1234 · 2018-11-20T02:32:13Z

> Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.
>
> This guide may also help you:
> https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet

Some of my test code needs to be modified; otherwise a memory overflow error is reported.
include/mxnet_mtcnn.hpp (lines 42-45):
buffer_.reset(new char[length_ + 1]);
ifs.read(buffer_.get(), length_);
ifs.close();
buffer_[length_] = '\0';
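
For anyone adapting this fix, here is a minimal sketch of reading a symbol JSON file into a null-terminated buffer using only the standard library. The function name `read_null_terminated` is an assumption for illustration and is not taken from mxnet_mtcnn.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads a whole file into a buffer that has one extra byte for the '\0'
// terminator. Returns an empty vector if the file cannot be opened or read.
std::vector<char> read_null_terminated(const std::string& path) {
    std::ifstream ifs(path, std::ios::binary | std::ios::ate);
    if (!ifs) return {};
    const std::streamsize length = ifs.tellg();
    if (length < 0) return {};
    ifs.seekg(0, std::ios::beg);
    // Allocate length + 1 bytes, all zero, so the last byte is the terminator.
    std::vector<char> buffer(static_cast<std::size_t>(length) + 1, '\0');
    if (length > 0 && !ifs.read(buffer.data(), length)) return {};
    assert(buffer.back() == '\0');
    return buffer;
}
```

The returned `buffer.data()` can then be used wherever a C string is expected; the buffer size is always the file length plus one.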

chenyujing1234 · 2018-11-21T01:45:13Z

> Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.
>
> This guide may also help you:
> https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet

Are there any results?

chenyujing1234 · 2018-11-22T06:52:50Z

> Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.
>
> This guide may also help you:
> https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet

I've solved one of the memory leak problems: the one caused by using the reshape function in the CPU version.
Cause: the PredictorHandle object is not released after calling Reshape.

However, there are still memory leaks during inference on GPU, and I don't know how to compile an ASAN version for it. Can you help me build an ASAN version for GPU with CUDA 9.0?

chenyujing1234 · 2018-11-23T15:19:53Z

GPU inference still leaks memory. This has troubled me for two weeks; we want to go to production but are stuck here. Have you found anything on your side?

chinakook · 2018-11-24T05:00:47Z

You can drop reshape to avoid this.

loadwiki · 2018-12-20T09:37:07Z

I found the same issue. Every time, I call reshape first and then free the reshaped handle. However, memory usage increases very quickly. My MXNet version is 1.2.1rc1, built with MKL-DNN. The issue appears on both macOS and Ubuntu. @chinakook If I drop reshape, I must call MXPredCreate every time before inference. Would there be a performance issue? Or perhaps I could prepare a few fixed-size models for dynamic input.

chinakook · 2018-12-20T12:51:39Z

Create a fixed model for the largest input image size, and pad every image to that size.
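
A minimal sketch of the padding step, assuming a single-channel, row-major float image that is placed in the top-left corner of a zero-filled canvas. The function name `pad_to_fixed`, the layout, and the pad value are all assumptions for illustration; a real pipeline would need to match the layout and preprocessing the model expects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies an h x w row-major image into the top-left corner of a
// max_h x max_w canvas that is pre-filled with pad_value.
std::vector<float> pad_to_fixed(const std::vector<float>& image,
                                std::size_t h, std::size_t w,
                                std::size_t max_h, std::size_t max_w,
                                float pad_value = 0.0f) {
    assert(image.size() == h * w);
    assert(h <= max_h && w <= max_w);
    std::vector<float> canvas(max_h * max_w, pad_value);
    for (std::size_t row = 0; row < h; ++row) {
        for (std::size_t col = 0; col < w; ++col) {
            canvas[row * max_w + col] = image[row * w + col];
        }
    }
    return canvas;
}
```

The trade-off is extra computation on the padded area in exchange for a single predictor that never needs reshaping. For a detector, any results that fall in the padded region would presumably have to be discarded by the caller.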

chenyujing1234 · 2018-12-24T01:09:36Z

In the end, the leak turned out to come from not releasing the original Pred object when the Reshape function was called; once the original is released, memory no longer leaks.
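
The ownership rule can be sketched without MXNet at all. In the example below, `fake_create`, `fake_reshape` and `fake_free` are hypothetical stand-ins for MXPredCreate, MXPredReshape and MXPredFree; they only count live handles, on the assumption (as described in this thread) that reshape hands back a new handle and leaves the old one alive. Whether the old handle may be freed immediately should be checked against the MXNet version in use.

```cpp
#include <cassert>

// Hypothetical stand-ins for the predictor API; they only track how many
// handles are alive so the ownership rule can be checked.
struct FakePredictor { int width; int height; };
int g_live_handles = 0;

FakePredictor* fake_create(int w, int h) {
    ++g_live_handles;
    return new FakePredictor{w, h};
}

// Returns a NEW handle and leaves `old` alive, like the reshape in this thread.
FakePredictor* fake_reshape(const FakePredictor* old, int w, int h) {
    assert(old != nullptr);
    ++g_live_handles;
    return new FakePredictor{w, h};
}

void fake_free(FakePredictor* p) {
    --g_live_handles;
    delete p;
}

// Reshape "in place": obtain the new handle, then free the one it replaces.
void reshape_and_release(FakePredictor*& handle, int w, int h) {
    FakePredictor* reshaped = fake_reshape(handle, w, h);
    fake_free(handle);
    handle = reshaped;
}

// Runs `iterations` reshapes and returns how many handles are still alive.
int run_reshape_loop(int iterations) {
    FakePredictor* handle = fake_create(12, 12);
    for (int i = 0; i < iterations; ++i) {
        reshape_and_release(handle, 12 + i, 12 + i);
    }
    fake_free(handle);
    return g_live_handles;
}
```

If the `fake_free(handle)` call inside `reshape_and_release` is removed, one handle is left behind per iteration, which is the growth pattern reported above.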

leleamol · 2019-01-04T23:35:45Z

@chenyujing1234 can you please let us know if the issue is fixed? If it is we can close this issue.
Thanks,

leleamol · 2019-01-04T23:35:58Z

@mxnet-label-bot add [Pending Requester Info]

KellenSunderland · 2019-01-15T01:21:14Z

I think this should have been fixed by #13376 (we may want to consider backporting it given the number of reports).

JohnLee168 · 2019-01-17T01:39:30Z

> I found the same issue. Every time, I call reshape first and then free the reshaped handle. However, memory usage increases very quickly. My MXNet version is 1.2.1rc1, built with MKL-DNN. The issue appears on both macOS and Ubuntu. @chinakook If I drop reshape, I must call MXPredCreate every time before inference. Would there be a performance issue? Or perhaps I could prepare a few fixed-size models for dynamic input.

I have the same issue. I just use MXPredCreate and MXPredFree every time, but I still get CPU and GPU memory leaks. It is slow and still leaks memory.

JohnLee168 · 2019-01-17T01:43:20Z

> In the end, the leak turned out to come from not releasing the original Pred object when the Reshape function was called; once the original is released, memory no longer leaks.

You mean we should manually release the former handle after reshape, and manually release the reshaped handle after use?

coolwebgo · 2019-02-01T01:56:22Z

I also got the memory leak, with calls like this:
MXPredCreate();
MXPredSetInput(x);
MXPredForward(x); // memory leak
MXPredFree(x);

When I comment out the MXPredForward() call, the memory leak does not appear (memory use does not grow).

KellenSunderland · 2019-02-03T20:17:13Z

I tried to reproduce the memory leak with an ASAN build and don't see anything that jumps out as a leak. There are a number of data structures allocated and then not released, but these structures don't seem to increase over time/iterations.

My code sample runs a simplification of the image-classification demo. A snippet of the code tested for leaks is here: https://github.com/apache/incubator-mxnet/blob/8e2c0adb61b5f5ceee4d090f1413c8697a61e008/example/image-classification/predict-cpp/image-classification-predict.cc

The results reported by ASAN are here: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/miscellaneous/branches/PR-13917/runs/14/nodes/137/log/?start=0

What I would expect to see if there was a genuine leak in my code would be an ASAN leak summary that would report leaked objects that are some multiple of '1234', my iteration size. I don't see any objects that look like they were leaked as part of a MXPredCreate/MXPredForward/MXPredFree iteration.

What I did notice is that the image-classification sample has a buffer overflow as a result of not properly null-terminating the char* buffer containing the symbolic graph description. I'll see if I can submit a PR to fix that, and if anyone is basing their code on this sample please make sure that the symbol buffer you pass into MXPredCreate is null terminated.

If others are still having problems with leaks, would you be able to post a small sample that replicates the leak? That way others can have a look at your specific sample to determine where this leak is coming from.

arcadiaphy · 2019-02-15T03:53:05Z

@KellenSunderland I found that the memory leaks in the ASAN reports of the image classification example mainly come from https://github.com/apache/incubator-mxnet/blob/master/src/common/object_pool.h#L135

Why was the free operation commented out?

stephenrawls · 2019-02-15T04:30:28Z

@arcadiaphy I'm relatively new to the code base so take this with a grain of salt, but my guess is the following:

Memory allocation on GPU is expensive, so MXNet doesn't really free it during the life of the program; it is re-used through a custom allocator pool.
They could free the memory at program exit when the allocator pool gets destructed, but:
a) Who cares? If the program is exiting anyway, let the OS take care of reclaiming the memory.
b) As the comment in the code you pointed out mentions, they need to be careful not to free the memory until after all other destructors that might still hold pointers to it have run; if there are global variables or something, destructor order can be tricky, and why bother anyway because of (a).

Actually just dug a little more and saw this comment:
https://github.com/apache/incubator-mxnet/blob/149d8105ea3c4dfd63c3c7e25b3be1e4c4f2ec45/src/engine/threaded_engine.h#L543-L549

That points here:
#309

So my guess is my explanation is probably mostly correct.

arcadiaphy · 2019-02-15T06:33:40Z

@stephenrawls The object pool is used to quickly allocate variables when the threaded engine schedules computing operations, so it's not related to GPU memory.

I think the object pools will only be destructed at program exit, so I cannot imagine what memory problems will be caused during destruction. I re-added the free operation several weeks ago and have found no problems in my own use.

After fixing this memory pool issue, some ASAN memory tests should be set up in CI to avoid memory leaks in the C++ interface.

stephenrawls · 2019-02-15T07:03:48Z

I see, thanks for background. I didn't look closely so just assumed it was about GPU memory.

Re: "I cannot imagine what memory problems will be caused during destruction".

I'm not an expert, but I think that in general destruction order is undefined for global variables and it can cause memory related problems if the memory an object needs during destruction has been freed before its destructor runs. Because the order of destruction is undefined things often seem to run fine until you get unlucky and the order changes.
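
One common way to sidestep this kind of destruction-order problem is a never-destroyed singleton: the pool is allocated once and intentionally not deleted, so nothing can touch it after it is gone. The sketch below is generic C++ and not MXNet's object_pool.h; `BlockPool` and its methods are hypothetical, and `Acquire`/`Release` are not thread-safe as written.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool; not MXNet's ObjectPool.
class BlockPool {
public:
    // Never-destroyed singleton: created on first use and never deleted,
    // so no destructor running at exit can find the pool already torn down.
    static BlockPool& Get() {
        static BlockPool* instance = new BlockPool();
        return *instance;
    }

    // Hands out a recycled block if one is available, otherwise allocates.
    void* Acquire() {
        if (free_list_.empty()) {
            ++total_allocated_;
            return ::operator new(kBlockSize);
        }
        void* block = free_list_.back();
        free_list_.pop_back();
        return block;
    }

    // Returns a block to the free list instead of releasing it to the OS.
    void Release(void* block) {
        assert(block != nullptr);
        free_list_.push_back(block);
    }

    std::size_t total_allocated() const { return total_allocated_; }

private:
    BlockPool() = default;
    static constexpr std::size_t kBlockSize = 64;
    std::vector<void*> free_list_;
    std::size_t total_allocated_ = 0;
};
```

Memory held by such a pool is never returned before exit, but it also does not grow with the number of acquire/release cycles, which is the distinction drawn earlier in the thread between retained allocations and a genuine per-iteration leak.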

See the issue I mentioned earlier, or this one: #12613

No idea if it applies here or not though.

arcadiaphy · 2019-02-15T07:32:03Z

@stephenrawls Thanks for mentioning the related issues; they are all singleton destruction problems. The crash happens when accessing singletons that were destructed too early. The main problem with object pools is fixed in #312, though maybe there are still some underlying issues.

I think the correct way is to just let the problems happen, and then fix them so that we approach leak-free code.

anirudh2290 · 2019-03-08T03:52:11Z

Thanks for the great discussion! @JohnLee168 and @coolwebgo are you guys still seeing the memory leak? Can you provide a reproducible script for this?

lanking520 · 2019-07-17T22:23:41Z

@chenyujing1234 Closing this issue for inactivity. Please feel free to reopen if the problem persists.

marcoabreu added Bug C++ Related to C++ Memory Pending Requester Info labels Nov 14, 2018

marcoabreu added C API and removed C++ Related to C++ labels Nov 15, 2018

KellenSunderland mentioned this issue Jan 17, 2019: [WIP] Enable image classification mem leak test #13917 (Open)

arcadiaphy mentioned this issue Feb 15, 2019: uncomment memory pool free #14176 (Closed)

lanking520 closed this as completed Jul 17, 2019
