This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet C++Interface reasoning leads to CPU memory leak #13265

Closed
chenyujing1234 opened this issue Nov 14, 2018 · 39 comments

Comments

@chenyujing1234

I am using the MXNet C++ interface to run inference with the MTCNN face-detection algorithm.
Environment: Ubuntu 16.04.1 + GPU (CUDA 9.0) + MXNet 1.3.0
I use the following interface functions: MXPredCreate, MXPredReshape, MXPredForward, MXPredFree, ...
When I stress-tested with a large number of pictures (running for a long time), I found that my process used more and more CPU memory until it eventually consumed all of it.
The process was then killed by the system.
I then wrote a test program and found:

  1. If I only call MXPredCreate and then MXPredFree in a loop, passing a different width and height to MXPredCreate each time, memory grows by about 1 GB after 100 thousand cycles.
  2. If I call MXPredCreate and then MXPredReshape to change the width and height, repeating this in a continuous loop, memory leaks very quickly: about 4 GB in less than half an hour.

Could MXNet's main developers help track down this problem? It has been bothering me for almost a week.
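
A minimal sketch of how growth like this could be measured per iteration, assuming a Linux host where `/proc/self/status` contains a `VmRSS:` line. The helper names `parse_vm_rss_kb` and `current_rss_kb` are hypothetical and not part of MXNet.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MXNet): extract the VmRSS value, in kB,
// from text formatted like /proc/self/status. Returns -1 if no VmRSS line.
long parse_vm_rss_kb(std::istream& status) {
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            fields >> kb;  // the trailing "kB" unit is ignored
            return fields.fail() ? -1 : kb;
        }
    }
    return -1;
}

// Sample the resident set size of the current process (Linux only).
long current_rss_kb() {
    std::ifstream status("/proc/self/status");
    return status ? parse_vm_rss_kb(status) : -1;
}
```

Sampling `current_rss_kb()` before and after a fixed number of create/free cycles and dividing the difference by the cycle count gives a rough per-iteration growth figure. Allocator caching can make individual samples noisy, so longer runs are more informative.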
@zachgk
Contributor

zachgk commented Nov 14, 2018

@mxnet-label-bot add [C++, Memory, Bug, Pending Requester Info]

Thanks for raising the issue @chenyujing1234. Can you provide a bit more info so we can better help?

Environment info (Required)

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

Build info (If built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)

Build config:
(Paste the content of config.mk, or the build command.)

Minimum reproducible example

Can you share the code for the program that you used to test the memory leak?

@KellenSunderland
Contributor

@chenyujing1234 Would you be able to post some sample code, that when run in a loop will leak? If we can run this for a fixed number of iterations (say 1234 loops) we should then be able to look with an ASAN build and see what's leaking.

@leleamol
Contributor

@mxnet-label-bot update [Bug, C API, Memory, Pending Requester Info]

@marcoabreu marcoabreu added C API and removed C++ Related to C++ labels Nov 15, 2018
@chenyujing1234
Author

You can get my test code from this address. Thanks.
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

@chenyujing1234
Author

> @mxnet-label-bot update [Bug, C API, Memory, Pending Requester Info]

You can get my test code from this address. Thanks.
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

@chenyujing1234
Author

> @chenyujing1234 Would you be able to post some sample code, that when run in a loop will leak? If we can run this for a fixed number of iterations (say 1234 loops) we should then be able to look with an ASAN build and see what's leaking.

You can get my test code from this address. Thanks.
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

@chenyujing1234
Author

> @mxnet-label-bot add [C++, Memory, Bug, Pending Requester Info]
>
> Thanks for raising the issue @chenyujing1234. Can you provide a bit more info so we can better help?
>
> Environment info (Required)
>
> What to do:
> 1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
> 2. Run the script using `python diagnose.py` and paste its output here.
>
> Build info (If built from source)
>
> Compiler (gcc/clang/mingw/visual studio):
>
> MXNet commit hash:
> (Paste the output of git rev-parse HEAD here.)
>
> Build config:
> (Paste the content of config.mk, or the build command.)
>
> Minimum reproducible example
>
> Can you share the code for the program that you used to test the memory leak?

You can get my test code from this address. Thanks.
https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA

@KellenSunderland
Contributor

Great bug report @chenyujing1234. Really appreciate the details. I'll take a look at this on a best-effort basis (might take a while to verify the leaks on my end).

If anyone else from the community has time to jump on this I think it'd be very useful. I've seen a lot of reports of leaks from users.

@chinakook
Contributor

Are you using an MXNet version that contains MKL-DNN?

@chenyujing1234
Author

> Are you using an MXNet version that contains MKL-DNN?

No

@chenyujing1234
Author

Great

My project is now stuck on this, and I am waiting for your reply. Thank you.

@KellenSunderland
Contributor

Any chance you could paste the reproducing code on gist.github.com?

@chenyujing1234
Author

> Any chance you could paste the reproducing code on gist.github.com?

I am sorry, I can't access gist.github.com. Can you access https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA instead?

@KellenSunderland
Contributor

Looks like I need to install an app to access the files, but no worries, I can likely download the code with a colleague tomorrow.

@AvenSun
Contributor

AvenSun commented Nov 19, 2018

@KellenSunderland
I uploaded the test code attached by @chenyujing1234 to Google Drive; you can get it via this link.

@chenyujing1234
Author

> Looks like I need to install an app to access the files, but no worries, I can likely download the code with a colleague tomorrow.

I built an ASAN version and tested with it; it found memory leaks.
The log is here:
https://gist.github.com/chenyujing1234/0449ecf6f502e5c3538e4f2f018a04e1

@KellenSunderland
Contributor

Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak whose instance count is a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.

This guide may also help you:
https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet
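
The counting trick described above can be sketched as a small filter. The `LeakRecord` struct and the function name are assumptions for illustration only. LeakSanitizer summaries list leaks per allocation stack as "N byte(s) in M object(s)", and M is the count that would be fed in here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one entry per leak stack in a sanitizer report.
struct LeakRecord {
    std::string site;          // e.g. top frame of the allocation stack
    std::size_t object_count;  // number of leaked objects from that stack
};

// Keep only the records whose object count is a non-zero multiple of the
// loop iteration count; those are the candidates for per-iteration leaks.
std::vector<LeakRecord> per_iteration_candidates(
        const std::vector<LeakRecord>& records, std::size_t iterations) {
    std::vector<LeakRecord> out;
    if (iterations == 0) return out;
    for (const LeakRecord& r : records) {
        if (r.object_count != 0 && r.object_count % iterations == 0) {
            out.push_back(r);
        }
    }
    return out;
}
```

An unusual iteration count such as 123 or 1234 makes it less likely that a one-time allocation matches by coincidence.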

@chenyujing1234
Author

> Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.
>
> This guide may also help you:
> https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet

Some of my test code needs to be modified; otherwise a memory overflow error is reported.
include/mxnet_mtcnn.hpp (lines 42-45):
buffer_.reset(new char[length_ + 1]);
ifs.read(buffer_.get(), length_);
ifs.close();
buffer_[length_] = '\0';
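
A self-contained sketch of the same fix, for readers who do not have the test project at hand. The function name `read_null_terminated` is hypothetical; the point is only that the buffer holds one byte more than the data and ends with `'\0'`.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Read an entire stream into a buffer that is guaranteed to end with '\0',
// so it can be handed to an API expecting a C string.
std::vector<char> read_null_terminated(std::istream& in) {
    std::vector<char> buffer((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
    buffer.push_back('\0');  // one extra byte for the terminator
    return buffer;
}
```

For a file, the stream would typically be a `std::ifstream` opened in binary mode, and `buffer.data()` is what would be passed as the symbol JSON string.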

@chenyujing1234
Author

> Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.
>
> This guide may also help you:
> https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet

Are there any results yet?

@chenyujing1234
Author

> Great work, looks like the ASAN build is set up right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances.
>
> This guide may also help you:
> https://cwiki.apache.org/confluence/display/MXNET/Detecting+Memory+Leaks+and+Buffer+Overflows+in+MXNet

I've solved one of the memory leak problems: the one caused by using the reshape function in the CPU version.
Cause: the PredictorHandle object was not released after calling Reshape.

However, there are still memory leaks when running inference on the GPU, and I don't know how to build an ASAN version for it. Can you help me build an ASAN version for GPU (CUDA 9.0)?

@chenyujing1234
Author

GPU inference still leaks memory. This has troubled me for two weeks; we want to go live but are stuck here. Have you found anything on your side?

@chinakook
Contributor

You can drop reshape to avoid this.

@loadwiki

I found the same issue. Each time, I call reshape first and then free the reshaped handle; however, memory usage increases very quickly. My MXNet version is 1.2.1rc1, built with MKL-DNN. The issue appears on both macOS and Ubuntu. @chinakook If I drop reshape, I must call MXPredCreate every time before running inference. Would that cause a performance issue? Alternatively, it seems I could prepare a few fixed-size models for dynamic input.

@chinakook
Contributor

Create a fixed model for the biggest size of input image, and pad every image to this biggest one.
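
A minimal sketch of the padding step, assuming a single-channel image stored row-major as floats. `pad_to_canvas` and its parameters are hypothetical names; a real pipeline would also handle multiple channels and map detections back to the original image size.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copy a single-channel, row-major image into the top-left corner of a
// fixed-size canvas, filling the rest with `fill`.
std::vector<float> pad_to_canvas(const std::vector<float>& image,
                                 std::size_t width, std::size_t height,
                                 std::size_t canvas_width,
                                 std::size_t canvas_height,
                                 float fill = 0.0f) {
    if (image.size() != width * height) {
        throw std::invalid_argument("image size does not match width * height");
    }
    if (width > canvas_width || height > canvas_height) {
        throw std::invalid_argument("image is larger than the canvas");
    }
    std::vector<float> canvas(canvas_width * canvas_height, fill);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            canvas[y * canvas_width + x] = image[y * width + x];
        }
    }
    return canvas;
}
```

Because every input then has the same shape, a single predictor created for the canvas size can be reused without reshaping. The cost is extra computation on the padded area.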

@chenyujing1234
Author

In the end, I found the cause: the original Pred object was not released when the Reshape function was called. Once it is released, the leak goes away.
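
The ownership rule described here can be sketched with an RAII wrapper. To keep the example runnable without MXNet, `FakePredCreate`, `FakePredReshape`, and `FakePredFree` are hypothetical stand-ins that only mimic the behaviour reported in this thread (reshape hands back a new handle and leaves the old one alive); they are not the real `MXPred*` signatures.

```cpp
#include <cassert>

// Stand-ins for the predict API so the sketch runs without MXNet. They only
// mimic the ownership rule discussed here: reshape returns a NEW handle and
// leaves the old one alive. These are NOT the real MXPred* signatures.
using FakeHandle = int*;
static int g_live_handles = 0;

FakeHandle FakePredCreate(int width) { ++g_live_handles; return new int(width); }
FakeHandle FakePredReshape(FakeHandle /*old_handle*/, int width) {
    ++g_live_handles;
    return new int(width);
}
void FakePredFree(FakeHandle h) { --g_live_handles; delete h; }

// Owns exactly one handle and frees it on destruction.
class Predictor {
public:
    explicit Predictor(int width) : handle_(FakePredCreate(width)) {}
    ~Predictor() { FakePredFree(handle_); }
    Predictor(const Predictor&) = delete;
    Predictor& operator=(const Predictor&) = delete;

    // Reshape, then release the handle that was current before the call.
    void Reshape(int width) {
        FakeHandle reshaped = FakePredReshape(handle_, width);
        FakePredFree(handle_);  // without this, one handle leaks per reshape
        handle_ = reshaped;
    }
    int width() const { return *handle_; }

private:
    FakeHandle handle_;
};

// Reshape `iterations` times and report how many handles are still alive.
int live_handles_after_reshapes(int iterations) {
    {
        Predictor p(12);
        for (int i = 0; i < iterations; ++i) p.Reshape(12 + i);
        assert(g_live_handles == 1);  // only the current handle remains
    }
    return g_live_handles;
}
```

Under the behaviour reported in this thread, the same shape would apply to the real API: after a successful reshape, free the previous handle and keep the new one, then free the new one when it is no longer needed.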

@leleamol
Contributor

leleamol commented Jan 4, 2019

@chenyujing1234 can you please let us know if the issue is fixed? If it is we can close this issue.
Thanks,

@leleamol
Contributor

leleamol commented Jan 4, 2019

@mxnet-label-bot add [Pending Requester Info]

@KellenSunderland
Contributor

I think this should have been fixed by #13376 (we may want to consider backporting it given the number of reports).

@JohnLee168

> I found the same issue. Each time, I call reshape first and then free the reshaped handle; however, memory usage increases very quickly. My MXNet version is 1.2.1rc1, built with MKL-DNN. The issue appears on both macOS and Ubuntu. @chinakook If I drop reshape, I must call MXPredCreate every time before running inference. Would that cause a performance issue? Alternatively, it seems I could prepare a few fixed-size models for dynamic input.

I have the same issue. I just call MXPredCreate and MXPredFree every time, but I still get CPU and GPU memory leaks. It is slow, and it still leaks memory.

@JohnLee168

> Finally, it was found that because the original Pred object was not released when Reshape Function was called, it would not be leaked if it was released.

You mean we should manually release the former handle after reshape, and manually release the reshaped handle after use?

@coolwebgo

I also got the memory leak, with a call sequence like this:
MXPredCreate();
MXPredSetInput(x);
MXPredForward(x); // memory leak
MXPredFree(x);

When I remove the call to MXPredForward(), the memory leak does not appear (memory use does not grow).

@KellenSunderland
Contributor

KellenSunderland commented Feb 3, 2019

I tried to reproduce the memory leak with an ASAN build and don't see anything that jumps out as a leak. There's a number of data structures allocated and then not released, but these structures don't seem to increase over time/iterations.

My code sample runs a simplification of the image-classification demo. A snippet of the code tested for leaks is here: https://github.com/apache/incubator-mxnet/blob/8e2c0adb61b5f5ceee4d090f1413c8697a61e008/example/image-classification/predict-cpp/image-classification-predict.cc

The results reported by ASAN are here: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/miscellaneous/branches/PR-13917/runs/14/nodes/137/log/?start=0

What I would expect to see if there was a genuine leak in my code would be an ASAN leak summary that would report leaked objects that are some multiple of '1234', my iteration size. I don't see any objects that look like they were leaked as part of a MXPredCreate/MXPredForward/MXPredFree iteration.

What I did notice is that the image-classification sample has a buffer overflow as a result of not properly null-terminating the char* buffer containing the symbolic graph description. I'll see if I can submit a PR to fix that, and if anyone is basing their code on this sample please make sure that the symbol buffer you pass into MXPredCreate is null terminated.

If others are still having problems with leaks, would you be able to post a small sample that replicates the leak? That way others can have a look at your specific sample to determine where this leak is coming from.

@arcadiaphy
Member

@KellenSunderland I found the memory leaks in ASAN reports of image classification example mainly come from https://github.com/apache/incubator-mxnet/blob/master/src/common/object_pool.h#L135

Why was the free operation commented out?

@stephenrawls
Contributor

@arcadiaphy I'm relatively new to the code base so take this with a grain of salt, but my guess is the following:

  1. Memory allocation on GPU is expensive, so MXNet doesn't really free it during the life of the program; they re-use it through their custom allocator pool
  2. They could free the memory at program exit when the allocator pool gets destructed but:
    a) Who cares, if the program is exiting anyway, let the OS take care of re-claiming the memory
    b) As the comment in the code you pointed out mentions, they need to be careful that they don't free the memory until after all other destructors that might still have pointers to the memory try to access it; if there are global variables or something then destructor order can be tricky and why bother anyway because of (a)

Actually just dug a little more and saw this comment:
https://github.com/apache/incubator-mxnet/blob/149d8105ea3c4dfd63c3c7e25b3be1e4c4f2ec45/src/engine/threaded_engine.h#L543-L549

That points here:
#309

So my guess is my explanation is probably mostly correct.

@arcadiaphy
Member

arcadiaphy commented Feb 15, 2019

@stephenrawls The object pool is used to quickly allocate variables when threaded engine schedules computing operations, so it's not related to GPU memory.

I think the object pools will only be destructed at program exit, so I cannot imagine what memory problems would be caused during destruction. I re-added the free operation several weeks ago and have found no problems in my own use.

After fixing this memory pool issue, some ASAN memory tests should be set up in CI to prevent memory leaks in the C++ interface.

@stephenrawls
Contributor

I see, thanks for background. I didn't look closely so just assumed it was about GPU memory.

Re: "I cannot imagine what memory problems will be caused during destruction".

I'm not an expert, but I think that in general destruction order is undefined for global variables and it can cause memory related problems if the memory an object needs during destruction has been freed before its destructor runs. Because the order of destruction is undefined things often seem to run fine until you get unlucky and the order changes.

See the issue I mentioned earlier, or this one: #12613

No idea if it applies here or not though.
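
A minimal sketch of the trade-off being discussed, not MXNet's actual code: a "construct on first use, never destroy" singleton avoids destruction-order problems at the cost of memory that is never freed. `NamePool`, `GetPool`, and `Logger` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical pool used only for illustration.
class NamePool {
public:
    void Add(const std::string& name) { names_.push_back(name); }
    std::size_t size() const { return names_.size(); }
private:
    std::vector<std::string> names_;
};

// Construct on first use and never destroy: the object is intentionally
// leaked, so destructors of other globals can still call GetPool() safely
// during program exit, whatever order they run in.
NamePool& GetPool() {
    static NamePool* pool = new NamePool();
    return *pool;
}

// A global whose destructor touches the pool at exit.
struct Logger {
    ~Logger() { GetPool().Add("logger destroyed"); }
};
static Logger g_logger;
```

If `GetPool` instead returned a plain global object, `g_logger`'s destructor could run after that object had already been destroyed, which is the kind of crash the linked issues describe.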

@arcadiaphy
Member

arcadiaphy commented Feb 15, 2019

@stephenrawls Thanks for mentioning the related issues; they are all singleton destruction problems. Crashes happen when accessing singletons that were destructed too early. The main problem with object pools was fixed in #312; maybe there are still some underlying issues.

I think the right approach is to let the problems surface, then fix them and move toward leak-free code.

@anirudh2290
Member

Thanks for the great discussion! @JohnLee168 and @coolwebgo are you still seeing the memory leak? Can you provide a script that reproduces it?

@lanking520
Member

@chenyujing1234 Closing this issue for inactivity. Please feel free to reopen if the problem persists.
