MXNet C++ interface inference leads to CPU memory leak #13265
Comments
@mxnet-label-bot add [C++, Memory, Bug, Pending Requester Info]
Thanks for raising the issue @chenyujing1234. Can you provide a bit more info so we can better help?
Environment info (Required)
Build info (If built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
Build config:
Minimum reproducible example
Can you share the code for the program that you used to test the memory leak? |
@chenyujing1234 Would you be able to post some sample code, that when run in a loop will leak? If we can run this for a fixed number of iterations (say 1234 loops) we should then be able to look with an ASAN build and see what's leaking. |
@mxnet-label-bot update [Bug, C API, Memory, Pending Requester Info] |
You can get my test code from this address. Thanks. |
Great bug report @chenyujing1234. Really appreciate the details. I'll take a look at this on a best-effort basis (might take a while to verify the leaks on my end). If anyone else from the community has time to jump on this I think it'd be very useful. I've seen a lot of reports of leaks from users. |
Are you using an MXNet version that contains MKL-DNN? |
No |
Now my project is stuck here, waiting for your reply. Thank you |
Any chance you could paste the reproducing code on gist.github.com? |
I am sorry, I can't access gist.github.com. Can you access https://pan.baidu.com/s/19Wtd_Cf1BGF-2MS3le1mYA instead? |
Looks like I need to install an app to access the files, but no worries, I can likely download the code with a colleague tomorrow. |
@KellenSunderland |
I built an ASAN version and tested it; it found a memory leak. |
Great work, looks like the ASAN build is setup right and detecting errors. What I would do now is run the offending code in a loop with a fixed number of iterations, then look for a leak that has a multiple of that number. If there's a genuine leak it should be present there. For example you could run MXPredCreate, MXPredReshape, MXPredForward and then MXPredFree in a loop 123 times, then look for mem leaks with a multiple of 123 instances. This guide may also help you: |
|
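The fixed-iteration idea described above can be sketched without MXNet. In this minimal sketch, `FakeCreate`, `FakeFree`, `LiveAfterLoop`, and the `g_live` counter are hypothetical stand-ins (they are not part of the MXNet API): the counter plays the role of the object count in an ASAN leak summary, and one code path deliberately forgets the free call.

```cpp
#include <cstddef>

// Hypothetical stand-ins for a create/free API pair; NOT the MXNet API.
// No real memory is allocated: g_live only counts "objects" not yet freed.
static std::size_t g_live = 0;

int FakeCreate() { ++g_live; return static_cast<int>(g_live); }
void FakeFree(int /*handle*/) { --g_live; }

// Runs `iterations` create/free cycles and returns how many objects are
// still alive afterwards. With forget_free == true the free call is skipped,
// imitating a leak that happens once per iteration.
std::size_t LiveAfterLoop(std::size_t iterations, bool forget_free) {
  g_live = 0;
  for (std::size_t i = 0; i < iterations; ++i) {
    int handle = FakeCreate();
    if (!forget_free) FakeFree(handle);
  }
  return g_live;
}
```

With 123 iterations, the leaky path leaves 123 live objects, a multiple of the iteration count, while the clean path leaves none. Allocations that do not scale with the iteration count are more likely one-time setup than a per-call leak.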
Are there any results? |
I've solved one of the memory leak problems: the one caused by using the reshape function in the CPU version. |
GPU inference still leaks memory. This has troubled me for two weeks; we want to go to production but are stuck here. Have you found anything on your end? |
You can drop reshape to avoid this. |
I found the same issue. Every time, I reshape first and then free the reshaped handle, yet memory usage grows very quickly. My MXNet version is 1.2.1rc1, built with MKL-DNN. The issue appears on both macOS and Ubuntu. @chinakook If I drop reshape, I must call MXPredCreate every time before running inference. Would that cause a performance issue? Alternatively, it seems I could prepare a few fixed-size models for dynamic input. |
Create a fixed model for the biggest size of input image, and pad every image to this biggest one. |
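A minimal sketch of the padding step suggested above. The function name `PadToMax`, the single-channel row-major layout, the top-left placement, and the zero fill value are all assumptions made for illustration; a real pipeline would apply the same idea per channel in whatever layout the model expects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a row-major, single-channel h x w image into the top-left corner of
// a max_h x max_w buffer; the remaining cells are filled with pad_value.
std::vector<float> PadToMax(const std::vector<float>& src,
                            std::size_t h, std::size_t w,
                            std::size_t max_h, std::size_t max_w,
                            float pad_value = 0.0f) {
  assert(src.size() == h * w);        // input must match its stated shape
  assert(h <= max_h && w <= max_w);   // image must fit in the fixed size
  std::vector<float> dst(max_h * max_w, pad_value);
  for (std::size_t y = 0; y < h; ++y) {
    for (std::size_t x = 0; x < w; ++x) {
      dst[y * max_w + x] = src[y * w + x];
    }
  }
  return dst;
}
```

Because every input then has the same shape, one predictor created for `max_h x max_w` can be reused and no reshape call is needed. The caller has to ignore any output that falls in the padded region.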
In the end, the cause turned out to be that the original predictor object was not released when the reshape function was called; once it is released as well, nothing leaks. |
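The ownership rule described above (free both the original and the reshaped handle) can be sketched with RAII. The functions `FakePredCreate`, `FakePredReshape`, and `FakePredFree` below are hypothetical stand-ins that only imitate the shape of a create/reshape/free C API and count live handles; they are not the MXNet functions, and the assumption that reshape returns a new handle while leaving the original alive is taken from the discussion in this thread.

```cpp
#include <memory>

// Hypothetical stand-ins; NOT the MXNet API. They only track live handles.
using FakeHandle = void*;
static int g_live_handles = 0;

int FakePredCreate(FakeHandle* out) {
  *out = new int(0);
  ++g_live_handles;
  return 0;
}
// Returns a NEW handle and leaves the input handle alive (assumption).
int FakePredReshape(FakeHandle in, FakeHandle* out) {
  (void)in;
  *out = new int(0);
  ++g_live_handles;
  return 0;
}
int FakePredFree(FakeHandle h) {
  delete static_cast<int*>(h);
  --g_live_handles;
  return 0;
}

// Deleter so std::unique_ptr calls the free function automatically.
struct HandleDeleter {
  void operator()(void* h) const { FakePredFree(h); }
};
using ScopedHandle = std::unique_ptr<void, HandleDeleter>;

// Creates a predictor, reshapes it, and returns how many handles are alive
// while both are in scope. Both are released when the function returns.
int LiveHandlesDuringUse() {
  FakeHandle raw = nullptr;
  FakePredCreate(&raw);
  ScopedHandle original(raw);

  FakeHandle reshaped_raw = nullptr;
  FakePredReshape(original.get(), &reshaped_raw);
  ScopedHandle reshaped(reshaped_raw);

  return g_live_handles;  // the original AND the reshaped handle
}
```

Wrapping each handle as soon as it is obtained means neither one can be forgotten on an early return or error path.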
@chenyujing1234 can you please let us know if the issue is fixed? If it is we can close this issue. |
@mxnet-label-bot add [Pending Requester Info] |
I think this should have been fixed by #13376 (we may want to consider backporting it given the number of reports). |
I have the same issue. I just call MXPredCreate and MXPredFree every time, but I still get CPU and GPU memory leaks. It is slow and it still leaks memory. |
You mean we should manually release the former handle after reshape, and manually release the reshaped handle after use? |
I also see the memory leak. When I comment out the call to MXPredForward(), the leak does not show up (memory use does not grow). |
I tried to reproduce the memory leak with an ASAN build and don't see anything that jumps out as a leak. There are a number of data structures allocated and then not released, but these structures don't seem to increase over time/iterations.
My code sample runs a simplification of the image-classification demo. A snippet of the code tested for leaks is here: https://github.com/apache/incubator-mxnet/blob/8e2c0adb61b5f5ceee4d090f1413c8697a61e008/example/image-classification/predict-cpp/image-classification-predict.cc
The results reported by ASAN are here: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/miscellaneous/branches/PR-13917/runs/14/nodes/137/log/?start=0
What I would expect to see if there was a genuine leak in my code would be an ASAN leak summary that reports leaked objects in some multiple of '1234', my iteration count. I don't see any objects that look like they were leaked as part of a MXPredCreate/MXPredForward/MXPredFree iteration.
What I did notice is that the image-classification sample has a buffer overflow as a result of not properly null-terminating the char* buffer containing the symbolic graph description. I'll see if I can submit a PR to fix that, and if anyone is basing their code on this sample please make sure that the symbol buffer you pass into MXPredCreate is null terminated.
If others are still having problems with leaks, would you be able to post a small sample that replicates the leak? That way others can have a look at your specific sample to determine where this leak is coming from. |
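One way to guarantee the null termination mentioned above is to load the symbol file into a `std::string` rather than a raw `char*` buffer. This is a minimal sketch, not the fix that was submitted to the sample; the helper name `ReadTextFile` is hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file into a std::string. std::string::c_str() always returns
// a null-terminated buffer, so the result can be handed to a C API that
// expects a C string. Returns an empty string if the file cannot be opened.
std::string ReadTextFile(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

The symbol JSON can then be passed as `symbol_json.c_str()`, which avoids sizing a buffer by hand and forgetting the extra byte for the terminator.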
@KellenSunderland I found that the memory leaks in the ASAN reports of the image classification example mainly come from https://github.com/apache/incubator-mxnet/blob/master/src/common/object_pool.h#L135 Why was the free operation commented out? |
@arcadiaphy I'm relatively new to the code base so take this with a grain of salt, but my guess is the following:
Actually just dug a little more and saw this comment: That points here: So my guess is my explanation is probably mostly correct. |
@stephenrawls The object pool is used to quickly allocate variables when the threaded engine schedules computing operations, so it's not related to GPU memory. I think the object pools will only be destructed at program exit, so I cannot imagine what memory problems would be caused during destruction. I re-added the free operation several weeks ago and have found no problems in my own use. After fixing this memory pool issue, some ASAN memory tests should be set up in CI to avoid memory leaks in the C++ interface. |
I see, thanks for background. I didn't look closely so just assumed it was about GPU memory. Re: "I cannot imagine what memory problems will be caused during destruction". I'm not an expert, but I think that in general destruction order is undefined for global variables and it can cause memory related problems if the memory an object needs during destruction has been freed before its destructor runs. Because the order of destruction is undefined things often seem to run fine until you get unlucky and the order changes. See the issue I mentioned earlier, or this one: #12613 No idea if it applies here or not though. |
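The trade-off discussed above can be illustrated with a "construct on first use, never destroy" singleton. This is a generic C++ idiom sketched under the assumption that it resembles the motivation for skipping the free; it is not MXNet's object pool code, and the `Pool` type here is hypothetical.

```cpp
// A singleton that is created on first use and intentionally never deleted.
// Because no destructor runs at program exit, other objects that are torn
// down late can still use it safely. The cost is memory that is never
// explicitly released.
struct Pool {
  int allocations = 0;

  static Pool& Get() {
    static Pool* instance = new Pool();  // never deleted on purpose
    return *instance;
  }
};
```

A function-local `static Pool instance;` would be destroyed at exit instead, which releases the memory but brings back the question of what happens if something else touches the pool after its destructor has run.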
@stephenrawls Thanks for mentioning the related issues; they are all singleton destruction problems. Crashes happen when accessing singletons that were destructed too early. The main problem of object pools is fixed in #312; maybe there are still some underlying issues. I think the right approach is to let the problems surface and then fix them, so that we move toward leak-free code. |
Thanks for the great discussion! @JohnLee168 and @coolwebgo are you guys still seeing the memory leak? Can you provide a script that reproduces it? |
@chenyujing1234 Closing this issue due to inactivity. Please feel free to reopen if the problem persists. |
I am using the C++ interface of MXNet to run inference with the MTCNN algorithm for face detection.
Environment: Ubuntu 16.04.1 + GPU (CUDA 9.0) + MXNet 1.3.0
I use the interface: MXPredCreate MXPredReshape MXPredForward MXPredFree...
When I ran a stress test with a large number of pictures (running for a long time), I found that my process occupied more and more CPU memory, and eventually it occupied all the memory.
My process was then killed by the system.
Later, I wrote a test program:
After 100 thousand cycles, memory usage increases by 1 GB.
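To put a per-iteration number on growth like this, the process's peak resident set size can be sampled before and after the loop. The sketch below assumes a POSIX system with `getrusage`; the helper names `PeakRssKilobytes` and `GrowthBytesPerIteration` are hypothetical and are not taken from the original test program.

```cpp
#include <sys/resource.h>

// Peak resident set size of the current process in kilobytes, or -1 on
// error. ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
long PeakRssKilobytes() {
  struct rusage usage {};
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
#ifdef __APPLE__
  return usage.ru_maxrss / 1024;
#else
  return usage.ru_maxrss;
#endif
}

// Average growth in bytes per iteration between two readings in kilobytes.
double GrowthBytesPerIteration(long before_kb, long after_kb,
                               long iterations) {
  return static_cast<double>(after_kb - before_kb) * 1024.0 /
         static_cast<double>(iterations);
}
```

Reading the reported 1 GB as 1 GiB (an assumption), `GrowthBytesPerIteration(0, 1048576, 100000)` gives roughly 10,700 bytes per cycle, which suggests a small object leaked on every call rather than a whole model being leaked each time.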
Could MXNet's main developers please look into this problem and help track it down? It has been bothering me for almost a week.