Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

uncomment memory pool free #14176

Closed
wants to merge 4 commits into from
Closed

Conversation

arcadiaphy
Copy link
Member

Description

When detecting memory leaks on c++ inference code using ASAN, I have found that almost all of the leaks in ASAN reports come from unreleased memory in object pool. The free code is deliberately commented out to avoid program crash from accessing too early destructed objects in global singletons.

The main problem of object pool is fixed in #312, maybe there are still some underlying issues. I have re-added the free operation, and experienced no problems in several weeks' usage.

I think the correct way is to just let problems happen, then we can fix them to approach leak-free codes.

Since issues like #13265 are reported, some ASAN tests are suggested to be added in CI to assure absolutely no leaks in c++ interface.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@marcoabreu
Copy link
Contributor

Hi,

thanks a lot for your input. We already have an ASAN setup ready in CI, but it is set as non blocking:

https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L409
https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L1021

I'd recommend to enable the blocking for memory leaks and then incrementally work in your PR to address these leaks until CI is green. What do you think?

@marcoabreu
Copy link
Contributor

@KellenSunderland fyi

@KellenSunderland
Copy link
Contributor

KellenSunderland commented Feb 15, 2019

Yep, agree with the assessment here. If we can ensure properly shutdown we can re-enable these frees and fix these ASAN reports. I also wonder if we can't use smart pointers to manage the lifecycle of the pools so that we don't have to be so careful when shutting down threadpools / engine threads.

@ankkhedia
Copy link
Contributor

@arcadiaphy Thanks for the contribution!

@mxnet-label-bot add [pr-awaiting-review, C API]

@marcoabreu marcoabreu added C API pr-awaiting-review PR is waiting for code review labels Feb 15, 2019
@ankkhedia
Copy link
Contributor

@leleamol Ping for review!

@arcadiaphy
Copy link
Member Author

@marcoabreu Yep, let me have a look on the ASAN tests.

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Feb 19, 2019

@marcoabreu @KellenSunderland
I have played with ASAN test of mlp_cpu example, there are roughly two kinds of memory leaks:

  1. When using threaded engine, the memory release is depended upon workers in engine, and thus asynchronous. When program exits, the destruction of engine will not wait for all operations including memory release to finish, leading to leaked memory. This kind of leak is not very serious since it only happens on exit, and is more like ungraceful engine cleanup.
  2. True memory leak that happens elsewhere.

To fully address the 1st leak is hard: many memories are hidden in global/thread_local singleton, so it's related to enforce a correct destruction order on static variables (the engine is also a static variable) and make engine to wait for unfinished operations. Actually, I don't find any mechanisms concerning static variable destruction order in mxnet, so I wonder why there is no problem in usage.

A easy workaround is to use naive engine in ASAN tests, if leaks still exists, then it's definitely the 2nd leak and we must fix them.

Correct me if I'm wrong, especially on static variable part.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C API pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants