This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Large tensor nightly test memory failures #16447

Closed
ChaiBapchya opened this issue Oct 11, 2019 · 2 comments
ChaiBapchya commented Oct 11, 2019

Currently, running the large tensor tests (CPU-specific) produces a memory footprint that exceeds the memory available on the machine where the nightly tests run (a C5 instance with <150G).

Thus we see errors like

ERROR: test_large_array.test_ndarray_random_multinomial
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/incubator-mxnet/tests/python/unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/tests/nightly/test_large_array.py", line 126, in test_ndarray_random_multinomial
    assert a[0][0][0][0] >= 0
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 401, in __bool__
    return bool(self.asscalar())
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2524, in asscalar
    return self.asnumpy()[0]
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2506, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [07:09:24] ../src/storage/./cpu_device_storage.h:75: Failed to allocate CPU Memory
Stack trace:
  [bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f7e900bf7e2]
  [bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::storage::CPUDeviceStorage::Alloc(mxnet::Storage::Handle*)+0x91) [0x7f7e92586ef1]
  [bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x5a) [0x7f7e9258634a]
  [bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::NDArray::CheckAndAlloc() const+0x10d) [0x7f7e900c16ad]
  [bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(+0x935ef8) [0x7f7e90230ef8]
  [bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x1ce) [0x7f7e9024690e]
  [bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f7e90246be7]
  [bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(+0x8a67ae) [0x7f7e901a17ae]
  [bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x500) [0x7f7e901ad210]


-------------------- >> begin captured logging << --------------------
tests.python.unittest.common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1814393275 to reproduce.
--------------------- >> end captured logging << ---------------------

Attempts made:
Despite calling nd.waitall() and empty_cache(), the memory footprint keeps growing gradually as nosetests runs one test after another. For reference, see this commit:
c48f70f
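
One way to quantify the gradual growth described above is to sample the process's resident set size (RSS) around each test. The following is only a minimal sketch and is not part of the MXNet test suite: `current_rss_bytes` and `rss_growth` are hypothetical helper names, and reading `/proc/self/statm` assumes a Linux host.

```python
import gc
import os


def current_rss_bytes():
    """Return this process's resident set size in bytes (Linux only)."""
    # The second field of /proc/self/statm is the resident size in pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_growth(fn, cleanup=None):
    """Run fn, then the optional cleanup, and return the RSS change in bytes.

    Hypothetical helper: `cleanup` stands in for whatever release step is
    being evaluated, e.g. rss_growth(test_fn, cleanup=mx.nd.waitall).
    """
    before = current_rss_bytes()
    fn()
    if cleanup is not None:
        cleanup()
    gc.collect()  # drop unreachable Python objects before sampling again
    return current_rss_bytes() - before
```

A consistently positive result across consecutive tests, even with a cleanup step passed in, would match the behavior reported here. RSS is a coarse signal, since the allocator may hold on to freed pages, so a single reading should be interpreted with care.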

TODO:

  • C API to release CPU memory (similar to ReleaseAll)
    @anirudh2290 pointed out that empty_cache() is a no-op for the CPU context,
    so a correct implementation is needed for CPU-specific use cases.

Large tensor tests are maintained here:

https://github.com/apache/incubator-mxnet/blob/master/tests/nightly/test_large_array.py
https://github.com/apache/incubator-mxnet/blob/master/tests/nightly/test_large_vector.py

@mxnet-label-bot

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Test

@ChaiBapchya

Duplicate of #14980
