
MKLDNN RNN seg fault #19265

Closed
Zha0q1 opened this issue Oct 1, 2020 · 10 comments

Comments

@Zha0q1
Contributor

Zha0q1 commented Oct 1, 2020

A customer is experiencing a seg fault when feeding a large input to the MKLDNN LSTM. I have reduced the code to this:

import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon import nn, rnn, Trainer

hidden_size = 30
num_embed = 100
vocab_size = 13028  # len(vocab.token_to_idx.keys())

inp = nd.random.uniform(0, vocab_size, (16758, 500))
print(inp)

context = mx.cpu()

model = nn.Sequential()
model.add(nn.Embedding(vocab_size, num_embed),  # Embedding layer
          rnn.LSTM(hidden_size, num_layers=1, bidirectional=True),  # Recurrent layer (bidirectional)
          nn.Dense(3))  # Output layer

model.collect_params().initialize(mx.init.Xavier(), ctx=context)

val_predictions = model(inp)
nd.waitall()
print(val_predictions)

I think this is some sort of out-of-memory issue, because if we shrink the input (the first dim of inp) there is no seg fault. Still, shall we add an error message here so that users are notified to reduce the input size?
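For illustration only, a user-side guard of the kind suggested here might look like the sketch below; MAX_SAFE_ROWS is a made-up placeholder, not a known MXNet or oneDNN limit.

# Hypothetical user-side guard; MAX_SAFE_ROWS is a placeholder, not a real limit.
MAX_SAFE_ROWS = 4096
if inp.shape[0] > MAX_SAFE_ROWS:
    raise ValueError(
        "Input of shape %s is likely too large for the MKLDNN LSTM path; "
        "reduce the first dimension or split it into smaller batches." % (inp.shape,))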

I also noticed the same input runs fine with export MXNET_USE_MKLDNN_RNN=0, but that is 3x slower than the MKLDNN implementation. Another suggestion I made to the customer was to find a magic number for the seg-fault threshold and run multiple batches smaller than that (the customer was trying to forward-pass the entire validation set), but that is also a pretty hacky solution. Better yet, maybe we can optimize the MKLDNN implementation to handle data that is currently too large?
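A rough sketch of that batching workaround, slicing along the first dimension of inp as described above (the chunk size is arbitrary, not a verified safe threshold):

# Sketch of the batching workaround: slice the first dimension of inp into
# chunks and run the forward pass per chunk. batch_size is arbitrary.
batch_size = 1024
outputs = []
for start in range(0, inp.shape[0], batch_size):
    out = model(inp[start:start + batch_size])
    out.wait_to_read()  # sync before the next chunk so memory can be released
    outputs.append(out)
val_predictions = nd.concat(*outputs, dim=0)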

@PatricZhao

@Zha0q1
Contributor Author

Zha0q1 commented Oct 1, 2020

seg fault:

Segmentation fault: 11

terminate called without an active exception
Aborted (core dumped)

GDB:


Thread 9 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffbac26700 (LWP 18164)]
bt
0x00007fff9c0743f0 in ?? ()
(gdb) bt
#0  0x00007fff9c0743f0 in ?? ()
#1  0x00007fffe5e905ec in float** dnnl::impl::memory_tracking::grantor_t::get<float*>(unsigned int const&) const
    () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#2  0x00007fffe5e93697 in dnnl::impl::cpu::_ref_rnn_common_t<(dnnl_prop_kind_t)64, (dnnl_data_type_t)3, (dnnl_data_type_t)3, (dnnl_data_type_t)3>::execute_(dnnl::impl::exec_ctx_t const&) const ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#3  0x00007fffe5d05de9 in dnnl::impl::cpu::_ref_rnn_common_t<(dnnl_prop_kind_t)64, (dnnl_data_type_t)3, (dnnl_data_type_t)3, (dnnl_data_type_t)3>::execute(dnnl::impl::exec_ctx_t const&) const ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#4  0x00007fffe5890788 in dnnl_primitive_execute ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#5  0x00007fffe0a5eb1a in mxnet::MKLDNNStream::Submit(bool) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#6  0x00007fffe0b13343 in mxnet::op::MKLDNNRnnOp::Forward(mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#7  0x00007fffe5306633 in mxnet::op::RNNStatefulComputeExCPU(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#8  0x00007fffe4f503fd in mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#9  0x00007fffe4f506cd in std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::O---Type <return> to continue, or q <return> to quit---
pStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#2}>::_M_invoke(std::_Any_data const&, mxnet::RunContext) () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#10 0x00007fffe501d754 in std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext, mxnet::engine::CallbackOnComplete) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#11 0x00007fffe50180a5 in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#12 0x00007fffe502a294 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#13 0x00007fffe5016934 in std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#14 0x00007fffded79421 in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
#15 0x00007ffff7bbd6db in start_thread (arg=0x7fffbac26700) at pthread_create.c:463
#16 0x00007ffff78e6a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@sandeep-krishnamurthy
Contributor

@TaoLv @ciyongch @PatricZhao - Hello guys. Can you please help with this issue? We saw at least 2 production users impacted by this; USE_MKLDNN=0 was a temporary fix, but performance is really bad, as expected. This is a blocker.

@sandeep-krishnamurthy
Contributor

@anko-intel

@mozga-intel
Contributor

Thanks, @Zha0q1 @sandeep-krishnamurthy! I'll have a look at this issue.

@mozga-intel
Contributor

mozga-intel commented Oct 5, 2020

@Zha0q1 Could you please give me a few more details about this issue, such as the branch name and its commit SHA, and which version of MKLDNN you have (commit SHA)? Thanks!

@Zha0q1
Contributor Author

Zha0q1 commented Oct 5, 2020

I am using MXNet 1.7 (https://github.com/apache/incubator-mxnet/releases/tag/1.7.0) installed via pip install mxnet. The machine was a c5.9xlarge EC2 instance running the DLAMI on Ubuntu 18.04.
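For reference, a quick way to collect that information from a pip install (a sketch; mx.runtime.Features() reports build flags, not the exact MKLDNN commit):

import mxnet as mx

print(mx.__version__)                    # MXNet version, e.g. 1.7.0
features = mx.runtime.Features()
print(features.is_enabled('MKLDNN'))     # True if this build includes MKLDNN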

@mozga-intel
Contributor

mozga-intel commented Oct 13, 2020

Hi,

When running a test model (a simple imitation of the LSTM model above) with a large input tensor, for example (20758, 500), I can see that ~170 GB of memory is requested for scratchpad computations, and that the global scratchpad memory flag is always true. As a result, depending on the oneDNN version, I get the following error messages:

  1. With oneDNN v1.3: Segmentation fault: 11
  2. With oneDNN v1.6: mxnet.base.MXNetError: MXNetError: could not create a primitive

This error only shows up for a large LSTM tensor; a step-by-step reproduction casts light on the issue. Looking at the code, the standard vanilla-LSTM algorithm in MKLDNN allocates a block of memory of sizeof(float) * work_space bytes, where work_space is an offset. For the given test (input: 20758, 500), ~170 GB of memory is allocated for scratchpad computation: work_space = 47952392192, so 47952392192 * sizeof(float) = 191809568768 bytes ~ 170 GB. If there is not enough memory, you get one of the two errors above. MKLDNN primitives can use either individual memory or a global memory buffer for intermediate computations; the first option usually gives better performance, since the memory will most likely stay attached to a thread, while the second one can save a lot of memory.

For brevity:
The input tensor is T x N x C; for the example (10758, 500), T is 10758 and C is 500. That means we need at least 4 * 10758 * 500 * 500 * 4 bytes ~ 40 GB of workspace, or maybe more. Roughly, the workspace is comparable to n_layers * mb * n_time_stamps * 4 (gates) * max(sic, slc, dhsc)^2. In both oneDNN versions (1.3 and 1.6) the workspace (i.e. the LSTM scratchpad) is booked as book<float>(num_elems, ...), so that count is multiplied by sizeof(float) = 4 again, giving ~40 GB * 4 ~ 160 GB. No upper bound on the input tensor size is clearly defined; in practice it is limited only by the physical memory of the machine.
Approximately, it should instead be defined as follows:

  1. the workspace should be booked as <uint8_t>, i.e. ~1 byte per element [potentially];
  2. the workspace should be limited only by the total number of elements of the given tensor.

The upper bound on the input tensor (i.e. the upper bound of the LSTM) is then roughly
n^2 * m = memory_space / (16 bytes)
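For reference, a quick back-of-the-envelope check of the arithmetic above (a sketch only; the variable names are mine and the exact oneDNN workspace formula is internal and version-dependent):

# Back-of-the-envelope check of the sizes quoted above; this only
# reproduces the arithmetic from the comment, not the oneDNN formula.
gates = 4
T, N, C = 10758, 500, 500                # dimensions used in the example above

elems = gates * T * N * C                # ~10.76e9 workspace entries
expected_bytes = elems * 4               # * sizeof(float): ~43 GB
inflated_bytes = expected_bytes * 4      # extra ~4x factor: ~172 GB

scratchpad_bytes = 47952392192 * 4       # reported offset * sizeof(float): ~192e9 bytes (~170-180 GB)
print(expected_bytes, inflated_bytes, scratchpad_bytes)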

@mozga-intel
Contributor

mozga-intel commented Oct 20, 2020

Hi @Zha0q1
There’s a bug in oneDNN LSTM forward inference that results in using ~4x more memory for LSTM workspace in inference cases.
Could you please tell me whether this change (see the table below) is acceptable and whether it resolves your issue?

| (dim: 20756, 500) | Before | After |
| --- | --- | --- |
| Total size of memory needed to allocate the LSTM tensor | 230 GB (~4x more memory) | 56 GB (~4x less memory) |

@Zha0q1
Contributor Author

Zha0q1 commented Oct 20, 2020

@mozga-intel Thanks for your investigation! Yes, this improvement is huge and will help our users who run inference tasks on pre-trained models. It would be great to include this fix in the next oneDNN release.

@pengzhao-intel
Contributor

@TaoLv @ciyongch @PatricZhao - Hello guys. Can you please help with this issue? We saw at least 2 production users impacted by this; USE_MKLDNN=0 was a temporary fix, but performance is really bad, as expected. This is a blocker.

Sorry about that; the team is working on fixing any possible issues. Feel free to ping us about any issue :)
