MKLDNN RNN seg fault #19265
seg fault:
GDB:
@TaoLv @ciyongch @PatricZhao - Hello guys. Can you please help with this issue? We saw at least 2 production users impacted by this, and USE_MKLDNN=0 was a temporary fix, but performance is really bad as expected. This is a blocker.
Thanks, @Zha0q1 @sandeep-krishnamurthy! I'll have a look at this issue.
@Zha0q1 Could you please share a few more details about this issue, such as the branch name, its commit sha, and which version of MKLDNN you have (commit sha)? Thanks!
I am using mxnet 1.7 (https://github.com/apache/incubator-mxnet/releases/tag/1.7.0) from
Hi,

This error is only visible for a large LSTM tensor, and a step-by-step reproduction casts light on the issue. If we have a look at the code, a few things become visible there. First off, the standard vanilla-LSTM algorithm of MKLDNN allocates a block of memory that can turn out to be sufficient or insufficient; the block size is computed from an equation based on an upper bound of the given tensor (the upper bound of the LSTM), and for a large tensor that bound leads to an insufficient allocation.
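For illustration only: the actual upper-bound equation from the MKLDNN code is not quoted above, so the formula, names, and sizes below are assumptions, not oneDNN's real code. The sketch just shows how a workspace-size product evaluated in 32-bit integer arithmetic can wrap around for a large LSTM tensor, which would produce an insufficient allocation of the kind described:

```python
# Hypothetical illustration -- NOT the actual oneDNN workspace formula.
# Assume the LSTM scratch/workspace upper bound is roughly
#   seq_len * batch * n_gates * hidden_size * sizeof(float)
# and that the product is accumulated in a 32-bit signed integer.
import numpy as np

seq_len, batch, n_gates, hidden = 80000, 64, 4, 512   # made-up sizes
elem_bytes = 4                                         # float32

exact_bytes = seq_len * batch * n_gates * hidden * elem_bytes   # Python int, no overflow
wrapped = int(np.prod(np.array([seq_len, batch, n_gates, hidden, elem_bytes],
                               dtype=np.int32), dtype=np.int32))

print(f"exact size : {exact_bytes:>14,d} bytes")   # ~42 GB
print(f"int32 value: {wrapped:>14,d} bytes")       # wraps around to a negative garbage value
```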
Hi @Zha0q1
@mozga-intel Thanks for your investigation! Yes, this improvement is huge and will help our users who run inference tasks on pre-trained models. It would be great to include this fix in the next oneDNN release.
Sorry for that; the team is working on fixing any possible issues. Feel free to ping us about any issue :)
A customer is experiencing a seg fault when feeding a large input into the MKLDNN LSTM. I have reduced the code to this:
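A minimal sketch of the kind of reproduction described, assuming an MXNet 1.7 gluon LSTM and a hypothetical input shape with a large first dimension (the exact snippet and shapes are not shown here):

```python
# Minimal sketch, assuming MXNet 1.7 gluon; shapes below are placeholders,
# chosen only so that the first dim of `inp` is large.
import mxnet as mx
from mxnet import nd
from mxnet.gluon import rnn

lstm = rnn.LSTM(hidden_size=512, num_layers=2)
lstm.initialize()

# Large (seq_len, batch, feature) input; with the MKL-DNN RNN path enabled,
# this is the kind of call that reportedly seg faults, while a smaller first
# dim runs fine.
inp = nd.random.uniform(shape=(50000, 64, 512))
out = lstm(inp)
out.wait_to_read()
print(out.shape)
```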
I think this is some sort of out-of-memory issue, because if we shrink the input (the first dim of `inp`) there is no seg fault. Still, shall we add an error message here so that users are told to reduce the input size? I also noticed the same input runs fine with `export MXNET_USE_MKLDNN_RNN=0`, but that is 3x slower than the MKLDNN implementation. Another suggestion I made to the customer was to find a magic-number threshold for the seg fault and run multiple batches smaller than that (the customer was trying to forward-pass the entire validation set), but this is also a pretty hacky solution; a rough sketch of that workaround is included below. So maybe, better yet, we can optimize the MKLDNN implementation to handle data that is currently too large? @PatricZhao
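For completeness, a rough sketch of the batching workaround mentioned above; the chunk size is a made-up placeholder, and it assumes the leading axis of the input indexes independent samples (e.g. a gluon LSTM created with layout='NTC'):

```python
# Rough sketch of the chunking workaround; MAX_ROWS is a made-up threshold,
# not a known safe limit, and `lstm`/`inp` are assumed to exist as above.
from mxnet import nd

MAX_ROWS = 10000  # hypothetical "safe" first-dim size found by trial and error

def forward_in_chunks(lstm, inp, max_rows=MAX_ROWS):
    """Run the model over slices along the first axis and stitch the outputs."""
    outputs = []
    for start in range(0, inp.shape[0], max_rows):
        out = lstm(inp[start:start + max_rows])
        out.wait_to_read()            # force computation so memory use stays bounded
        outputs.append(out)
    return nd.concat(*outputs, dim=0)

# Usage: result = forward_in_chunks(lstm, inp)
```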