Check failed: format != mkl_mem_->GetFormat() (5 vs. 5) #10809

dwSun · 2018-05-04T09:55:16Z

Description

MXNet crashes when training a model.

Using the code from this tutorial, I tried to train my own model with MobileNetV2, but it crashed with mxnet-mkl-1.2.0b20180503 from PyPI.
With mxnet-mkl-1.1.0 from PyPI, the same code works.

Batch sizes 32 and 16 reproduce this error; others, like 8 or 32, do not seem to. A smaller network does not reproduce it either.
I am not sure whether this error is related to PR #10317.

This may also be the same error as issue #10807.

Environment info (Required)

The code is attached as crash.zip.
Run it with:

python3 fashion.py

Package used (Python/R/Scala/Julia):

% pip3 list
Package         Version       
--------------- --------------
certifi         2018.4.16     
chardet         3.0.4         
graphviz        0.8.3         
idna            2.6           
mxnet-mkl       1.2.0b20180503
numpy           1.14.3        
pandas          0.22.0        
pip             10.0.1        
pkg-resources   0.0.0         
python-dateutil 2.7.2         
pytz            2018.4        
requests        2.18.4        
setuptools      39.1.0        
six             1.11.0        
urllib3         1.22          
wheel           0.31.0
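
Several MXNet builds (for example mxnet-mkl and mxnet-cu91mkl) can be installed in the same or in different site-packages directories, so it can help to list them programmatically before comparing results. The following is a minimal sketch, not part of the original report; it assumes Python 3.8 or newer for `importlib.metadata` (the environment above runs Python 3.6, where `pip3 list` remains the simpler option), and the helper name `installed_mxnet_builds` is hypothetical.

```python
from importlib import metadata


def installed_mxnet_builds():
    """Return {distribution name: version} for every installed mxnet* package.

    Hypothetical helper for illustration; it only inspects package metadata
    and never imports MXNet itself.
    """
    builds = {}
    for dist in metadata.distributions():
        # Some broken installs carry no usable metadata; skip them.
        name = dist.metadata.get("Name") if dist.metadata else None
        if name and name.lower().startswith("mxnet"):
            builds[name] = dist.version
    return builds


if __name__ == "__main__":
    for name, version in sorted(installed_mxnet_builds().items()):
        print(f"{name}=={version}")
```

If more than one entry is printed, the interpreter may be loading a different build than the one intended.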

Error Message:

% python3 fashion.py 
[17:28:49] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 57344 bytes with malloc directly
[17:28:49] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 4096 bytes with malloc directly
[17:28:49] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 172032 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 57344 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 4096 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 172032 bytes with malloc directly
Epoch 0, training loss: 2.55, validation loss: 2.31
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 57344 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 172032 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 1638400 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 1638400 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 57344 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 4096 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 172032 bytes with malloc directly
Epoch 1, training loss: 2.56, validation loss: 2.35
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 57344 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 172032 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 1638400 bytes with malloc directly
[17:28:50] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 1638400 bytes with malloc directly
[17:28:51] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 57344 bytes with malloc directly
[17:28:51] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 4096 bytes with malloc directly
[17:28:51] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 172032 bytes with malloc directly
Traceback (most recent call last):
  File "fashion.py", line 71, in <module>
    valid_loss = cumulative_valid_loss.asscalar()/valid_samples
  File "/home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1894, in asscalar
    return self.asnumpy()[0]
  File "/home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:28:51] src/ndarray/ndarray.cc:351: Check failed: format != mkl_mem_->GetFormat() (5 vs. 5) 

Stack trace returned 10 entries:
[bt] (0) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x17009d) [0x7fba25e2f09d]
[bt] (1) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x170468) [0x7fba25e2f468]
[bt] (2) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2a4a1b8) [0x7fba287091b8]
[bt] (3) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2a4a29e) [0x7fba2870929e]
[bt] (4) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2899644) [0x7fba28558644]
[bt] (5) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x289d151) [0x7fba2855c151]
[bt] (6) /home/david/.virtualenvs/mkl-dnn/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2899d0b) [0x7fba28558d0b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbc90) [0x7fba1ba04c90]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x75aa) [0x7fba37df35aa]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fba36f3ecbf]


TaoLv · 2018-05-04T13:27:26Z

@dwSun Thanks for reporting this. I will take a look and get back to you soon.

pengzhao-intel · 2018-05-08T14:07:53Z

@dwSun, Tao's PR is merged.
Could you try the case again?

dwSun · 2018-05-08T14:24:05Z

Tested with mxnet-mkl-1.2.0b20180508: fashion.py from this issue works well.
But with mxnet-cu91mkl-1.2.0b20180507, I got this error:

[14:53:41] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

Segmentation fault: 11


Segmentation fault: 11


Segmentation fault: 11


Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/david/.virtualenvs/mkl-dnn/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2ec7b2) [0x7f1db68a87b2]
[bt] (1) /home/david/.virtualenvs/mkl-dnn/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2dafa1e) [0x7f1db936ba1e]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f1df499ff20]
[bt] (3) /lib64/ld-linux-x86-64.so.2(+0xcac8) [0x7f1df5b71ac8]
[bt] (4) /lib64/ld-linux-x86-64.so.2(+0x150bd) [0x7f1df5b7a0bd]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0x6f) [0x7f1df4ac82df]
[bt] (6) /lib64/ld-linux-x86-64.so.2(+0x147ca) [0x7f1df5b797ca]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(+0x1663ad) [0x7f1df4ac73ad]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0x6f) [0x7f1df4ac82df]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(_dl_catch_error+0x2f) [0x7f1df4ac836f]

I am using Ubuntu 18.04 and CUDA 9.1, installed with:

sudo aptitude install nvidia-cuda-toolkit --without-recommends

Should I start a new issue?
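
The log above names the environment variable `MXNET_CUDNN_AUTOTUNE_DEFAULT` as the way to skip the cuDNN performance tests. A minimal sketch of setting it from Python is shown here; whether skipping autotuning avoids the segmentation fault is not established in this thread, so treat it only as a way to rule that step out. Setting the variable before importing MXNet is a cautious assumption, not a documented requirement from this issue.

```python
import os

# Disable cuDNN convolution autotuning, as suggested by the log message.
# Set early, before MXNet is imported, to be on the safe side (assumption).
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # left commented out so the sketch runs without MXNet installed
```

The same effect can be had from the shell by exporting the variable before running `python3 fashion.py`.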

roywei · 2018-05-09T21:23:18Z

@sandeep-krishnamurthy could you help add the MKL label? Thanks

pengzhao-intel · 2018-06-06T02:02:56Z

@dwSun I suggest starting a new thread. From the log, I don't see anything related to MKL-DNN.

dwSun · 2018-06-07T15:24:04Z

Tested with mxnet-cu91mkl (1.2.0) from PyPI: fashion.py from this issue works well.
But with mxnet-mkl 1.2.0, I got this error:

[23:18:18] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 28672 bytes with malloc directly
[23:18:18] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 4096 bytes with malloc directly
[23:18:18] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 86016 bytes with malloc directly
[23:18:19] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 28672 bytes with malloc directly
[23:18:19] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 4096 bytes with malloc directly
[23:18:19] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 86016 bytes with malloc directly
Traceback (most recent call last):
  File "fashion.py", line 71, in <module>
    valid_loss = cumulative_valid_loss.asscalar()/valid_samples
  File "/home/david/.local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1894, in asscalar
    return self.asnumpy()[0]
  File "/home/david/.local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/david/.local/lib/python3.6/site-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:18:19] src/ndarray/ndarray.cc:721: Check failed: !IsMKLDNNData() We can't generate TBlob for MKLDNN data. Please use Reorder2Default() to generate a new NDArray first

Stack trace returned 10 entries:
[bt] (0) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x17ec9d) [0x7fcb3104dc9d]
[bt] (1) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x17f068) [0x7fcb3104e068]
[bt] (2) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x293d945) [0x7fcb3380c945]
[bt] (3) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3e7b73) [0x7fcb312b6b73]
[bt] (4) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3e9cd8) [0x7fcb312b8cd8]
[bt] (5) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28174e6) [0x7fcb336e64e6]
[bt] (6) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x27a3ac2) [0x7fcb33672ac2]
[bt] (7) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x27a3ac2) [0x7fcb33672ac2]
[bt] (8) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x27a3ac2) [0x7fcb33672ac2]
[bt] (9) /home/david/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x27a3ac2) [0x7fcb33672ac2]
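
The two failures reported in this thread come from different checks in `src/ndarray/ndarray.cc` (line 351 in the first report, line 721 here). A small sketch for pulling the source location and failed condition out of such messages, so that reports can be compared mechanically, might look like the following. The function name `parse_check_failed` and the regular expression are illustrative assumptions, not part of MXNet.

```python
import re

# Matches "<file>:<line>: Check failed: <detail>" as it appears in the logs above.
CHECK_FAILED = re.compile(
    r"(?P<file>[\w./-]+):(?P<line>\d+): Check failed: (?P<detail>.*)"
)


def parse_check_failed(message):
    """Return (file, line, detail) for a 'Check failed' message, or None."""
    match = CHECK_FAILED.search(message)
    if match is None:
        return None
    return match.group("file"), int(match.group("line")), match.group("detail").strip()


# The two error strings quoted in this issue.
first = ("[17:28:51] src/ndarray/ndarray.cc:351: Check failed: "
         "format != mkl_mem_->GetFormat() (5 vs. 5) ")
second = ("[23:18:19] src/ndarray/ndarray.cc:721: Check failed: !IsMKLDNNData() "
          "We can't generate TBlob for MKLDNN data. "
          "Please use Reorder2Default() to generate a new NDArray first")

if __name__ == "__main__":
    for msg in (first, second):
        print(parse_check_failed(msg))
```

Comparing the parsed line numbers makes it clear that the second crash is not a recurrence of the original check at line 351.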

This issue is weird...
I am totally confused...
Maybe my script is cursed...
w(ﾟДﾟ)w

TaoLv · 2018-06-07T15:56:47Z

fashion.py works well with the master branch built with:
make -j20 USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl USE_PROFILER=1
I will try the mxnet-mkl 1.2 release and get back to you later.

TaoLv · 2018-06-11T15:24:55Z

@dwSun I tried mxnet-mkl 1.2.0 and it does seem to have some issues (though the error message on my side is not the same as yours). PR #11212 is trying to push some MKL-DNN related fixes into the 1.2.0 branch, and I find that fashion.py works well with the code from #11212. I hope it can be merged soon so that you can try it out. Sorry for the inconvenience.

pengzhao-intel · 2018-08-22T01:35:37Z

The fix is merged. I think this bug can be closed. @dwSun

TaoLv mentioned this issue May 4, 2018

Fix Reorder2Default #10810

Merged

sandeep-krishnamurthy added the MKL label May 27, 2018

sandeep-krishnamurthy closed this as completed Aug 22, 2018
