Using MKL causes C++ layer blow up while running lstm_bucketing example #5314
Comments
find the problem
|
@glingyan Thank you! I am training a char-level RNN on an Intel Xeon E5-2666 v3 (Haswell). |
@sergeykolychev Sure, there will be a big upstream patch in the next few days. |
Does MKL work on Mac? If it does, then we should change the tutorial. @glingyan Do users need a full MKL installation to use BLAS=mkl? |
@piiswrong I tried to use MKL on Mac and it did not compile; it has a different API from the mklml that we use on Linux. I was also unable to find mklml for Mac. I saw reports on Google that some individuals were able to compile mklml from source on OSX, but I have not pursued that route yet. |
I think BLAS defaults to apple in osx.mk. Or at least it used to. |
@piiswrong Yes, it does default to 'apple' on OSX, which is the correct behavior. However, we are probably doing a disservice to Linux by not defaulting to mklml. Even if widespread usage of MKL leads to some issues, that's a good thing, because they'll get fixed quickly, seeing how responsive @glingyan is. |
@piiswrong, @glingyan I want to apologize and correct myself: the 3.46 I was getting with OpenBLAS was related to problems on my end, not to OpenBLAS. MKL is still faster than OpenBLAS, but the difference is not drastic. |
@sergeykolychev There will be a fix for convergence on some models tonight or tomorrow. Please wait for my patch; upstream testing is ongoing. |
@glingyan thank you, will wait. |
@sergeykolychev please check preview at https://github.com/glingyan/mxnet |
@zhenlinluo for the MKL on Mac issue. |
@glingyan, the issues are not fixed; here is what I see. My code is a really basic char-level LSTM RNN network and the data is tiny Shakespeare. It's written in Perl, but frankly I do not think that matters.
And this is the same code with USE_BLAS=openblas
As you can see, when MKL is used it starts twice as fast as OpenBLAS, but then in the middle of the second epoch it slows to a crawl, and the perplexity metric gets stuck at ~11. |
@sergeykolychev Why was this closed? Is it no longer a problem for you? |
@glingyan Sorry, I thought you did not need it anymore. Of course the issue still exists if you have not added new code since that preview. I have moved my calculations to a GPU box, though, so I am not really concerned with MKL right now. |
@sergeykolychev I will help debug. Where should I set up the environment, or is using example/rnn enough? |
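For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is usually defined, assuming the standard formulation (the exponential of the mean negative log-likelihood of the correct tokens). The function name and the probabilities are hypothetical and are not taken from the run described above.

```python
import math

def perplexity(correct_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical example: a model that assigns probability 1/11 to every
# correct character has a perplexity of about 11, i.e. it is roughly as
# uncertain as a uniform guess over 11 characters.
stuck = perplexity([1.0 / 11] * 100)
print(round(stuck, 6))
```

Under this definition, a value that stays near 11 means the average per-character loss is staying near ln(11), about 2.4 nats.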
@glingyan Thanks!
From the output below you can see that in the middle of the second epoch the performance starts to degrade and the network does not converge.
|
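One way to pin down where a slowdown like this begins is to compare each logged throughput reading against an early-run baseline. The following is only a sketch: the helper name, the thresholds, and the samples/sec values are all hypothetical and are not taken from the run above.

```python
from statistics import median

def first_slowdown(speeds, warmup=5, ratio=0.5):
    """Return the index of the first reading below ratio * baseline, or None.

    The baseline is the median of the first `warmup` readings.
    """
    if len(speeds) <= warmup:
        return None
    baseline = median(speeds[:warmup])
    for i, speed in enumerate(speeds[warmup:], start=warmup):
        if speed < ratio * baseline:
            return i
    return None

# Hypothetical samples/sec readings, for illustration only.
readings = [400, 410, 395, 405, 402, 398, 390, 150, 60, 20]
slow_at = first_slowdown(readings)
print(slow_at)
```

Knowing the batch index where throughput first drops makes it easier to check whether the slowdown and the stalled perplexity start at the same point.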
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks! |
For bugs or installation issues, please provide the following information.
The more information you provide, the more likely people will be able to help you.
WARNING: discarded 89 sentences longer than the largest bucket.
WARNING: discarded 4 sentences longer than the largest bucket.
[01:06:12] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [01:06:12] src/operator/./mkl/mkl_concat-inl.h:196: Check failed: e == E_SUCCESS (-1 vs. 0)
Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fa00e11cc1c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op11MKLConcatOpIN7mshadow3cpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0xc10) [0x7fa00ecfd950]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(+0xec2092) [0x7fa00ed9a092]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS08OprBlockE+0x8c) [0x7fa00ed5531c]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x2e) [0x7fa00ed57bbe]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fa006208c80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fa01cadf6ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa01c81582d]
[01:06:12] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [01:06:12] src/engine/./threaded_engine.h:336: [01:06:12] src/operator/./mkl/mkl_concat-inl.h:196: Check failed: e == E_SUCCESS (-1 vs. 0)
Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fa00e11cc1c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op11MKLConcatOpIN7mshadow3cpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0xc10) [0x7fa00ecfd950]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(+0xec2092) [0x7fa00ed9a092]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS08OprBlockE+0x8c) [0x7fa00ed5531c]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x2e) [0x7fa00ed57bbe]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fa006208c80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fa01cadf6ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa01c81582d]
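The two "discarded ... sentences longer than the largest bucket" warnings at the top of this log come from the bucketing step, not from the MKL failure. Here is a rough sketch of what such a warning usually means, assuming the common scheme in which each sentence goes into the smallest bucket that fits it. The function name, bucket sizes, and sentences are hypothetical and are not copied from the example script.

```python
import bisect

def assign_buckets(sentences, buckets):
    """Place each sentence in the smallest bucket whose length fits it.

    Returns (per_bucket, discarded): per_bucket maps a bucket length to the
    sentences assigned to it, and discarded counts the sentences that are
    longer than the largest bucket.
    """
    buckets = sorted(buckets)
    per_bucket = {b: [] for b in buckets}
    discarded = 0
    for sent in sentences:
        idx = bisect.bisect_left(buckets, len(sent))
        if idx == len(buckets):
            # No bucket is long enough, so the sentence is dropped.
            discarded += 1
            continue
        per_bucket[buckets[idx]].append(sent)
    return per_bucket, discarded

# Hypothetical token-id sentences of lengths 5, 10, 11, 35 and 70.
sents = [[1] * n for n in (5, 10, 11, 35, 70)]
per_bucket, dropped = assign_buckets(sents, [10, 20, 30, 40, 50, 60])
if dropped:
    print("WARNING: discarded %d sentences longer than the largest bucket." % dropped)
```

In this sketch only the length-70 sentence is dropped, so a warning of this kind is expected whenever the corpus contains a few very long sentences.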
Environment info
Operating System:
Linux Ubuntu 16.04
Compiler:
gcc 4.8
Package used (Python/R/Scala/Julia):
Python
MXNet version:
0.9.4
Or if installed from source:
MXNet commit hash (git rev-parse HEAD): 55bb4cd
If you are using python package, please provide
Python version and distribution:
python2.7
If you are using R package, please provide R sessionInfo():
Error Message:
Please paste the full error message, including stack trace.
Minimum reproducible example
if you are using your own code, please provide a short script that reproduces the error.
~/mxnet/example/rnn$ python lstm_bucketing.py
Steps to reproduce
or if you are running standard examples, please provide the commands you have run that lead to the error.
What have you tried to solve it?