batch_dot operator crash #20301
Comments
@bartekkuncer please take a look |
Any news on this? |
Hello @matteosal , sorry for the late response. I tried to reproduce the issue on master branch but had no success. Which branch are you working on? |
I have updated to the latest master and still see the crash. Are you using the same build settings I reported? |
@matteosal Almost. The only difference is that I used a newer version of MKL. I will try with yours. |
@matteosal I tried to reproduce your bug using your exact build config but was unable to. I tried your version of MKL and also a newer and older one - all worked without any problems :( Please try running your test with MKL_VERBOSE flag set e.g. |
Setting I have also tried to build with mkl 2020.1 and got the same crash + some symbol lookup issue:
Which makes me think that I'm doing something wrong for this one. But: since |
What OS and compiler are you using? |
Linux/gcc |
@matteosal Have you tried using
? Also, can you tell us the exact OS version? |
@bgawrych I rebuilt everything from scratch with both MKL 2019.4 and 2020.1 and now I see the exact same behaviour (segfault messages without symbol lookup errors). Full versions are Ubuntu 20.04 + gcc 9.3.0 |
@matteosal Are you sure that your $mkl_dir path is the proper one? In my environment I have Notice that include_dir has a different path than BLAS_LIBRARIES: lib vs include |
Yes, that's because for some reason our reference internal MKL checkout has a non-standard file layout: all includes and libraries are dumped into the same folder ( @bgawrych can you suggest another Python operator that uses MKL primitives that I can try out? It should give us more information |
I have tried to set Also I've realized that without |
@matteosal use MKL_VERBOSE=1 when you're running your reproduction -
Another operator that definitely uses MKL is LayerNorm, but MXNet must be built with the MXNET_USE_MKL_LAYERNORM=1 flag |
A few days ago @bartekkuncer added support for oneDNN batch_dot in MXNet, so with oneDNN enabled MKL is not used |
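As a sketch of the MKL_VERBOSE suggestion above: the helper below runs a Python command with `MKL_VERBOSE=1` in its environment and captures the output. The function name `run_with_mkl_verbose` and the script name in the usage note are made up for illustration; only the `MKL_VERBOSE` variable itself comes from the thread.

```python
import os
import subprocess
import sys


def run_with_mkl_verbose(script_args, extra_env=None):
    """Run `python <script_args>` with MKL_VERBOSE=1 and return (returncode, output).

    Hypothetical helper: stdout and stderr are merged so that MKL's verbose
    lines and any crash message end up in the same captured log.
    """
    env = dict(os.environ)
    env["MKL_VERBOSE"] = "1"
    if extra_env:
        # e.g. {"MKLDNN_VERBOSE": "1"} to also log oneDNN primitives
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, *script_args],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode, result.stdout
```

With a reproduction saved as, say, `repro.py` (a placeholder name), `run_with_mkl_verbose(["repro.py"])` returns the exit code together with the log. On POSIX systems a negative return code means the child was killed by a signal, so a segfault would typically show up as `-11`.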
I've rebuilt linking to this MKL (which should be version 2021.2) and still get the same crash. I have also tried building with
|
Did both of you update submodules? |
Yes, just rebuilt again after this to double check |
Might be a good idea for both of you to run this script to report the instruction sets supported too. |
Here is the result:
|
@matteosal I've got a reproduction and will try to figure out the root cause - is using oneDNN sufficient as a workaround for now? |
@matteosal These were my reproduction steps
There is an issue with DBLAS_LIBRARIES - after changing this to |
Linking to But this is also telling us that the problem is likely to be about some integer size mismatch. One way to confirm this is to run the example with the library linked to libmkl_rt specifying I'm trying to test these setups myself but I'm having other kinds of unrelated problems blocking me right now. |
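To illustrate what an integer size mismatch can look like in general (this is not MXNet or MKL code, just a sketch): MKL's LP64 interface uses 32-bit integers and its ILP64 interface uses 64-bit ones. If a caller fills an array of dimensions with 64-bit integers and the library reads that memory as 32-bit integers, the values it sees are shifted and interleaved with zeros. The helper name below is hypothetical, and little-endian byte order is assumed.

```python
import struct


def reinterpret_int64_as_int32(values):
    """Pack `values` as little-endian int64, then read the same bytes as int32.

    Returns the first len(values) 32-bit entries, i.e. what a callee expecting
    an array of 32-bit integers would see at the same address.
    """
    raw = struct.pack("<%dq" % len(values), *values)
    as_int32 = struct.unpack("<%di" % (2 * len(values)), raw)
    return list(as_int32[: len(values)])


# Three matrix dimensions written by the caller as 64-bit integers...
dims = [4, 5, 6]
# ...look like this to a callee that reads 32-bit integers:
print(reinterpret_int64_as_int32(dims))  # [4, 0, 5]
```

A zero or misplaced dimension of this kind is the sort of input that can make a BLAS routine read out of bounds, which would be consistent with a crash rather than a clean error.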
I just managed to verify this |
@matteosal Then why are you disabling it in cmake (DUSE_INT64_TENSOR_SIZE)?
This one works for me with large tensor support - tested with tests/nightly/test_large_array.py::test_nn <- modified as it doesn't work right now on master |
I was keeping Anyway, are |
I'm not aware of an explicit check for that combo |
I have this problem on CPU too. In Anaconda base environment it can run |
@chinakook What version of mxnet are you using? Can you provide us with output logs with MKLDNN_VERBOSE and MKL_VERBOSE flags set to 1? (e.g. |
@bartekkuncer I've tested your flags, and it's not relevant to mkl. In my case, Anaconda can run
import mxnet as mx

def test_dot():
    ctx = mx.cpu(0)
    # smaller ndarray can run without error
    a = mx.nd.random.uniform(shape=(536, 771, 3), ctx=ctx)
    b = mx.nd.random.uniform(shape=(3, 3), ctx=ctx)
    c = mx.nd.dot(a, b)
    mx.nd.waitall()
    print(c.shape)

if __name__ == '__main__':
    test_dot() |
@bartekkuncer I used the OpenBLAS officially offered by MXNet, and it's fixed. Thanks. |
batch_dot seems completely broken. Running this script produces:
Maybe the problem is in my build. I'm building master from source with these settings (linking to MKL 2019.4):