This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

FullyConnected op with float64 and MKL-DNN fails if gradient are not set in a specific way #15767

Closed
matteosal opened this issue Aug 6, 2019 · 13 comments · Fixed by #15853

Comments

@matteosal
Contributor

Description

With MKL-DNN and float64 arrays, getting the output of a FullyConnected op after a forward pass fails unless grad_req is set to something other than 'null' and explicit gradient arrays are provided (even though no backward pass is involved).

Environment info (Required)

----------Python Info----------
Version      : 3.7.2
Compiler     : GCC 7.3.0
Build        : ('default', 'Dec 29 2018 06:19:36')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.1
Directory    : /opt/Anaconda/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /home/matteo/Git/mxnet/python/mxnet
Commit hash file "/home/matteo/Git/mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✔ F16C
✔ JEMALLOC
✖ BLAS_OPEN
✔ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✖ LAPACK
✔ MKLDNN
✖ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✖ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
Platform     : Linux-4.15.0-55-generic-x86_64-with-debian-buster-sid
system       : Linux
node         : mongolius
release      : 4.15.0-55-generic
version      : #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:            3
CPU MHz:             2700.253
CPU max MHz:         3500,0000
CPU min MHz:         800,0000
BogoMIPS:            5184.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0117 sec, LOAD: 0.8935 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0599 sec, LOAD: 2.1901 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1028 sec, LOAD: 0.9832 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0657 sec, LOAD: 1.2597 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0380 sec, LOAD: 0.8543 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0395 sec, LOAD: 0.4625 sec.

Package used: python

Build info (Required if built from source)

Compiler: gcc

MXNet commit hash: 3255d87

Build config: plain config.mk with USE_OPENCV=0

Error Message:

Traceback (most recent call last):
  File "script.py", line 30, in <module>
    print(ex.outputs[0])
  File "/home/matteo/Git/mxnet/python/mxnet/ndarray/ndarray.py", line 194, in __repr__
    return '\n%s\n<%s %s @%s>' % (str(self.asnumpy()),
  File "/home/matteo/Git/mxnet/python/mxnet/ndarray/ndarray.py", line 2096, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/matteo/Git/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:46:52] src/operator/subgraph/mkldnn/../.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h:217: unknown type for MKLDNN
Stack trace:
  [bt] (0) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7fdeee8a4fc3]
  [bt] (1) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::SgMKLDNNFCOp::Forward(mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x902) [0x7fdeee92ece2]
  [bt] (2) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::StatefulComputeExExecutor::Run(mxnet::RunContext, bool)+0x2d1) [0x7fdef0e0ec81]
  [bt] (3) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2b00ead) [0x7fdef0dcaead]
  [bt] (4) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2b0103f) [0x7fdef0dcb03f]
  [bt] (5) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x585) [0x7fdef16810a5]
  [bt] (6) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x147) [0x7fdef16944d7]
  [bt] (7) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x4e) [0x7fdef167f7ce]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fdee656c66f]

Minimum reproducible example

import mxnet as mx

sym = mx.sym.FullyConnected(
	mx.sym.Variable('in'), 
	mx.sym.Variable('w'), 
	mx.sym.Variable('b'), 
	num_hidden=2
)

dtype = 'float64'
explicit_grad = {
	'in': mx.nd.array([[2, 3, 4]], dtype=dtype),
	'w': mx.nd.array([[1, 2, 3], [4, 5, 6]], dtype=dtype),
	'b': mx.nd.array([7, 8], dtype=dtype)
}

args_grad = explicit_grad
grad_req = 'write'

ex = sym.bind(mx.cpu(), 
	{
		'in': mx.nd.array([[2, 3, 4]], dtype = dtype),
		'w': mx.nd.array([[1, 2, 3], [4, 5, 6]], dtype = dtype),
		'b': mx.nd.array([7, 8], dtype = dtype)
	},
	args_grad = args_grad,
	grad_req = grad_req
)
ex.forward()
print(ex.outputs[0])

The above script works, but setting args_grad = None or grad_req = 'null' (or both) makes it fail with this error:

src/operator/subgraph/mkldnn/../.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h:217: unknown type for MKLDNN

Every combination used to work at commit 076b2f3.
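For reference, the expected output of the working script can be cross-checked independently with NumPy, since FullyConnected computes out = data · weightᵀ + bias. This is just a sanity check on the numbers, not part of the repro:

```python
import numpy as np

# Same values as the repro script above
data = np.array([[2., 3., 4.]])            # shape (1, 3)
weight = np.array([[1., 2., 3.],
                   [4., 5., 6.]])          # shape (num_hidden=2, 3)
bias = np.array([7., 8.])

# FullyConnected: out = data @ weight.T + bias
out = data @ weight.T + bias
print(out)  # [[27. 55.]]
```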

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@vdantu
Contributor

vdantu commented Aug 6, 2019

@mxnet-label-bot add [Bug]

@pengzhao-intel
Contributor

@wuxun-zhang please take a look for this bug.

@wuxun-zhang
Contributor

@matteosal Thanks for reporting this issue. I can reproduce it locally. First, float64 is not supported yet by the current MKL-DNN implementation, so the MKL-DNN pass should not be executed in this example; there must be a missing or incomplete data type check somewhere. Additionally, grad_req depends on args_grad, so grad_req is always kNullOp when args_grad=None (see #L167).
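Conceptually, the missing guard amounts to checking the input dtype before taking the MKL-DNN path and falling back to the default implementation otherwise. A minimal Python sketch of that idea (the function names here are hypothetical, not MXNet's actual C++ API, and the exact supported-dtype set is an assumption based on this report):

```python
# Hypothetical sketch of the dispatch decision that appears to be missing.
MKLDNN_SUPPORTED_DTYPES = {'float32', 'int8', 'uint8'}  # assumed set

def should_use_mkldnn(dtype: str) -> bool:
    """Return True only when the MKL-DNN kernel can handle this dtype."""
    return dtype in MKLDNN_SUPPORTED_DTYPES

def fully_connected_dispatch(dtype: str) -> str:
    # Fall back to the default CPU implementation for e.g. float64,
    # instead of reaching the MKL-DNN kernel and failing with
    # "unknown type for MKLDNN".
    return 'mkldnn' if should_use_mkldnn(dtype) else 'default'

print(fully_connected_dispatch('float32'))  # mkldnn
print(fully_connected_dispatch('float64'))  # default
```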

@matteosal
Contributor Author

I also get the same problem with RNN, but setting explicit gradients doesn't help in this case. It seems completely broken on float64:

import mxnet as mx

sym = mx.sym.RNN(
	mx.sym.Variable('in'), 
	mx.sym.Variable('par'), 
	mx.sym.Variable('s'), 
	state_size = (2),
	num_layers = 1,
	mode = 'rnn_tanh'
)

dtype = 'float64'
explicit_grad = {
	'in': mx.nd.ones([2, 1, 2], dtype=dtype),
	'par': mx.nd.ones([12], dtype=dtype),
	's': mx.nd.ones([1, 1, 2], dtype=dtype)
}

args_grad = explicit_grad
grad_req = 'write'

ex = sym.bind(mx.cpu(), 
	{
		'in': mx.nd.ones([2, 1, 2], dtype=dtype),
		'par': mx.nd.ones([12], dtype=dtype),
		's': mx.nd.ones([1, 1, 2], dtype=dtype)
	},
	args_grad = args_grad,
	grad_req = grad_req
)
ex.forward()
print(ex.outputs[0])

Other RNN modes besides 'rnn_tanh' are also affected.
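For what it's worth, the expected rnn_tanh output can be sketched in NumPy. With every input, parameter, and initial-state entry set to 1 as in the script, the packed-parameter layout doesn't matter, so the numbers are a safe cross-check; treating the standard recurrence h_t = tanh(x_t W_ihᵀ + b_ih + h_{t-1} W_hhᵀ + b_hh) as what mode='rnn_tanh' computes is my assumption:

```python
import numpy as np

# Mirror the repro: seq_len=2, batch=1, input_size=2, state_size=2,
# and every input, parameter, and initial-state entry equal to 1.
seq_len, batch, input_size, state_size = 2, 1, 2, 2
x = np.ones((seq_len, batch, input_size))
h = np.ones((batch, state_size))
W_ih = np.ones((state_size, input_size))
W_hh = np.ones((state_size, state_size))
b_ih = np.ones(state_size)
b_hh = np.ones(state_size)  # 4 + 4 + 2 + 2 = 12 values, matching 'par'

outputs = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    outputs.append(h.copy())
print(np.stack(outputs))  # every entry is ~tanh(6) ≈ 0.99999
```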

@pengzhao-intel
Contributor

@wuxun-zhang let's double-check all data types in the MKLDNN backend. Maybe the fix should go into 1.5.1. @TaoLv

@wuxun-zhang
Contributor

It seems there is no data type check for the MKL-DNN stateful RNN implementation (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/rnn.cc#L226). So when the input data is float64, the MKL-DNN RNN pass is still executed and the "unknown type for MKLDNN" error is raised.

@zixuanweeei
Contributor

It seems there is no data type check for the MKL-DNN stateful RNN implementation (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/rnn.cc#L226). So when the input data is float64, the MKL-DNN RNN pass is still executed and the "unknown type for MKLDNN" error is raised.

The execution trace of RNN is marked out below.

https://github.com/apache/incubator-mxnet/blob/71861238743fcd8177afe52d1562d9078ac547de/src/operator/rnn.cc#L254

https://github.com/apache/incubator-mxnet/blob/71861238743fcd8177afe52d1562d9078ac547de/src/operator/nn/mkldnn/mkldnn_base-inl.h#L206-L220

@ZhennanQin
Contributor

ZhennanQin commented Aug 8, 2019

It's not really about float64 itself, but about the MKLDNN subgraph backend. The problem is that we recently enabled the MKLDNN subgraph backend by default on master, and this breaks the fallback mechanism when handling float64. So for nightly builds from master, please use export MXNET_SUBGRAPH_BACKEND=NONE as a temporary workaround; for MXNet v1.5.0, please unset MXNET_SUBGRAPH_BACKEND.
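Concretely, the workaround is an environment variable set in the shell before launching Python (this just restates the two cases above):

```shell
# Nightly build from master: disable the MKLDNN subgraph backend entirely
export MXNET_SUBGRAPH_BACKEND=NONE

# MXNet v1.5.0: make sure the variable is not set at all
unset MXNET_SUBGRAPH_BACKEND
```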

@ZhennanQin
Contributor

@pengzhao-intel @TaoLv v1.5.0 doesn't have this issue. So don't need to fix in v1.5.1.

@wuxun-zhang
Contributor

@ZhennanQin Can we add a data type check here (#L1663) to disable the subgraph pass when the input data type is not supported by MKL-DNN?

@pengzhao-intel
Contributor

@pengzhao-intel @TaoLv v1.5.0 doesn't have this issue. So don't need to fix in v1.5.1.

That's good news; we can try to resolve this in 1.6.

@pengzhao-intel
Contributor

@matteosal sorry for the delay. The PR was blocked by a third-party package, but that is now resolved and it will be merged soon.
