This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

FullyConnected op with float64 and MKL-DNN fails if gradient are not set in a specific way #15767

Closed
matteosal opened this issue Aug 6, 2019 · 13 comments · Fixed by #15853

Comments

@matteosal
Contributor

Description

With MKL-DNN and float64 arrays, getting the output of a FullyConnected op after a forward pass fails unless grad_req is set to something other than 'null' and explicit gradient arrays are provided (even though no backward pass is involved).

Environment info (Required)

----------Python Info----------
Version      : 3.7.2
Compiler     : GCC 7.3.0
Build        : ('default', 'Dec 29 2018 06:19:36')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.1
Directory    : /opt/Anaconda/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /home/matteo/Git/mxnet/python/mxnet
Commit hash file "/home/matteo/Git/mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✔ F16C
✔ JEMALLOC
✖ BLAS_OPEN
✔ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✖ LAPACK
✔ MKLDNN
✖ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✖ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
Platform     : Linux-4.15.0-55-generic-x86_64-with-debian-buster-sid
system       : Linux
node         : mongolius
release      : 4.15.0-55-generic
version      : #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:            3
CPU MHz:             2700.253
CPU max MHz:         3500,0000
CPU min MHz:         800,0000
BogoMIPS:            5184.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0117 sec, LOAD: 0.8935 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0599 sec, LOAD: 2.1901 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1028 sec, LOAD: 0.9832 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0657 sec, LOAD: 1.2597 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0380 sec, LOAD: 0.8543 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0395 sec, LOAD: 0.4625 sec.

Package used: python

Build info (Required if built from source)

Compiler: gcc

MXNet commit hash: 3255d87

Build config: plain config.mk with USE_OPENCV=0

Error Message:

Traceback (most recent call last):
  File "script.py", line 30, in <module>
    print(ex.outputs[0])
  File "/home/matteo/Git/mxnet/python/mxnet/ndarray/ndarray.py", line 194, in __repr__
    return '\n%s\n<%s %s @%s>' % (str(self.asnumpy()),
  File "/home/matteo/Git/mxnet/python/mxnet/ndarray/ndarray.py", line 2096, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/matteo/Git/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:46:52] src/operator/subgraph/mkldnn/../.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h:217: unknown type for MKLDNN
Stack trace:
  [bt] (0) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7fdeee8a4fc3]
  [bt] (1) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::SgMKLDNNFCOp::Forward(mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x902) [0x7fdeee92ece2]
  [bt] (2) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::StatefulComputeExExecutor::Run(mxnet::RunContext, bool)+0x2d1) [0x7fdef0e0ec81]
  [bt] (3) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2b00ead) [0x7fdef0dcaead]
  [bt] (4) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2b0103f) [0x7fdef0dcb03f]
  [bt] (5) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x585) [0x7fdef16810a5]
  [bt] (6) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x147) [0x7fdef16944d7]
  [bt] (7) /home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x4e) [0x7fdef167f7ce]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fdee656c66f]

Minimum reproducible example

import mxnet as mx

sym = mx.sym.FullyConnected(
	mx.sym.Variable('in'), 
	mx.sym.Variable('w'), 
	mx.sym.Variable('b'), 
	num_hidden=2
)

dtype = 'float64'
explicit_grad = {
	'in': mx.nd.array([[2, 3, 4]], dtype=dtype),
	'w': mx.nd.array([[1, 2, 3], [4, 5, 6]], dtype=dtype),
	'b': mx.nd.array([7, 8], dtype=dtype)
}

args_grad = explicit_grad
grad_req = 'write'

ex = sym.bind(mx.cpu(), 
	{
		'in': mx.nd.array([[2, 3, 4]], dtype = dtype),
		'w': mx.nd.array([[1, 2, 3], [4, 5, 6]], dtype = dtype),
		'b': mx.nd.array([7, 8], dtype = dtype)
	},
	args_grad = args_grad,
	grad_req = grad_req
)
ex.forward()
print(ex.outputs[0])

The above script works, but setting args_grad = None or grad_req = 'null' (or both) makes it fail with this error:

src/operator/subgraph/mkldnn/../.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h:217: unknown type for MKLDNN

Every combination used to work at commit 076b2f3.
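For reference, the expected output of the working script can be cross-checked independently with NumPy, since FullyConnected computes out = data · weightᵀ + bias. This is just a sanity check on the numbers, not part of the repro:

```python
import numpy as np

# Same values as the repro script above
data = np.array([[2., 3., 4.]])            # shape (1, 3)
weight = np.array([[1., 2., 3.],
                   [4., 5., 6.]])          # shape (num_hidden=2, 3)
bias = np.array([7., 8.])

# FullyConnected: out = data @ weight.T + bias
out = data @ weight.T + bias
print(out)  # [[27. 55.]]
```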

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@vdantu
Contributor

vdantu commented Aug 6, 2019

@mxnet-label-bot add [Bug]

@pengzhao-intel
Contributor

@wuxun-zhang please take a look for this bug.

@wuxun-zhang
Contributor

@matteosal Thanks for reporting this issue. I can reproduce it locally. First, float64 is not supported yet by the current MKL-DNN implementation, so the MKL-DNN pass should not be executed in this example; there must be a missing or incomplete data type check somewhere. Additionally, grad_req depends on args_grad, so grad_req is always kNullOp when args_grad=None (see #L167).
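Conceptually, the missing guard amounts to checking the input dtype before taking the MKL-DNN path and falling back to the default implementation otherwise. A minimal Python sketch of that idea (the function names here are hypothetical, not MXNet's actual C++ API, and the exact supported-dtype set is an assumption based on this report):

```python
# Hypothetical sketch of the dispatch decision that appears to be missing.
MKLDNN_SUPPORTED_DTYPES = {'float32', 'int8', 'uint8'}  # assumed set

def should_use_mkldnn(dtype: str) -> bool:
    """Return True only when the MKL-DNN kernel can handle this dtype."""
    return dtype in MKLDNN_SUPPORTED_DTYPES

def fully_connected_dispatch(dtype: str) -> str:
    # Fall back to the default CPU implementation for e.g. float64,
    # instead of reaching the MKL-DNN kernel and failing with
    # "unknown type for MKLDNN".
    return 'mkldnn' if should_use_mkldnn(dtype) else 'default'

print(fully_connected_dispatch('float32'))  # mkldnn
print(fully_connected_dispatch('float64'))  # default
```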

@matteosal
Contributor Author

I also get the same problem with RNN, but setting explicit gradients doesn't help in this case. It seems completely broken on float64:

import mxnet as mx

sym = mx.sym.RNN(
	mx.sym.Variable('in'), 
	mx.sym.Variable('par'), 
	mx.sym.Variable('s'), 
	state_size = (2),
	num_layers = 1,
	mode = 'rnn_tanh'
)

dtype = 'float64'
explicit_grad = {
	'in': mx.nd.ones([2, 1, 2], dtype=dtype),
	'par': mx.nd.ones([12], dtype=dtype),
	's': mx.nd.ones([1, 1, 2], dtype=dtype)
}

args_grad = explicit_grad
grad_req = 'write'

ex = sym.bind(mx.cpu(), 
	{
		'in': mx.nd.ones([2, 1, 2], dtype=dtype),
		'par': mx.nd.ones([12], dtype=dtype),
		's': mx.nd.ones([1, 1, 2], dtype=dtype)
	},
	args_grad = args_grad,
	grad_req = grad_req
)
ex.forward()
print(ex.outputs[0])

Other RNN modes besides 'rnn_tanh' are also affected.
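For what it's worth, the expected rnn_tanh output can be sketched in NumPy. With every input, parameter, and initial-state entry set to 1 as in the script, the packed-parameter layout doesn't matter, so the numbers are a safe cross-check; treating the standard recurrence h_t = tanh(x_t W_ihᵀ + b_ih + h_{t-1} W_hhᵀ + b_hh) as what mode='rnn_tanh' computes is my assumption:

```python
import numpy as np

# Mirror the repro: seq_len=2, batch=1, input_size=2, state_size=2,
# and every input, parameter, and initial-state entry equal to 1.
seq_len, batch, input_size, state_size = 2, 1, 2, 2
x = np.ones((seq_len, batch, input_size))
h = np.ones((batch, state_size))
W_ih = np.ones((state_size, input_size))
W_hh = np.ones((state_size, state_size))
b_ih = np.ones(state_size)
b_hh = np.ones(state_size)  # 4 + 4 + 2 + 2 = 12 values, matching 'par'

outputs = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    outputs.append(h.copy())
print(np.stack(outputs))  # every entry is ~tanh(6) ≈ 0.99999
```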

@pengzhao-intel
Contributor

@wuxun-zhang let's double-check all data types in the MKLDNN backend. Maybe the fix should go into 1.5.1. @TaoLv

@wuxun-zhang
Contributor

It seems there is no data type check for the MKL-DNN stateful RNN implementation (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/rnn.cc#L226). So when the input data is float64, the MKL-DNN RNN pass is still executed and the "unknown type for MKLDNN" error is raised.

@zixuanweeei
Contributor

It seems there is no data type check for the MKL-DNN stateful RNN implementation (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/rnn.cc#L226). So when the input data is float64, the MKL-DNN RNN pass is still executed and the "unknown type for MKLDNN" error is raised.

The execution trace of RNN is marked out below.

https://github.com/apache/incubator-mxnet/blob/71861238743fcd8177afe52d1562d9078ac547de/src/operator/rnn.cc#L254

https://github.com/apache/incubator-mxnet/blob/71861238743fcd8177afe52d1562d9078ac547de/src/operator/nn/mkldnn/mkldnn_base-inl.h#L206-L220

@ZhennanQin
Contributor

ZhennanQin commented Aug 8, 2019

It's not really about float64 itself, but about the MKLDNN subgraph backend. The problem is that we recently enabled the MKLDNN subgraph backend by default on master, and this breaks the fallback mechanism when handling float64. So for nightly builds from master, please use export MXNET_SUBGRAPH_BACKEND=NONE as a temporary workaround; for MXNet v1.5.0, please unset MXNET_SUBGRAPH_BACKEND.
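Concretely, the workaround is an environment variable set in the shell before launching Python (this just restates the two cases above):

```shell
# Nightly build from master: disable the MKLDNN subgraph backend entirely
export MXNET_SUBGRAPH_BACKEND=NONE

# MXNet v1.5.0: make sure the variable is not set at all
unset MXNET_SUBGRAPH_BACKEND
```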

@ZhennanQin
Contributor

@pengzhao-intel @TaoLv v1.5.0 doesn't have this issue. So don't need to fix in v1.5.1.

@wuxun-zhang
Contributor

@ZhennanQin Can we add a data type check here (#L1663) to disable the subgraph pass when the input data type is not supported by MKL-DNN?

@pengzhao-intel
Contributor

@pengzhao-intel @TaoLv v1.5.0 doesn't have this issue. So don't need to fix in v1.5.1.

That's good news; we can try to resolve this in 1.6.

@pengzhao-intel
Contributor

@matteosal sorry for the delay. The PR was blocked by a third-party package, but that is now resolved and it will be merged soon.
