
[Bug] Failed to evaluate gradient on samples with train_mode=False #16256

Closed
ZhiminPeng opened this issue Sep 23, 2019 · 16 comments
@ZhiminPeng

Description

I am working on using integrated gradients to interpret DL models. This method requires evaluating gradients on a few samples. I understand that when evaluating gradients, one should set train_mode = False to avoid the training-time behavior of the Dropout layers. I was able to do so with feedforward networks and CNNs. But when experimenting with an LSTM, calling x.grad for the first time gives the error shown in the Error Message section, and calling it a second time returns a tensor of all zeros.
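
For context, integrated gradients averages the input gradient along a straight-line path from a baseline to the sample. Below is a minimal, hypothetical sketch of that loop with Gluon autograd; the generic model(x) signature, the zero baseline, and the step count are assumptions for illustration, not part of this report.

from mxnet import nd, autograd

def integrated_gradients(model, x, baseline=None, steps=50):
    # Approximate the integrated gradient of the model output w.r.t. x by
    # averaging input gradients along the straight path baseline -> x.
    if baseline is None:
        baseline = nd.zeros_like(x)              # zero baseline (assumption)
    total_grad = nd.zeros_like(x)
    for k in range(1, steps + 1):
        xk = baseline + (float(k) / steps) * (x - baseline)
        xk.attach_grad()
        with autograd.record(train_mode=False):  # inference behavior: no dropout
            out = model(xk)
        out.backward(train_mode=False)           # this is the call that fails for LSTMs
        total_grad = total_grad + xk.grad
    return (x - baseline) * total_grad / steps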

Environment info (Required)

----------Python Info----------
Version      : 3.7.3
Compiler     : Clang 10.0.0 (clang-1000.11.45.5)
Build        : ('default', 'Mar 27 2019 09:23:39')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /Users/zmpeng/Documents/software/venv/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /Users/zmpeng/Documents/software/venv/lib/python3.7/site-packages/mxnet
Commit Hash   : 75a9e187d00a8b7ebc71412a02ed0e3ae489d91f
Library      : ['/Users/zmpeng/Documents/software/venv/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
Platform     : Darwin-17.7.0-x86_64-i386-64bit
system       : Darwin
node         : 38f9d34de030.ant.amazon.com
release      : 17.7.0
version      : Darwin Kernel Version 17.7.0: Sun Jun  2 20:31:42 PDT 2019; root:xnu-4570.71.46~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT MD_CLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0301 sec, LOAD: 0.7691 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0391 sec, LOAD: 0.2572 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0349 sec, LOAD: 0.1725 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0557 sec, LOAD: 0.1688 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0312 sec, LOAD: 0.5998 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0427 sec, LOAD: 0.1279 sec.
----------Environment----------
KMP_DUPLICATE_LIB_OK="True"
MXNET_CPU_WORKER_NTHREADS="1"
OMP_NUM_THREADS="1"

Package used (Python/R/Scala/Julia):
I'm using Python

Error Message:

---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
~/Documents/software/venv/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/Documents/software/venv/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403 
    404             return _default_pprint(obj, self, cycle)

~/Documents/software/venv/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

~/Documents/software/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py in __repr__(self)
    192         """Returns a string representation of the array."""
    193         shape_info = 'x'.join(['%d' % x for x in self.shape])
--> 194         return '\n%s\n<%s %s @%s>' % (str(self.asnumpy()),
    195                                       self.__class__.__name__,
    196                                       shape_info, self.context)

~/Documents/software/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   1994             self.handle,
   1995             data.ctypes.data_as(ctypes.c_void_p),
-> 1996             ctypes.c_size_t(data.size)))
   1997         return data
   1998 

~/Documents/software/venv/lib/python3.7/site-packages/mxnet/base.py in check_call(ret)
    251     """
    252     if ret != 0:
--> 253         raise MXNetError(py_str(_LIB.MXGetLastError()))
    254 
    255 

MXNetError: [16:10:34] src/operator/./rnn-inl.h:1180: Check forward init error
Stack trace:
  [bt] (0) 1   libmxnet.so                         0x0000000117fbc929 mxnet::op::NDArrayOpProp::~NDArrayOpProp() + 4473
  [bt] (1) 2   libmxnet.so                         0x0000000117fbbd19 mxnet::op::NDArrayOpProp::~NDArrayOpProp() + 1385
  [bt] (2) 3   libmxnet.so                         0x0000000119aa475e void mxnet::op::RegressionBackwardCSRImpl<mshadow::cpu, mxnet::op::mshadow_op::minus>(mshadow::Stream<mshadow::cpu>*, mxnet::op::RegressionOutputParam const&, mxnet::OpReqType, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::NDArray const&) + 327422
  [bt] (3) 4   libmxnet.so                         0x0000000119a616f7 void mxnet::op::RegressionBackwardCSRImpl<mshadow::cpu, mxnet::op::mshadow_op::minus>(mshadow::Stream<mshadow::cpu>*, mxnet::op::RegressionOutputParam const&, mxnet::OpReqType, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::NDArray const&) + 52887
  [bt] (4) 5   libmxnet.so                         0x00000001195490eb mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::'lambda0'(mxnet::RunContext, mxnet::engine::CallbackOnComplete)::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const + 747
  [bt] (5) 6   libmxnet.so                         0x000000011954a902 std::__1::__function::__func<mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::'lambda0'(mxnet::RunContext), std::__1::allocator<mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::'lambda0'(mxnet::RunContext)>, void (mxnet::RunContext)>::operator()(mxnet::RunContext&&) + 66
  [bt] (6) 7   libmxnet.so                         0x00000001194bb3ab std::__1::enable_if<(__is_forward_iterator<mxnet::NDArray**>::value) && (is_constructible<mxnet::NDArray*, std::__1::iterator_traits<mxnet::NDArray**>::reference>::value), void>::type std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> >::assign<mxnet::NDArray**>(mxnet::NDArray**, mxnet::NDArray**) + 21307
  [bt] (7) 8   libmxnet.so                         0x00000001194bfdb1 std::__1::enable_if<(__is_forward_iterator<mxnet::NDArray**>::value) && (is_constructible<mxnet::NDArray*, std::__1::iterator_traits<mxnet::NDArray**>::reference>::value), void>::type std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> >::assign<mxnet::NDArray**>(mxnet::NDArray**, mxnet::NDArray**) + 40257
  [bt] (8) 9   libmxnet.so                         0x00000001194c30e2 std::__1::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda2'()>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda2'()) + 2258

Minimum reproducible example

import mxnet as mx
from mxnet.gluon import nn, rnn
from mxnet import nd, gluon, autograd
context = mx.cpu()

class MyLSTM(gluon.Block):
    def __init__(self, hidden_dim, input_size, max_seq_len, 
                 num_recurent_layers=1, dropout=0., **kwargs):
        super(MyLSTM, self).__init__(**kwargs)
        self.lstm = rnn.LSTM(
            hidden_size=hidden_dim,
            num_layers=num_recurent_layers,
            dropout=dropout,
            bidirectional=False,
            input_size=input_size,
        )
        self.maxpool = nn.MaxPool1D(pool_size=max_seq_len)
        self.hidden2label = nn.Dense(
            2, 
            in_units=hidden_dim,  
            use_bias=True
        )
        
    def forward(self, input_data, hidden):
        output_lstm, hidden = self.lstm(input_data, hidden)
        output_tanh = nd.Activation(output_lstm, "tanh")
        output_tanh = nd.transpose(output_tanh, axes=(1, 2, 0))
        output_maxpool = self.maxpool(output_tanh)
        output_maxpool = nd.flatten(output_maxpool)
        label = self.hidden2label(output_maxpool)
        return label
    
    def begin_state(self, func, input_data, ctx=context, **kwargs):
        return self.lstm.begin_state(input_data.shape[1], func, ctx=ctx, **kwargs)

mx.random.seed(1)
max_seq_len = 10
input_size = 6
hidden_dim = 4

model = MyLSTM(hidden_dim, input_size, max_seq_len, dropout=0.2)
model.collect_params().initialize(mx.init.Xavier(), ctx=context)

x = nd.random_normal(0, 1, shape=(max_seq_len, 1, input_size))
x.attach_grad()
train_mode = False
hidden = model.begin_state(func=mx.nd.zeros, input_data=x, ctx=context)
with autograd.record(train_mode=train_mode):
    output = model(x, hidden)
    target = output[0][1]
    target.backward(train_mode=train_mode)

x.grad

Steps to reproduce

Just run the pasted Python code

What have you tried to solve it?

  1. Setting train_mode to True avoids the error and seems to produce the right gradient here, even though dropout is specified. However, for models that actually contain a Dropout layer, train_mode = True produces the wrong gradient, since dropout is applied during the forward pass (see the sketch below).
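
A sketch of that workaround, reusing model, x and hidden from the repro above (not a recommended fix, since any Dropout layer is then active):

# Recording in training mode avoids the "Check forward init error",
# but dropout (if present) is applied, so the gradient no longer
# matches inference behavior.
with autograd.record(train_mode=True):
    output = model(x, hidden)
    target = output[0][1]
    target.backward(train_mode=True)
print(x.grad)
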
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Bug

@szha szha added the Bug label Sep 25, 2019
@szha
Member

szha commented Sep 25, 2019

@TaoLv could you take a look?

@TaoLv
Member

TaoLv commented Sep 25, 2019

I can reproduce the issue by changing the last line to print(x.grad); it exists for both mxnet and mxnet-mkl. I just talked with @zixuanweeei, and he will help take a look.

@zixuanweeei
Contributor

zixuanweeei commented Sep 25, 2019

Thanks for reporting this issue. The RNN variants need a workspace to store intermediate results, such as the output of every gate and the state at every step, and this workspace is created only when is_train=True. These intermediate results are used in the gradient calculation. When train_mode=False (is_train=False), no workspace is created in Forward, so Backward raises the error.

As for dropout, mxnet-mkl doesn't support it. But you can export MXNET_USE_MKLDNN_RNN=0 to force MXNet onto the native CPU RNN path, where dropout is enabled. If you don't set MXNET_USE_MKLDNN_RNN=0, it defaults to 1, which means the MKL-DNN RNN path, without dropout support, is executed.
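
If it helps, here is a minimal sketch of setting that variable from Python; it assumes the variable is picked up at runtime by the RNN operator, so setting it before mxnet is imported in the process is the safest option:

import os
# Force the native CPU RNN implementation (with dropout support)
# instead of the MKL-DNN fused RNN path.
os.environ["MXNET_USE_MKLDNN_RNN"] = "0"
import mxnet as mx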

For now, we don't have a good solution for your request. The RNN operator has a different mechanism from other operators. I will look for a solution. Any insights? @ZhiminPeng

@zixuanweeei
Contributor

@ZhiminPeng Could you give us some details about your application scenario? If you really need this to work, we have provided a temporary fix in 025a227. It would be greatly appreciated if you could try it in your application, and feel free to tell us if there are any problems. Thanks.

@zixuanweeei
Contributor

zixuanweeei commented Sep 26, 2019

As far as I know, the fused RNN operator needs a permanent workspace for storing intermediate results, which are used to calculate the gradients in Backward. On an inference-only route, we don't intend to store those results, partly for performance reasons. After talking with @TaoLv offline, it seems the pooling operator may have a similar problem.

In the autograd.record scenario, do we have a mechanism by which operators can store intermediate results for later use, such as the gradient calculation, or should we just force them onto the forward-training path? @szha

@ZhiminPeng
Author

@zixuanweeei Thanks for looking into this. I would love to give your temporary fix a try. I installed mxnet through pip, so I wonder how I should pick up your change. My application scenario is model interpretation through integrated gradients, which requires evaluating the gradient of the model on a few samples.

@zixuanweeei
Contributor

zixuanweeei commented Sep 27, 2019

A nice paper! Maybe we can talk about it in the future. I am not familiar with some of the "axiom"s yet 😄. To pick up our change, I think you should build MXNet from source. It involves roughly the following steps:

  • Prepare the source code and pick up the change
git clone --recursive https://github.com/apache/incubator-mxnet.git && cd incubator-mxnet
git fetch https://github.com/zixuanweeei/incubator-mxnet.git rnn/force-forward-training
git cherry-pick 025a22790773cbd2dede80fc05c61ea9f3896e9d
  • Build the library from source, then point Python at the local build (unix-like env)
cd your/path/to/incubator-mxnet
export PYTHONPATH=$PWD/python:$PYTHONPATH
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
  • Check that the local build is picked up once the path variables are set
python -c "import mxnet; print(mxnet)"

You should get a path pointing at a subdirectory of the root directory of incubator-mxnet.

@ZhiminPeng
Author

The fix works

@szha
Member

szha commented Oct 1, 2019

@zixuanweeei the train_mode concept is different from whether it is "an inference-only route". It controls the different behaviors between training mode and inference mode (e.g. dropout behaves as identity during inference). The code should instead rely on whether the gradient is being recorded.
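
For illustration, a minimal Python-level sketch of that distinction (the real fix would live in the RNN operator's backend code, so this only shows the two autograd flags, not the fix itself):

from mxnet import autograd

# Inside a Block's forward, the two flags can differ:
#   autograd.is_training()  -> True only under train_mode=True
#   autograd.is_recording() -> True whenever autograd is recording, even with train_mode=False
if autograd.is_recording():
    pass  # gradients will be requested: keep the workspace needed for Backward
else:
    pass  # pure inference: the workspace can be skipped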

@zixuanweeei
Contributor

@ZhiminPeng Thank you for trying it. Note that the fix is only temporary, and it may cost some performance.

@zixuanweeei
Contributor

@szha Thanks for your reply. I will look into the train_mode concept in more depth. And thanks for the hint about checking whether the gradient is being recorded; it may be the key to solving this problem.

@ZhiminPeng
Author

Do we have a timeline to get the correct fix merged? Our team is currently blocked by this.

@samskalicky
Contributor

@zachgk assign [@szha ]

@zixuanweeei
Contributor

Do we have a timeline to get the correct fix merged? Our team is currently blocked by this.

Sorry for the late update. For now, we are focusing on the MXNet 1.6 upgrade, so this may be fixed after 1.6.

@zixuanweeei
Contributor

@ZhiminPeng We have just proposed a fix for this problem in PR #16657. I hope it resolves the issue.

@szha szha closed this as completed Nov 4, 2019