
[Bug] Failed to evaluate gradient on samples with train_mode=False #16256

Closed
ZhiminPeng opened this issue Sep 23, 2019 · 16 comments
@ZhiminPeng

Description

I am working on using integrated gradients to interpret DL models. This method requires evaluating gradients on a few samples. I understand that when evaluating gradients, one should set train_mode = False to avoid the training-time behavior of the Dropout layers. I was able to do so with feedforward networks and CNNs. But when experimenting with an LSTM, calling x.grad for the first time gives the error shown in the Error Message section, and calling it a second time returns a tensor of all zeros.
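
For context, integrated gradients averages the input gradient along a straight-line path from a baseline to the sample. Below is a minimal, hypothetical sketch of that loop with Gluon autograd; the generic model(x) signature, the zero baseline, and the step count are assumptions for illustration, not part of this report.

from mxnet import nd, autograd

def integrated_gradients(model, x, baseline=None, steps=50):
    # Approximate the integrated gradient of the model output w.r.t. x by
    # averaging input gradients along the straight path baseline -> x.
    if baseline is None:
        baseline = nd.zeros_like(x)              # zero baseline (assumption)
    total_grad = nd.zeros_like(x)
    for k in range(1, steps + 1):
        xk = baseline + (float(k) / steps) * (x - baseline)
        xk.attach_grad()
        with autograd.record(train_mode=False):  # inference behavior: no dropout
            out = model(xk)
        out.backward(train_mode=False)           # this is the call that fails for LSTMs
        total_grad = total_grad + xk.grad
    return (x - baseline) * total_grad / steps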

Environment info (Required)

----------Python Info----------
Version      : 3.7.3
Compiler     : Clang 10.0.0 (clang-1000.11.45.5)
Build        : ('default', 'Mar 27 2019 09:23:39')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /Users/zmpeng/Documents/software/venv/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /Users/zmpeng/Documents/software/venv/lib/python3.7/site-packages/mxnet
Commit Hash   : 75a9e187d00a8b7ebc71412a02ed0e3ae489d91f
Library      : ['/Users/zmpeng/Documents/software/venv/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
Platform     : Darwin-17.7.0-x86_64-i386-64bit
system       : Darwin
node         : 38f9d34de030.ant.amazon.com
release      : 17.7.0
version      : Darwin Kernel Version 17.7.0: Sun Jun  2 20:31:42 PDT 2019; root:xnu-4570.71.46~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT MD_CLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0301 sec, LOAD: 0.7691 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0391 sec, LOAD: 0.2572 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0349 sec, LOAD: 0.1725 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0557 sec, LOAD: 0.1688 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0312 sec, LOAD: 0.5998 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0427 sec, LOAD: 0.1279 sec.
----------Environment----------
KMP_DUPLICATE_LIB_OK="True"
MXNET_CPU_WORKER_NTHREADS="1"
OMP_NUM_THREADS="1"

Package used (Python/R/Scala/Julia):
I'm using Python

Error Message:

---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
~/Documents/software/venv/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/Documents/software/venv/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403 
    404             return _default_pprint(obj, self, cycle)

~/Documents/software/venv/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

~/Documents/software/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py in __repr__(self)
    192         """Returns a string representation of the array."""
    193         shape_info = 'x'.join(['%d' % x for x in self.shape])
--> 194         return '\n%s\n<%s %s @%s>' % (str(self.asnumpy()),
    195                                       self.__class__.__name__,
    196                                       shape_info, self.context)

~/Documents/software/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   1994             self.handle,
   1995             data.ctypes.data_as(ctypes.c_void_p),
-> 1996             ctypes.c_size_t(data.size)))
   1997         return data
   1998 

~/Documents/software/venv/lib/python3.7/site-packages/mxnet/base.py in check_call(ret)
    251     """
    252     if ret != 0:
--> 253         raise MXNetError(py_str(_LIB.MXGetLastError()))
    254 
    255 

MXNetError: [16:10:34] src/operator/./rnn-inl.h:1180: Check forward init error
Stack trace:
  [bt] (0) 1   libmxnet.so                         0x0000000117fbc929 mxnet::op::NDArrayOpProp::~NDArrayOpProp() + 4473
  [bt] (1) 2   libmxnet.so                         0x0000000117fbbd19 mxnet::op::NDArrayOpProp::~NDArrayOpProp() + 1385
  [bt] (2) 3   libmxnet.so                         0x0000000119aa475e void mxnet::op::RegressionBackwardCSRImpl<mshadow::cpu, mxnet::op::mshadow_op::minus>(mshadow::Stream<mshadow::cpu>*, mxnet::op::RegressionOutputParam const&, mxnet::OpReqType, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::NDArray const&) + 327422
  [bt] (3) 4   libmxnet.so                         0x0000000119a616f7 void mxnet::op::RegressionBackwardCSRImpl<mshadow::cpu, mxnet::op::mshadow_op::minus>(mshadow::Stream<mshadow::cpu>*, mxnet::op::RegressionOutputParam const&, mxnet::OpReqType, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::NDArray const&) + 52887
  [bt] (4) 5   libmxnet.so                         0x00000001195490eb mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::'lambda0'(mxnet::RunContext, mxnet::engine::CallbackOnComplete)::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const + 747
  [bt] (5) 6   libmxnet.so                         0x000000011954a902 std::__1::__function::__func<mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::'lambda0'(mxnet::RunContext), std::__1::allocator<mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::'lambda0'(mxnet::RunContext)>, void (mxnet::RunContext)>::operator()(mxnet::RunContext&&) + 66
  [bt] (6) 7   libmxnet.so                         0x00000001194bb3ab std::__1::enable_if<(__is_forward_iterator<mxnet::NDArray**>::value) && (is_constructible<mxnet::NDArray*, std::__1::iterator_traits<mxnet::NDArray**>::reference>::value), void>::type std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> >::assign<mxnet::NDArray**>(mxnet::NDArray**, mxnet::NDArray**) + 21307
  [bt] (7) 8   libmxnet.so                         0x00000001194bfdb1 std::__1::enable_if<(__is_forward_iterator<mxnet::NDArray**>::value) && (is_constructible<mxnet::NDArray*, std::__1::iterator_traits<mxnet::NDArray**>::reference>::value), void>::type std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> >::assign<mxnet::NDArray**>(mxnet::NDArray**, mxnet::NDArray**) + 40257
  [bt] (8) 9   libmxnet.so                         0x00000001194c30e2 std::__1::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> > mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::Get<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda2'()>(int, mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda2'()) + 2258

Minimum reproducible example

import mxnet as mx
from mxnet.gluon import nn, rnn
from mxnet import nd, gluon, autograd
context = mx.cpu()

class MyLSTM(gluon.Block):
    def __init__(self, hidden_dim, input_size, max_seq_len, 
                 num_recurent_layers=1, dropout=0., **kwargs):
        super(MyLSTM, self).__init__(**kwargs)
        self.lstm = rnn.LSTM(
            hidden_size=hidden_dim,
            num_layers=num_recurent_layers,
            dropout=dropout,
            bidirectional=False,
            input_size=input_size,
        )
        self.maxpool = nn.MaxPool1D(pool_size=max_seq_len)
        self.hidden2label = nn.Dense(
            2, 
            in_units=hidden_dim,  
            use_bias=True
        )
        
    def forward(self, input_data, hidden):
        output_lstm, hidden = self.lstm(input_data, hidden)
        output_tanh = nd.Activation(output_lstm, "tanh")
        output_tanh = nd.transpose(output_tanh, axes=(1, 2, 0))
        output_maxpool = self.maxpool(output_tanh)
        output_maxpool = nd.flatten(output_maxpool)
        label = self.hidden2label(output_maxpool)
        return label
    
    def begin_state(self, func, input_data, ctx=context, **kwargs):
        return self.lstm.begin_state(input_data.shape[1], func, ctx=ctx, **kwargs)

mx.random.seed(1)
max_seq_len = 10
input_size = 6
hidden_dim = 4

model = MyLSTM(hidden_dim, input_size, max_seq_len, dropout=0.2)
model.collect_params().initialize(mx.init.Xavier(), ctx=context)

x = nd.random_normal(0, 1, shape=(max_seq_len, 1, input_size))
x.attach_grad()
train_mode = False
hidden = model.begin_state(func=mx.nd.zeros, input_data=x, ctx=context)
with autograd.record(train_mode=train_mode):
    output = model(x, hidden)
    target = output[0][1]
    target.backward(train_mode=train_mode)

x.grad

Steps to reproduce

Just run the pasted Python code

What have you tried to solve it?

  1. Setting train_mode to True avoids the error and seems to produce the right gradient here, even though dropout is specified. However, for models that actually contain a Dropout layer, train_mode = True produces the wrong gradient, since dropout is applied during the forward pass (see the sketch below).
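
A sketch of that workaround, reusing model, x and hidden from the repro above (not a recommended fix, since any Dropout layer is then active):

# Recording in training mode avoids the "Check forward init error",
# but dropout (if present) is applied, so the gradient no longer
# matches inference behavior.
with autograd.record(train_mode=True):
    output = model(x, hidden)
    target = output[0][1]
    target.backward(train_mode=True)
print(x.grad)
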
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Bug

@szha szha added the Bug label Sep 25, 2019
@szha
Member

szha commented Sep 25, 2019

@TaoLv could you take a look?

@TaoLv
Member

TaoLv commented Sep 25, 2019

I can reproduce the issue by changing the last line to print(x.grad); it exists for both mxnet and mxnet-mkl. I just talked with @zixuanweeei, and he will help take a look.

@zixuanweeei
Contributor

zixuanweeei commented Sep 25, 2019

Thanks for reporting this issue. The RNN variants need a workspace to store intermediate results, such as the output of every gate and the state at every step, and this workspace is created only when is_train=True. These intermediate results are used in the gradient calculation. When train_mode=False (is_train=False), no workspace is created in Forward, so Backward raises the error.

As for dropout, mxnet-mkl doesn't support it. But you can export MXNET_USE_MKLDNN_RNN=0 to force MXNet onto the native CPU RNN path, where dropout is enabled. If you don't set MXNET_USE_MKLDNN_RNN=0, it defaults to 1, which means the MKL-DNN RNN path, without dropout support, is executed.
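
If it helps, here is a minimal sketch of setting that variable from Python; it assumes the variable is picked up at runtime by the RNN operator, so setting it before mxnet is imported in the process is the safest option:

import os
# Force the native CPU RNN implementation (with dropout support)
# instead of the MKL-DNN fused RNN path.
os.environ["MXNET_USE_MKLDNN_RNN"] = "0"
import mxnet as mx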

For now, we don't have a good solution for your request. The RNN operator has a different mechanism from other operators. I will look for a solution. Any insights? @ZhiminPeng

@zixuanweeei
Contributor

@ZhiminPeng Could you give us some details about your application scenario? If you really need this to work, we have provided a temporary fix in 025a227. It would be greatly appreciated if you could try it in your application, and feel free to tell us if there are any problems. Thanks.

@zixuanweeei
Contributor

zixuanweeei commented Sep 26, 2019

As far as I know, the fused RNN operator needs a permanent workspace for storing intermediate results, which are used to calculate the gradients in Backward. On an inference-only route, we don't intend to store those results, partly for performance reasons. After talking with @TaoLv offline, it seems the pooling operator may have a similar problem.

In the autograd.record scenario, do we have a mechanism by which operators can store intermediate results for later use, such as the gradient calculation, or should we just force them onto the forward-training path? @szha

@ZhiminPeng
Author

@zixuanweeei Thanks for looking into this. I would love to give your temporary fix a try. I installed mxnet through pip, so I wonder how I should pick up your change. My application scenario is model interpretation through integrated gradients, which requires evaluating the gradient of the model on a few samples.

@zixuanweeei
Contributor

zixuanweeei commented Sep 27, 2019

A nice paper! Maybe we can talk about it in the future. I am not familiar with some of the "axiom"s yet 😄. To pick up our change, I think you should build MXNet from source. It involves roughly the following steps:

  • Prepare the source code and pick up the change
git clone --recursive https://github.com/apache/incubator-mxnet.git && cd incubator-mxnet
git fetch https://github.com/zixuanweeei/incubator-mxnet.git rnn/force-forward-training
git cherry-pick 025a22790773cbd2dede80fc05c61ea9f3896e9d
  • Build the library from source, then point Python at the local build (unix-like env)
cd your/path/to/incubator-mxnet
export PYTHONPATH=$PWD/python:$PYTHONPATH
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
  • Check that the local build is picked up once the path variables are set
python -c "import mxnet; print(mxnet)"

You should get a path pointing at a subdirectory of the root directory of incubator-mxnet.

@ZhiminPeng
Author

The fix works

@szha
Member

szha commented Oct 1, 2019

@zixuanweeei the train_mode concept is different from whether it is "an inference-only route". It controls the different behaviors between training mode and inference mode (e.g. dropout behaves as identity during inference). The code should instead rely on whether the gradient is being recorded.
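
For illustration, a minimal Python-level sketch of that distinction (the real fix would live in the RNN operator's backend code, so this only shows the two autograd flags, not the fix itself):

from mxnet import autograd

# Inside a Block's forward, the two flags can differ:
#   autograd.is_training()  -> True only under train_mode=True
#   autograd.is_recording() -> True whenever autograd is recording, even with train_mode=False
if autograd.is_recording():
    pass  # gradients will be requested: keep the workspace needed for Backward
else:
    pass  # pure inference: the workspace can be skipped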

@zixuanweeei
Contributor

@ZhiminPeng Thank you for trying it. Note that the fix is only temporary, and it may cost some performance.

@zixuanweeei
Contributor

@szha Thanks for your reply. I will look into the train_mode concept in more depth. And thanks for the hint about checking whether the gradient is being recorded; it may be the key to solving this problem.

@ZhiminPeng
Author

Do we have a timeline to get the correct fix merged? Our team is currently blocked by this.

@samskalicky
Contributor

@zachgk assign [@szha ]

@zixuanweeei
Contributor

Do we have a timeline to get the correct fix merged? Our team is currently blocked by this.

Sorry for the late update. For now, we are focusing on the MXNet 1.6 upgrade, so this may be fixed after 1.6.

@zixuanweeei
Contributor

@ZhiminPeng We have just proposed a fix for this problem in PR #16657. I hope it resolves the issue.

@szha szha closed this as completed Nov 4, 2019