[Bug] Failed to evaluate gradient on samples with train_mode=False #16256
Comments
Hey, this is the MXNet Label Bot.
@TaoLv could you take a look?
I can reproduce the issue by changing the last line to …
Thanks for reporting this issue. Actually, we need a workspace to store the intermediate results of the RNN variants, like the output of every gate and the state of every step, which are created only in training mode. As for dropout, mxnet-mkl doesn't support it. For now, we don't have a good solution for your request. The RNN operator has a different mechanism than other operators. I will look for a solution. Any insights? @ZhiminPeng
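To illustrate why a fused RNN needs that workspace, here is a tiny vanilla-RNN sketch in NumPy (conceptual only, not MXNet's MKL-DNN code; sizes are arbitrary): backward-through-time reuses every hidden state cached during the forward pass, so those states cannot be discarded if gradients are wanted.

```python
import numpy as np

# Minimal vanilla-RNN forward/backward sketch (not MXNet's fused kernel) that
# illustrates why per-step states must be kept: backward reuses every h_t.
rng = np.random.default_rng(0)
T, n_in, n_h = 5, 3, 4
Wx = rng.standard_normal((n_in, n_h)) * 0.1
Wh = rng.standard_normal((n_h, n_h)) * 0.1
xs = rng.standard_normal((T, n_in))

# Forward: cache all hidden states (this list is the "workspace").
hs = [np.zeros(n_h)]
for t in range(T):
    hs.append(np.tanh(xs[t] @ Wx + hs[-1] @ Wh))

# Backward through time for loss L = sum(h_T): every step reads cached states.
dh = np.ones(n_h)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
for t in reversed(range(T)):
    dz = dh * (1.0 - hs[t + 1] ** 2)   # tanh' uses the cached step output
    dWx += np.outer(xs[t], dz)
    dWh += np.outer(hs[t], dz)         # needs the cached previous state
    dh = dz @ Wh.T

print(dWx.shape, dWh.shape)
```

If the forward pass ran in inference mode, that cache was never allocated, which matches the behavior the issue reports.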
@ZhiminPeng Could you give us some details about your application scenario? If you really need it to work, we have provided a temporary fix in 025a227. It would be highly appreciated if you could try it in your application, and feel free to tell us if there is any problem. Thanks.
As far as I know, the fused RNN operator needs a permanent workspace for storing intermediate results, which are used to calculate the gradients in the backward pass.
@zixuanweeei Thanks for looking into this. I would love to give your temporary fix a try. I installed MXNet through pip, so I wonder how I should pick up your change. My application scenario is model interpretation through integrated gradients, which requires evaluating the gradient of the model on a few samples.
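For context, the integrated-gradients method mentioned here averages the gradient along the straight-line path from a baseline to the input and scales it by the input difference. A minimal NumPy sketch on an analytic function (chosen here for illustration, not taken from the issue) shows the mechanics, including the "completeness" axiom that attributions sum to F(x) − F(baseline):

```python
import numpy as np

# Integrated gradients on an analytic function, F(x) = sum(x**2), approximated
# by a midpoint Riemann sum of gradients along the straight-line path from a
# baseline to the input.
def integrated_gradients(grad_fn, x, baseline, steps=200):
    alphas = (np.arange(steps) + 0.5) / steps           # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, dim) inputs
    avg_grad = grad_fn(path).mean(axis=0)               # path-averaged gradient
    return (x - baseline) * avg_grad

grad_fn = lambda p: 2.0 * p                # exact gradient of sum(x**2)
x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(grad_fn, x, baseline)

# Completeness: attributions sum to F(x) - F(baseline) = 14.
print(ig, ig.sum())
```

In the reporter's setting, `grad_fn` would be the network's input gradient, which is exactly what the inference-mode `backward` call fails to provide.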
A nice paper! Maybe we can talk about it in the future. I am not familiar with some of the "axiom"s yet 😄. To get our change working, I think you should build MXNet from source. That may involve some of the following steps,
You will get a path pointing at a subdirectory of the root directory of incubator-mxnet. |
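The steps alluded to above might look like the following shell sketch. The flags and layout are illustrative assumptions based on MXNet's make-based build of that era, not the comment's exact recipe (which was not preserved); consult the MXNet install docs for your platform.

```shell
# Hedged sketch: build MXNet from source with MKL-DNN enabled so the patched
# RNN code is picked up. Flags are illustrative, not authoritative.
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
git checkout 025a227   # the temporary fix referenced in this thread
make -j"$(nproc)" USE_BLAS=openblas USE_MKLDNN=1
# install the freshly built Python bindings in editable mode
cd python && pip install -e .
```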
The fix works |
@zixuanweeei the …
@ZhiminPeng Thank you for trying it. It should be noted that the fix is only temporary, and it may lose some performance.
@szha Thanks for your reply. I will delve into the concept of …
Do we have a timeline to get the correct fix merged? Our team is currently blocked by this. |
Sorry for the late update. For now, we are focusing on the MXNet 1.6 release. This may be fixed after 1.6.
@ZhiminPeng We just tried to solve the problem in PR #16657. I hope it will solve the issue. |
Description
I am working on using integrated gradients to interpret DL models. This method requires evaluating the gradient on a few samples. I understand that when evaluating the gradient, one should set `train_mode = False` to avoid the training-time behavior of Dropout layers. I was able to do so with feedforward networks and CNNs. But while experimenting with LSTM, calling `x.grad`
for the first time gives the error shown in the Error Message section, and calling it a second time returns a tensor of all zeros.

Environment info (Required)
Package used (Python/R/Scala/Julia):
I'm using Python
Error Message:
Minimum reproducible example
Steps to reproduce
Just run the pasted Python code
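The pasted script itself was not preserved in this copy of the issue. A hedged reconstruction of the kind of code described (Gluon LSTM, inference-mode gradient of output w.r.t. input; layer sizes are illustrative assumptions) might look like this, guarded so it degrades gracefully when mxnet is not installed:

```python
# Hedged sketch of the reported scenario: differentiate an LSTM's output
# w.r.t. its input with train_mode=False. Sizes are illustrative; the
# original pasted code was not preserved.
try:
    import mxnet as mx
    from mxnet import autograd, gluon

    net = gluon.rnn.LSTM(hidden_size=8)
    net.initialize()

    x = mx.nd.random.normal(shape=(10, 2, 4))   # (seq_len, batch, features)
    x.attach_grad()
    with autograd.record(train_mode=False):     # record an inference-mode graph
        y = net(x).sum()
    y.backward(train_mode=False)                # raises on affected builds
    grad = x.grad.asnumpy()
    outcome = "backward succeeded"
except ImportError:
    outcome = "mxnet not installed; sketch only"
except Exception as err:                        # MXNetError on affected builds
    outcome = f"reproduced the failure: {err}"

print(outcome)
```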
What have you tried to solve it?
Setting `train_mode = True` instead; this produces wrong gradients for models with a `Dropout` layer.
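To see why gradients taken with `train_mode = True` are wrong for Dropout models, here is a minimal NumPy sketch of inverted dropout (the standard scheme; not MXNet's actual kernel): in training mode the gradient flows through a random rescaled mask, while inference mode is the exact identity.

```python
import numpy as np

# Inverted dropout: active in training, identity at inference. Gradients
# taken in train mode flow through a random mask, which is why
# train_mode=True yields noisy, "wrong" input gradients.
def dropout(x, p, train, rng=None):
    if not train or p == 0.0:
        return x                       # inference: exact identity
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p    # keep each unit with prob 1 - p
    return x * mask / (1.0 - p)        # rescale so E[output] == x

x = np.ones(8)
print(dropout(x, 0.5, train=False))    # identity: all ones
print(dropout(x, 0.5, train=True))     # entries randomly zeroed or doubled
```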