This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[bug] Autograd throws an exception that was not caught in MXNet 1.6 #18789

Open
yzh119 opened this issue Jul 25, 2020 · 17 comments

@yzh119
Member

yzh119 commented Jul 25, 2020

Description

After the program finishes executing, the Autograd module throws an exception that is not caught:

Error in sys.excepthook:

Original exception was:

To Reproduce

Below is a minimal example to reproduce the bug:

from mxnet import nd
import mxnet as mx
x = mx.np.zeros((10,))
x.attach_grad()

class Op(mx.autograd.Function):
    def forward(self, x):
        out = x + 1
        return out

    def backward(self, grad):
        grad_x = grad
        return grad_x

op = Op()
with mx.autograd.record():
    y = op(x)
    y.sum().backward()

print(x.grad)

However, the script still prints the correct result before the error message appears:

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Error in sys.excepthook:

Original exception was:

Environment

----------Python Info----------
Version      : 3.7.6
Compiler     : GCC 7.3.0
Build        : ('default', 'Jan  8 2020 19:59:22')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /home/***/anaconda3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/***/anaconda3/lib/python3.7/site-packages/mxnet
Num GPUs     : 1
Commit Hash   : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
----------System Info----------
Platform     : Linux-5.7.8-200.fc32.x86_64-x86_64-with-fedora-32-Thirty_Two
system       : Linux
node         : LAPTOP-24KAE66Q.lan
release      : 5.7.8-200.fc32.x86_64
version      : #1 SMP Thu Jul 9 14:34:51 UTC 2020
@yzh119
Member Author

yzh119 commented Jul 25, 2020

@sxjscience, @eric-haibin-lin do you have any idea what happened? I hit the same problem on an AWS p3 instance, and the error message persists after upgrading MXNet to version 2.0.

@szha
Member

szha commented Jul 25, 2020

I think the numpy array support in autograd.function was missed. #18790

@yzh119
Member Author

yzh119 commented Jul 26, 2020

@szha Thanks for your help, but I think the problem also exists for mxnet.ndarray arrays, not only for mxnet.numpy arrays.

@szha
Member

szha commented Jul 26, 2020

Did you run the above as a script, or in an interactive Python shell? I want to see whether the program terminated immediately after execution.

@yzh119
Member Author

yzh119 commented Jul 26, 2020

I ran the above code as a script.
If I run it in interactive mode, the exception message appears when I exit:

>>> from mxnet import nd
>>> import mxnet as mx
>>> x = mx.np.zeros((10,))
>>> x.attach_grad()
>>> class Op(mx.autograd.Function):
...     def forward(self, x):
...         out = x + 1
...         return out
...     def backward(self, grad):
...         grad_x = grad
...         return grad_x
... 
>>> op = Op()
>>> with mx.autograd.record():
...     y = op(x)
...     y.sum().backward()
... 
>>> print(x.grad)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
>>> 
>>> 
Error in sys.excepthook:

Original exception was:

@szha
Member

szha commented Jul 26, 2020

#18768 has probably fixed it. Would you try the nightly build and see if this is still an issue?
pip install --pre mxnet-cu100 -f https://dist.mxnet.io/python

@yzh119
Member Author

yzh119 commented Aug 5, 2020

@szha no, the problem still exists.

@szha
Member

szha commented Aug 5, 2020

In [3]: from mxnet import nd
   ...: import mxnet as mx
   ...: x = mx.np.zeros((10,))
   ...: x.attach_grad()
   ...: mx.npx.set_np()
   ...: class Op(mx.autograd.Function):
   ...:     def forward(self, x):
   ...:         out = x + 1
   ...:         return out
   ...:
   ...:     def backward(self, grad):
   ...:         grad_x = grad
   ...:         return grad_x
   ...:
   ...: op = Op()
   ...: with mx.autograd.record():
   ...:     y = op(x)
   ...:     y.sum().backward()
   ...:
   ...: print(x.grad)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

In [4]:
Do you really want to exit ([y]/n)?

@szha
Member

szha commented Aug 5, 2020

mx.npx.set_np() seems to be missing

@sxjscience
Member

Actually I cannot reproduce this error. @yzh119 Would you try again with the latest nightly version and with the following code snippet?

To install the nightly version:

# Install the version with CUDA 10.0
python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200802" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.1
python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20200802" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20200802" -f https://dist.mxnet.io/python

# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0b20200802" -f https://dist.mxnet.io/python

from mxnet import nd
import mxnet as mx
mx.npx.set_np()
x = mx.np.zeros((10,))
x.attach_grad()

class Op(mx.autograd.Function):
    def forward(self, x):
        out = x + 1
        return out

    def backward(self, grad):
        grad_x = grad
        return grad_x

op = Op()
with mx.autograd.record():
    y = op(x)
    y.sum().backward()

print(x.grad)

@yzh119
Member Author

yzh119 commented Aug 21, 2020

@sxjscience using the nightly build version does not work either:

[11:46:12] ../src/storage/storage.cc:198: Using Pooled (Naive) StorageManager for CPU
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Error in sys.excepthook:

Original exception was:

@sxjscience
Member

sxjscience commented Aug 21, 2020 via email

@sxjscience
Member

I confirmed that it does not happen if you run the code inside a Jupyter notebook; it only happens if you save it to a .py file and run that file.
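
Since the behaviour reported here differs between a notebook, an interactive shell, and a script, it can help to record which of these a given run was. The following is a minimal sketch using common heuristics only; `run_mode` is a hypothetical helper name and is not part of MXNet, and the checks are best-effort guesses rather than guarantees.

```python
import sys


def run_mode():
    """Best-effort guess at how this interpreter was started (heuristic only)."""
    # Jupyter kernels import ipykernel; its presence is a strong hint.
    if "ipykernel" in sys.modules:
        return "jupyter"
    # The plain REPL defines sys.ps1; `python -i` sets sys.flags.interactive.
    if hasattr(sys, "ps1") or sys.flags.interactive:
        return "interactive"
    return "script"


if __name__ == "__main__":
    print("run mode:", run_mode())
```

Printing this at the top of the repro makes it easier to compare reports from different setups.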

@szha
Member

szha commented Aug 21, 2020

@sxjscience sounds like a release order problem again. Will take a look

@szha
Member

szha commented Aug 31, 2020

I didn't find a way to get access to any actual error, and despite the message the program exited normally.
BTW I'm using OSX 10.15.6, which can trigger this error. @yzh119 @sxjscience are you also using a Mac when observing this problem? I was chatting with @wkcn offline about this and it seems that this error doesn't happen on Linux.

@wkcn
Member

wkcn commented Aug 31, 2020

I could not reproduce it with MXNet (CPU-only) 1.6 and 2.0 on Arch Linux, even when running a *.py file.

BTW, I used python 3.8.5.


I reproduced it on Ubuntu, Python 3.8.3, MXNet (only cpu) 1.6/2.0.
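
Because reproduction in this thread seems to depend on the OS and the Python version, it may help to print those details next to each report. The following is a minimal sketch; `describe_environment` is a hypothetical helper name, and MXNet is queried only if it can be imported.

```python
import importlib
import platform


def describe_environment():
    """Collect the details that appear to matter for reproducing this issue."""
    info = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }
    try:
        # Imported lazily so the helper also works where MXNet is absent.
        mx = importlib.import_module("mxnet")
        info["mxnet"] = getattr(mx, "__version__", "unknown")
    except (ImportError, OSError):
        info["mxnet"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key:15}: {value}")
```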

@szha
Member

szha commented Aug 31, 2020

I was only able to reproduce the problem on macOS with Python 3.7, and not with Python 3.8:

~/mxnet  master ✗
▶ /usr/local/Cellar/python@3.7/3.7.8_1/bin/python3.7 test.py
[21:06:24] ../src/storage/storage.cc:198: Using Pooled (Naive) StorageManager for CPU
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Error in sys.excepthook:

Original exception was:

~/mxnet  master ✗
▶ /usr/local/Cellar/python@3.8/3.8.5/bin/python3.8 test.py
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
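
Two observations above can be checked mechanically: the process exits normally despite the message, and the result differs between interpreters. One way is to run the script in a child process and inspect its exit code and output. The following is a minimal sketch, assuming the repro is saved as test.py; `run_python` and `MARKER` are hypothetical names, not part of MXNet, and the marker is searched for in both output streams rather than assuming which one it is written to.

```python
import subprocess
import sys

# The first line of the message reported in this thread.
MARKER = "Error in sys.excepthook"


def run_python(args, interpreter=sys.executable):
    """Run `interpreter args...` and report the exit code and whether MARKER appeared."""
    result = subprocess.run(
        [interpreter, *args],
        capture_output=True,  # requires Python 3.7+
        text=True,
    )
    return {
        "interpreter": interpreter,
        "returncode": result.returncode,
        "marker_seen": MARKER in result.stderr or MARKER in result.stdout,
        "stdout": result.stdout,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_exit.py test.py [other_interpreter ...]
    script, *interpreters = sys.argv[1:]
    for exe in interpreters or [sys.executable]:
        report = run_python([script], interpreter=exe)
        print(exe, "exit code:", report["returncode"],
              "marker seen:", report["marker_seen"])
```

Passing the paths of both the Python 3.7 and Python 3.8 binaries as extra arguments would give a side-by-side comparison of the two runs.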
