This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNet 1.5.0.rc2] Issues with asnumpy() method #15431

Closed
Wallart opened this issue Jul 2, 2019 · 16 comments

@Wallart

Wallart commented Jul 2, 2019

Hello,

I've decided to try MXNet 1.5.0.rc2.
I am seeing a lot of crashes caused by asnumpy() calls such as the following:

phase = nd.array(np.arctan2(imag_part.asnumpy(), real_part.asnumpy()))

phase = nd.array(np.arctan2(imag_part.asnumpy(), real_part.asnumpy()))
  File "/opt/miniconda3/envs/intelpython3/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/opt/miniconda3/envs/intelpython3/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: std::exception

The same issue occurs if I store the asnumpy() result in a temporary variable. The crashes seem random.

I didn't have any problems with MXNet 1.4.1.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@Wallart Wallart changed the title [MXNet 1.5] Issues with asnumpy() method [MXNet 1.5.0.rc2] Issues with asnumpy() method Jul 2, 2019
@roywei
Member

roywei commented Jul 2, 2019

@Wallart Hi, thanks for this issue!
Could you provide the following to help us reproduce the error?

  1. build flags or pip versions
  2. machine type and device context (CPU or GPU)
  3. values for imag_part and real_part

Running the following commands works fine for me on CPU with MXNet built with MKL-DNN:

>>> import mxnet as mx
>>> import numpy as np
>>> x = mx.nd.array([-1, +1, +1, -1])
>>> y = mx.nd.array([-1, -1, +1, +1])
>>> phase = mx.nd.array(np.arctan2(x.asnumpy(), y.asnumpy()))
>>> phase
[-2.3561945  2.3561945  0.7853982 -0.7853982]
<NDArray 4 @cpu(0)>
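
For a report like the one requested above, a small script can gather platform details and installed package versions in one place. The sketch below uses only the Python standard library (3.8 or later); `collect_env_report` is a hypothetical helper, not part of MXNet. It only sees packages that expose installation metadata, so a custom source build may or may not show up, and build flags still have to be reported separately.

```python
import platform
import sys
from importlib import metadata  # standard library in Python 3.8+


def collect_env_report(prefixes=("mxnet", "numpy")):
    """Return basic platform info plus versions of matching installed packages.

    `collect_env_report` is a hypothetical helper, not part of MXNet.
    `prefixes` must be a tuple of lowercase name prefixes.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Covers pip variants such as mxnet, mxnet-mkl or mxnet-cu101-mkl.
        if name and name.lower().startswith(prefixes):
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": packages,
    }


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Pasting this output next to the build flags makes it easier to tell a pip wheel apart from a custom build.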

@Wallart
Author

Wallart commented Jul 2, 2019

  1. Here are my build flags: https://pastebin.com/bwRX0tXN
    The only difference from my older builds (1.3, 1.4, etc.) is that USE_GPERFTOOLS is set to false.

  2. I am running the code on a machine equipped with:

  • Intel i7 7700k
  • 16 GB of RAM
  • 2 GTX 1080 Ti

    I am using Intel Python 3.6.8 with all MKL/MKLDNN libs. Everything is installed through miniconda, inside a Docker container.

  3. In fact I am trying to implement Tacotron 2, and the issue occurs in the dataloader, while the tensors are still on the CPU. Their shape is (1, 513, 533) and they are full of zeros.

I get the same results as you on both cpu and gpu(0).

@Wallart
Author

Wallart commented Jul 2, 2019

I just discovered that the crashes are not random. When I put a breakpoint on the faulty lines, no exception is raised when I resume execution.

The same error occurs with this kind of instruction. The line passes when a breakpoint is used:
return nd.log(nd.clip(x, a_min=clip_val, a_max=x.max().asscalar())) * c

return nd.log(nd.clip(x, a_min=clip_val, a_max=x.max().asscalar())) * c
  File "/opt/miniconda3/envs/intelpython3/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/ndarray/ndarray.py", line 2014, in asscalar
    return self.asnumpy()[0]
  File "/opt/miniconda3/envs/intelpython3/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/opt/miniconda3/envs/intelpython3/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: std::exception
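
MXNet queues NDArray operations asynchronously and only blocks at synchronization points such as asnumpy(), asscalar(), wait_to_read() or mx.nd.waitall(), so an exception reported at asnumpy() can originate from an earlier operation. One way to narrow down the failing step is to force a synchronization after each stage. The sketch below shows that pattern with a generic `sync` callable so it runs without MXNet; `run_with_sync` and the stage names are hypothetical, and with MXNet the `sync` argument is assumed to be `mx.nd.waitall`.

```python
def run_with_sync(stages, sync, value):
    """Run named stages in order, calling `sync` after each one.

    stages: list of (name, callable) pairs; each callable takes and returns a value.
    sync:   zero-argument callable that blocks until pending work is done
            (assumed to be `mx.nd.waitall` when used with MXNet).
    Raises RuntimeError naming the first stage whose work fails.
    """
    for name, stage in stages:
        try:
            value = stage(value)
            sync()  # deferred errors from this stage surface here
        except Exception as exc:
            raise RuntimeError(f"stage {name!r} failed: {exc}") from exc
    return value


if __name__ == "__main__":
    # Stand-in stages using plain Python numbers instead of NDArrays.
    stages = [
        ("scale", lambda x: x * 2),
        ("shift", lambda x: x + 1),
    ]
    print(run_with_sync(stages, sync=lambda: None, value=3))
```

Synchronizing after every stage slows execution down, so this is a debugging aid rather than something to leave in a training loop.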

@roywei
Member

roywei commented Jul 3, 2019

Hi @Wallart, running the two faulty lines you provided on their own does not throw any errors.
Are these lines all in your dataloader code? Could you provide a minimal reproducible example?

Maybe the problem only appears when running the above code with a dataloader and multiple workers? I'm still not able to reproduce your error with your two lines.

@roywei
Member

roywei commented Jul 9, 2019

Hi @Wallart, are you able to come up with a reproducible example? I suspect it may only occur when using a dataloader with multiple workers, but I can't reproduce the error. You can also turn on the DEBUG flag to get more information on the error.

@Wallart
Author

Wallart commented Jul 10, 2019

@roywei I put the source code on GitHub.
The error occurs before the dataloader. If you use the dataset on its own, for example to plot a sample, you can reproduce it.
https://github.com/Wallart/gluon-tacotron2/blob/master/dataset/wav_dataset.py

@frankfliu
Contributor

@mxnet-label-bot add [Bug]

@marcoabreu marcoabreu added the Bug label Jul 10, 2019
@roywei
Member

roywei commented Jul 10, 2019

@Wallart your code is running fine on my machine. I'm testing on the LJ-Speech dataset.

I added a few lines in main to format the text files; everything is plotted and asnumpy() works fine. Could you try without Docker?

if __name__ == '__main__':
    logging.basicConfig()
    logging.getLogger().setLevel(logging.INFO)

    params = {
        'max_wav_value': 32768.0,  # for 16 bits files
        'sampling_rate': 22050,
        'filter_length': 1024,
        'hop_length': 256,
        'win_length': 1024,
        'n_mel_channels': 80,
        'mel_fmin': 0.0,
        'mel_fmax': 8000.0
    }
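    # open() does not expand '~', so resolve the dataset root explicitly
    # (this needs `import os` at the top of the file).
    root = os.path.expanduser('~/Downloads/LJSpeech-1.1')
    # Write each transcript next to its wav file as <id>.txt
    with open(os.path.join(root, 'metadata.csv'), encoding='utf-8') as f:
        for line in f:
            record = line.split('|')
            file_name = record[0] + '.txt'
            content = record[2]
            with open(os.path.join(root, 'wavs', file_name), 'w', encoding='utf-8') as text_output:
                text_output.write(content)

    french = WavDataset(os.path.join(root, 'wavs'), text_to_sequence, **params)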
    assert type(french[0]) == tuple

[Image: Screen Shot 2019-07-10 at 2 16 55 PM]

@Wallart
Author

Wallart commented Jul 24, 2019

@roywei I reproduced the exact same environment outside Docker. Still using intelpython through miniconda, same build flags, and the error is still there.

EDIT: I uninstalled my custom build of mxnet and did a 'pip install mxnet==1.5.0', and now it's working (I don't know which flags are used for the pip release).
The newest release of 1.5.0 might have fixed the problem, or maybe my build flags are involved.

I will build the newer 1.5.0 to see if I can reproduce.

@Wallart
Author

Wallart commented Jul 24, 2019

Same problem. Something in the way I am building MXNet makes it crash.

EDIT: 'pip install mxnet-cu101-mkl==1.5.0' works. I will try to rebuild without the LAPACK flag.

@roywei
Member

roywei commented Jul 24, 2019

You can find the build flags used for pip packages here: https://github.com/apache/incubator-mxnet/tree/master/make/pip

@Wallart
Author

Wallart commented Aug 6, 2019

I finally found the issue. At runtime I was linking the outdated MKLDNN (v0.14) provided by Anaconda, whereas MXNet is probably using v1.0+.

@TaoLv
Member

TaoLv commented Aug 6, 2019

@Wallart, the master branch uses v0.20 of MKL-DNN while the 1.5.0 release uses v0.19. Please use the self-contained MKL-DNN in MXNet. The anaconda MKL-DNN distribution is not actively maintained.
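
To confirm which MKL-DNN binary a process actually loaded, one option on Linux is to inspect /proc/self/maps after importing MXNet. The following is a minimal sketch, assuming a Linux system; `find_loaded_libs` is a hypothetical helper, and the substring `mkldnn` is a guess at how the library file is named on a given install.

```python
import os


def find_loaded_libs(maps_text, needle):
    """Return sorted unique file paths from /proc/<pid>/maps text whose
    file name contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Line format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    # import mxnet  # uncomment so that MXNet's shared libraries are loaded first
    maps_path = "/proc/self/maps"
    if os.path.exists(maps_path):
        with open(maps_path) as f:
            for lib in find_loaded_libs(f.read(), "mkldnn"):
                print(lib)
```

If the printed path points into the conda environment's lib directory rather than the MXNet installation, the process picked up the external copy.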

@Wallart Wallart closed this as completed Aug 6, 2019
@aGiant

aGiant commented May 24, 2020

This issue comes back when using mxnet 1.6 with softmax outputs:

test p shape: (145, 49)
Traceback (most recent call last):
  File ".\data_process.py", line 69, in <module>
    print(p.asnumpy())
  File "C:\vnstudio\lib\site-packages\mxnet\ndarray\ndarray.py", line 2535, in asnumpy
    ctypes.c_size_t(data.size)))
  File "C:\vnstudio\lib\site-packages\mxnet\base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:54:17] c:\jenkins\workspace\mxnet-tag\mxnet\3rdparty\mshadow\mshadow\./extension/reshape.h:32: Check failed: ishape.Size() == shape.Size() (870 vs. 7105) : reshape size must match

@roywei
Copy link
Member

roywei commented May 24, 2020

Hi @aGiant, I think your issue is a different problem. The stack trace says (870 vs. 7105) : reshape size must match. Could you check your input and how you are reshaping it?
If you believe it's a bug, please open a new issue.
Thanks
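
The failed check compares total element counts, which must be equal on both sides of a reshape. The element count of a shape is the product of its dimensions; for the printed shape (145, 49) that is 7105, which matches one side of the check, while 870 does not. A small sketch of that sanity check, with a hypothetical helper `can_reshape` (Python 3.8 or later for `math.prod`):

```python
from math import prod  # Python 3.8+


def can_reshape(src_shape, dst_shape):
    """Return True when both shapes hold the same number of elements."""
    return prod(src_shape) == prod(dst_shape)


if __name__ == "__main__":
    print(prod((145, 49)))                  # 7105
    print(can_reshape((145, 49), (7105,)))  # True
    print(can_reshape((870,), (145, 49)))   # False
```

Printing the shapes of the inputs just before the failing operation shows which array holds 870 elements.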


7 participants