
Converting MX array to DLPack crashes when MX array goes out-of-scope #13658

Closed
jermainewang opened this issue Dec 16, 2018 · 15 comments

@jermainewang
Contributor

Description

Converting an MX NDArray to DLPack and then to another framework's DLPack-compatible NDArray causes memory corruption when the original MX NDArray goes out of scope.

Environment info (Required)

----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /usr/local/lib/python3.5/dist-packages/pip
----------MXNet Info-----------
Version      : 1.4.0
Directory    : /usr/local/lib/python3.5/dist-packages/mxnet
Commit Hash   : 1f73c5d9d308a690b57ea1b474d2ba99ca06c476
----------System Info----------
Platform     : Linux-4.19.4-arch1-1-ARCH-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : 17d02f89890e
release      : 4.19.4-arch1-1-ARCH
version      : #1 SMP PREEMPT Fri Nov 23 09:06:58 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
Stepping:              4
CPU MHz:               1812.064
CPU max MHz:           3900.0000
CPU min MHz:           1200.0000
BogoMIPS:              7384.55
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0056 sec, LOAD: 0.4655 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0277 sec, LOAD: 0.4154 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0039 sec, LOAD: 0.1236 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1844 sec, LOAD: 1.0354 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0045 sec, LOAD: 0.0329 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0125 sec, LOAD: 0.6710 sec.

Package used (Python/R/Scala/Julia): Python

Error Message:

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x1fef5a) [0x7f4a09186f5a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31383b6) [0x7f4a0c0c03b6]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f4a259324b0]
[bt] (3) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::TypeDefault::tensorFromBlob(void*, c10::ArrayRef<long>, c10::ArrayRef<long>, std::function<void (void*)> const&) const+0x61) [0x7f4996c4c741]
[bt] (4) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::fromDLPack(DLManagedTensor const*)+0x29f) [0x7f4996871e2f]
[bt] (5) /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so(THPModule_fromDLPack(_object*, _object*)+0x41) [0x7f49e2e2f341]
[bt] (6) python3(PyEval_EvalFrameEx+0x4d06) [0x53b486]
[bt] (7) python3(PyEval_EvalFrameEx+0x4b14) [0x53b294]
[bt] (8) python3() [0x53fc97]
[bt] (9) python3(PyEval_EvalCode+0x1f) [0x5409bf]

Minimum reproducible example

import mxnet as mx
from torch.utils import dlpack

def foo():
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dlpack.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.numpy()

Torch version v1.0.0

Steps to reproduce


  1. Use an Ubuntu 16.04 image (with MXNet and PyTorch installed)
  2. Run the above code

What have you tried to solve it?

Found this bug in the DGL project (dmlc/dgl#312). Tried:

  1. MXArray -> DLPack -> DGL Array: FAILED
  2. MXArray -> DLPack -> MXArray: SUCCEEDED
  3. MXArray -> DLPack -> Torch Tensor: FAILED
  4. Torch Tensor -> DLPack -> DGL Array: SUCCEEDED
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.

@zheng-da
Contributor

@tqchen @wkcn could you please take a look at this problem? Thanks

@wkcn
Member

wkcn commented Dec 17, 2018

DLPack allows the strides field to be nullptr, so MXNet does not need to store strides, as we only support compact tensors for now.
https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L409

However, PyTorch doesn't accept a DLPack tensor whose strides field is nullptr.
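
For reference, a consumer can derive row-major strides itself when strides is nullptr. The sketch below assumes dlpack.h's DLTensor layout; FillContiguousStrides is a hypothetical helper written for illustration, not an API of MXNet, PyTorch, or DLPack.

#include <dlpack/dlpack.h>
#include <cstdint>
#include <vector>

// Derive row-major (contiguous) strides, in elements, for a DLTensor
// whose strides field is nullptr, i.e. a compact tensor.
std::vector<int64_t> FillContiguousStrides(const DLTensor& t) {
  std::vector<int64_t> strides(t.ndim);
  int64_t acc = 1;
  for (int i = t.ndim - 1; i >= 0; --i) {
    strides[i] = acc;       // stride of dimension i
    acc *= t.shape[i];      // product of the trailing dimensions
  }
  return strides;
}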

@zheng-da
Contributor

The error might be trickier. For the DGL code below, the behavior is non-deterministic: we have to run it multiple times before we see a crash. It seems that some memory in the DLPack tensor exported from MXNet isn't kept referenced. However, if I use mx.nd.from_dlpack(dl) instead, it works fine. @wkcn do you have any suggestions?

import os
os.environ['DGLBACKEND'] = 'mxnet'
import mxnet as mx
import numpy as np
import dgl

def foo():
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dgl.ndarray.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.asnumpy()

@jermainewang
Contributor Author

@wkcn This explains the torch case, thank you. In DGL, we actually handled this:
https://github.com/dmlc/dgl/blob/632d598c77af616278bc0f2144a14958678dcbae/src/runtime/ndarray.cc#L83-L89
In DGL, we also only support contiguous tensors, so stride_ is never used elsewhere. The NDArray class is borrowed from the TVM project; I wonder whether this bug happens in TVM too.
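
Roughly, handling MXNet's nullptr strides on the consumer side amounts to a check like the one below. This is an illustrative sketch, not the actual DGL code, and IsContiguous is a made-up name: strides == nullptr is treated as compact, otherwise the strides must match a row-major layout.

#include <dlpack/dlpack.h>
#include <cstdint>

// Accept strides == nullptr (compact by DLPack convention) or strides
// that describe a row-major contiguous layout.
bool IsContiguous(const DLTensor& t) {
  if (t.strides == nullptr) return true;
  int64_t expected = 1;
  for (int i = t.ndim - 1; i >= 0; --i) {
    if (t.shape[i] != 1 && t.strides[i] != expected) return false;
    expected *= t.shape[i];
  }
  return true;
}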

@wkcn
Member

wkcn commented Dec 17, 2018

@zheng-da mx.nd.from_dlpack can accept a DLPack tensor whose strides is nullptr.
Could you please try the MXNet PR which fills in the strides of the DLPack tensor?
The code you wrote works fine for me with mxnet==1.5.0b20181216; I couldn't reproduce the error.

@wkcn
Member

wkcn commented Dec 17, 2018

@jermainewang Let me test it for TVM.

import tvm
import mxnet as mx

tvm_a = tvm.ndarray.array([1, 2, 3])
tvm_pack = tvm_a.to_dlpack()

mx_a = mx.nd.from_dlpack(tvm_pack)
print(mx_a)

mx_b = mx.nd.array([4, 5, 6])
mx_pack = mx_b.to_dlpack_for_write()
tvm_b = tvm.nd.from_dlpack(mx_pack)
print(tvm_b)

mxnet==1.5.0b20181216
tvm: 0.5.dev

It works fine in TVM.

@jermainewang
Contributor Author

@wkcn, try the following steps:

  1. Use an Ubuntu 16.04 image.
  2. Create a file t.py with the following code:
import mxnet as mx
import numpy as np
import tvm

def foo():
  x = mx.nd.array([0, 5], dtype='int64')
  dl = x.to_dlpack_for_read()
  return tvm.nd.from_dlpack(dl)

for i in range(10):
  y = foo()
  y.asnumpy()
  3. Run it with: for i in {0..100}; do echo $i && python3 t.py || break ; done

I used an Ubuntu docker image and could reproduce the error.

@wkcn
Member

wkcn commented Dec 18, 2018

@jermainewang
I use Arch Linux, Python 3.7.1, MXNet 1.5.0 installed via pip, and TVM 0.5.dev.
There is no error at all. It's strange.

I will test it on an Ubuntu server and in Docker.
Could you provide the error message from the TVM test?

@zheng-da
Contributor

It seems the bug only appears on Ubuntu 16.04, if I remember correctly. We tested on Ubuntu 18.04, and it works fine.

@jermainewang
Contributor Author

Yeah, the bug does not occur on my Arch machine either.

@jermainewang
Contributor Author

Error message:

root@17d02f89890e:/tmp/dgl# for i in {0..100}; do echo $i && python3 tt-tvm.py || break ; done
0
1
2
3
Traceback (most recent call last):
  File "tt-tvm.py", line 12, in <module>
    y.asnumpy()
  File "/tmp/tvm/python/tvm/_ffi/ndarray.py", line 264, in asnumpy
    check_call(_LIB.TVMArrayCopyToBytes(self.handle, data, nbytes))
  File "/tmp/tvm/python/tvm/_ffi/base.py", line 72, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [16:29:28] /tmp/tvm/src/runtime/ndarray.cc:256: Check failed: arr_size == nbytes (16330860332007321568 vs. 16) TVMArrayCopyToBytes: size mismatch

Stack trace returned 10 entries:
[bt] (0) /tmp/tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11](unsigned long)+0x1fd) [0x7fe7c8dd7f6d]
[bt] (1) /tmp/tvm/build/libtvm.so(TVMArrayCopyToBytes+0x665) [0x7fe7c9407415]
[bt] (2) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7fe80aa3fe20]
[bt] (3) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7fe80aa3f88b]
[bt] (4) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7fe80aa3a01a]
[bt] (5) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7fe80aa2dfcb]
[bt] (6) python3(PyObject_Call+0x47) [0x5c20e7]
[bt] (7) python3(PyEval_EvalFrameEx+0x4ed6) [0x53b656]
[bt] (8) python3(PyEval_EvalFrameEx+0x4b14) [0x53b294]
[bt] (9) python3() [0x53fc97]

@wkcn
Member

wkcn commented Dec 19, 2018

I reproduced the error on Ubuntu 16.04.
I printed the fields of the DLTensor in include/tvm/runtime/ndarray.h:

 // Debug prints added to GetDataSize (needs #include <iostream>):
 inline size_t GetDataSize(const DLTensor& arr) {
   size_t size = 1;
   std::cout << "DLTensor Ptr: " << &arr << std::endl;
   std::cout << "Shape Ptr: " << arr.shape << std::endl;
   std::cout << "ndim: " << arr.ndim << std::endl;
   for (tvm_index_t i = 0; i < arr.ndim; ++i) {
     std::cout << "shape[" << i << "] = " << arr.shape[i] << std::endl;
     size *= static_cast<size_t>(arr.shape[i]);
   }
   std::cout << "Bits: " << int(arr.dtype.bits) << std::endl;
   std::cout << "lanes: " << int(arr.dtype.lanes) << std::endl;
   size *= (arr.dtype.bits * arr.dtype.lanes + 7) / 8;
   return size;
 }

Debug output:

DLTensor Ptr: 0x3496040
Shape Ptr: 0x34839f0
ndim: 1
shape[0] = -2992783055023189668
Bits: 64
lanes: 1

In MXNet, include/mxnet/tensor_blob.h

  inline void SetDLTensor(int dev_mask, int dev_id) {
    dltensor_.data = dptr_;
    dltensor_.ctx = DLContext{static_cast<DLDeviceType>(dev_mask), dev_id};
    dltensor_.ndim = shape_.ndim();
    dltensor_.dtype = DTypeTransform(type_flag_);
    dltensor_.shape = shape_.data();
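    // (annotation) this aliases the TShape stored in this TBlob; see the
    // note below on what happens when shape_ is reallocated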
    dltensor_.strides = nullptr;
    dltensor_.byte_offset = 0;
  }

It seems that the TShape object shape_ has been reallocated: shape_ is stored in a TBlob instance, but the TBlob is mutable in NDArray.
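
If so, one way to decouple the exported tensor from the TBlob would be to have the DLManagedTensor own a copy of the shape (together with whatever keeps the underlying buffer alive). The sketch below only illustrates that idea and is not the code of the eventual fix; ExportCtx and ExportWithOwnedShape are hypothetical names.

#include <dlpack/dlpack.h>
#include <cstdint>
#include <vector>

// State owned by the exported DLManagedTensor.
struct ExportCtx {
  std::vector<int64_t> shape;   // owned copy of the shape
  // ... plus a handle that keeps the producer's buffer alive
};

DLManagedTensor* ExportWithOwnedShape(const DLTensor& src) {
  auto* ctx = new ExportCtx{std::vector<int64_t>(src.shape, src.shape + src.ndim)};
  auto* managed = new DLManagedTensor();
  managed->dl_tensor = src;
  managed->dl_tensor.shape = ctx->shape.data();  // point into the owned copy
  managed->manager_ctx = ctx;
  managed->deleter = [](DLManagedTensor* self) {
    delete static_cast<ExportCtx*>(self->manager_ctx);
    delete self;
  };
  return managed;
}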

@jermainewang
Contributor Author

That's strange. I thought the mutability was only for the data pointer, while the shape array should not change.

@wkcn wkcn mentioned this issue Dec 20, 2018
@wkcn
Member

wkcn commented Dec 20, 2018

I have fixed the bug in PR #13698
