
Converting MX array to DLPack crashes when MX array goes out-of-scope #13658

Closed
jermainewang opened this issue Dec 16, 2018 · 15 comments

@jermainewang
Contributor

Description

Converting an MX NDArray to DLPack and then to another framework's DLPack-compatible NDArray causes memory corruption when the original MX NDArray goes out of scope.

Environment info (Required)

----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /usr/local/lib/python3.5/dist-packages/pip
----------MXNet Info-----------
Version      : 1.4.0
Directory    : /usr/local/lib/python3.5/dist-packages/mxnet
Commit Hash   : 1f73c5d9d308a690b57ea1b474d2ba99ca06c476
----------System Info----------
Platform     : Linux-4.19.4-arch1-1-ARCH-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : 17d02f89890e
release      : 4.19.4-arch1-1-ARCH
version      : #1 SMP PREEMPT Fri Nov 23 09:06:58 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
Stepping:              4
CPU MHz:               1812.064
CPU max MHz:           3900.0000
CPU min MHz:           1200.0000
BogoMIPS:              7384.55
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0056 sec, LOAD: 0.4655 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0277 sec, LOAD: 0.4154 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0039 sec, LOAD: 0.1236 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1844 sec, LOAD: 1.0354 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0045 sec, LOAD: 0.0329 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0125 sec, LOAD: 0.6710 sec.

Package used (Python/R/Scala/Julia): Python

Error Message:

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x1fef5a) [0x7f4a09186f5a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31383b6) [0x7f4a0c0c03b6]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f4a259324b0]
[bt] (3) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::TypeDefault::tensorFromBlob(void*, c10::ArrayRef<long>, c10::ArrayRef<long>, std::function<void (void*)> const&) const+0x61) [0x7f4996c4c741]
[bt] (4) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::fromDLPack(DLManagedTensor const*)+0x29f) [0x7f4996871e2f]
[bt] (5) /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so(THPModule_fromDLPack(_object*, _object*)+0x41) [0x7f49e2e2f341]
[bt] (6) python3(PyEval_EvalFrameEx+0x4d06) [0x53b486]
[bt] (7) python3(PyEval_EvalFrameEx+0x4b14) [0x53b294]
[bt] (8) python3() [0x53fc97]
[bt] (9) python3(PyEval_EvalCode+0x1f) [0x5409bf]

Minimum reproducible example

import mxnet as mx
from torch.utils import dlpack

def foo():
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dlpack.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.numpy()

Torch version v1.0.0

Steps to reproduce


  1. Use an Ubuntu 16.04 image (with MXNet and PyTorch installed)
  2. Run the above code

What have you tried to solve it?

Found this bug in the DGL project (dmlc/dgl#312). Tried:

  1. MXArray -> DLPack -> DGL Array: FAILED
  2. MXArray -> DLPack -> MXArray: SUCCEEDED
  3. MXArray -> DLPack -> Torch Tensor: FAILED
  4. Torch Tensor -> DLPack -> DGL Array: SUCCEEDED
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.

@zheng-da
Contributor

@tqchen @wkcn could you please take a look at this problem? Thanks

@wkcn
Member

wkcn commented Dec 17, 2018

DLPack allows the strides field to be nullptr, so MXNet does not need to store strides, as we only support compact tensors for now.
https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L409

However, PyTorch doesn't accept a DLPack tensor whose strides field is nullptr.
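
For reference, a consumer can derive row-major strides itself when strides is nullptr. The sketch below assumes dlpack.h's DLTensor layout; FillContiguousStrides is a hypothetical helper written for illustration, not an API of MXNet, PyTorch, or DLPack.

#include <dlpack/dlpack.h>
#include <cstdint>
#include <vector>

// Derive row-major (contiguous) strides, in elements, for a DLTensor
// whose strides field is nullptr, i.e. a compact tensor.
std::vector<int64_t> FillContiguousStrides(const DLTensor& t) {
  std::vector<int64_t> strides(t.ndim);
  int64_t acc = 1;
  for (int i = t.ndim - 1; i >= 0; --i) {
    strides[i] = acc;       // stride of dimension i
    acc *= t.shape[i];      // product of the trailing dimensions
  }
  return strides;
}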

@zheng-da
Contributor

The error might be trickier. For the DGL code below, the behavior is non-deterministic: we have to run it multiple times before we see a crash. It seems that some memory in the DLPack tensor exported from MXNet isn't kept referenced. However, if I use mx.nd.from_dlpack(dl) instead, it works fine. @wkcn do you have any suggestions?

import os
os.environ['DGLBACKEND'] = 'mxnet'
import mxnet as mx
import numpy as np
import dgl

def foo():
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dgl.ndarray.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.asnumpy()

@jermainewang
Contributor Author

@wkcn This explains the torch case, thank you. In DGL, we actually handled this:
https://github.com/dmlc/dgl/blob/632d598c77af616278bc0f2144a14958678dcbae/src/runtime/ndarray.cc#L83-L89
In DGL, we also only support contiguous tensors, so stride_ is never used elsewhere. The NDArray class is borrowed from the TVM project; I wonder whether this bug happens in TVM too.
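
Roughly, handling MXNet's nullptr strides on the consumer side amounts to a check like the one below. This is an illustrative sketch, not the actual DGL code, and IsContiguous is a made-up name: strides == nullptr is treated as compact, otherwise the strides must match a row-major layout.

#include <dlpack/dlpack.h>
#include <cstdint>

// Accept strides == nullptr (compact by DLPack convention) or strides
// that describe a row-major contiguous layout.
bool IsContiguous(const DLTensor& t) {
  if (t.strides == nullptr) return true;
  int64_t expected = 1;
  for (int i = t.ndim - 1; i >= 0; --i) {
    if (t.shape[i] != 1 && t.strides[i] != expected) return false;
    expected *= t.shape[i];
  }
  return true;
}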

@wkcn
Member

wkcn commented Dec 17, 2018

@zheng-da mx.nd.from_dlpack can accept a DLPack tensor whose strides is nullptr.
Could you please try the MXNet PR which fills in the strides of the DLPack tensor?
The code you wrote works fine for me with mxnet==1.5.0b20181216; I couldn't reproduce the error.

@wkcn
Member

wkcn commented Dec 17, 2018

@jermainewang Let me test it for TVM.

import tvm
import mxnet as mx

tvm_a = tvm.ndarray.array([1, 2, 3])
tvm_pack = tvm_a.to_dlpack()

mx_a = mx.nd.from_dlpack(tvm_pack)
print(mx_a)

mx_b = mx.nd.array([4, 5, 6])
mx_pack = mx_b.to_dlpack_for_write()
tvm_b = tvm.nd.from_dlpack(mx_pack)
print(tvm_b)

mxnet==1.5.0b20181216
tvm: 0.5.dev

It works fine in TVM.

@jermainewang
Contributor Author

@wkcn, try the following steps:

  1. Use an Ubuntu 16.04 image.
  2. Create a file t.py with the following code:
import mxnet as mx
import numpy as np
import tvm

def foo():
  x = mx.nd.array([0, 5], dtype='int64')
  dl = x.to_dlpack_for_read()
  return tvm.nd.from_dlpack(dl)

for i in range(10):
  y = foo()
  y.asnumpy()
  3. Run it with: for i in {0..100}; do echo $i && python3 t.py || break ; done

I used an Ubuntu docker image and could reproduce the error.

@wkcn
Member

wkcn commented Dec 18, 2018

@jermainewang
I use Arch Linux, Python 3.7.1, MXNet 1.5.0 installed via pip, and TVM 0.5.dev.
There is no error at all. It's strange.

I will test it on an Ubuntu server and in Docker.
Could you provide the error message from the TVM test?

@zheng-da
Contributor

It seems the bug only appears on Ubuntu 16.04, if I remember correctly. We tested on Ubuntu 18.04, and it works fine.

@jermainewang
Contributor Author

Yeah, the bug does not occur on my Arch machine either.

@jermainewang
Contributor Author

Error message:

root@17d02f89890e:/tmp/dgl# for i in {0..100}; do echo $i && python3 tt-tvm.py || break ; done
0
1
2
3
Traceback (most recent call last):
  File "tt-tvm.py", line 12, in <module>
    y.asnumpy()
  File "/tmp/tvm/python/tvm/_ffi/ndarray.py", line 264, in asnumpy
    check_call(_LIB.TVMArrayCopyToBytes(self.handle, data, nbytes))
  File "/tmp/tvm/python/tvm/_ffi/base.py", line 72, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [16:29:28] /tmp/tvm/src/runtime/ndarray.cc:256: Check failed: arr_size == nbytes (16330860332007321568 vs. 16) TVMArrayCopyToBytes: size mismatch

Stack trace returned 10 entries:
[bt] (0) /tmp/tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11](unsigned long)+0x1fd) [0x7fe7c8dd7f6d]
[bt] (1) /tmp/tvm/build/libtvm.so(TVMArrayCopyToBytes+0x665) [0x7fe7c9407415]
[bt] (2) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7fe80aa3fe20]
[bt] (3) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7fe80aa3f88b]
[bt] (4) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7fe80aa3a01a]
[bt] (5) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7fe80aa2dfcb]
[bt] (6) python3(PyObject_Call+0x47) [0x5c20e7]
[bt] (7) python3(PyEval_EvalFrameEx+0x4ed6) [0x53b656]
[bt] (8) python3(PyEval_EvalFrameEx+0x4b14) [0x53b294]
[bt] (9) python3() [0x53fc97]

@wkcn
Member

wkcn commented Dec 19, 2018

I reproduced the error on Ubuntu 16.04.
I printed the fields of the DLTensor in include/tvm/runtime/ndarray.h:

 // Debug prints added to GetDataSize (needs #include <iostream>):
 inline size_t GetDataSize(const DLTensor& arr) {
   size_t size = 1;
   std::cout << "DLTensor Ptr: " << &arr << std::endl;
   std::cout << "Shape Ptr: " << arr.shape << std::endl;
   std::cout << "ndim: " << arr.ndim << std::endl;
   for (tvm_index_t i = 0; i < arr.ndim; ++i) {
     std::cout << "shape[" << i << "] = " << arr.shape[i] << std::endl;
     size *= static_cast<size_t>(arr.shape[i]);
   }
   std::cout << "Bits: " << int(arr.dtype.bits) << std::endl;
   std::cout << "lanes: " << int(arr.dtype.lanes) << std::endl;
   size *= (arr.dtype.bits * arr.dtype.lanes + 7) / 8;
   return size;
 }

Debug output:

DLTensor Ptr: 0x3496040
Shape Ptr: 0x34839f0
ndim: 1
shape[0] = -2992783055023189668
Bits: 64
lanes: 1

In MXNet, include/mxnet/tensor_blob.h

  inline void SetDLTensor(int dev_mask, int dev_id) {
    dltensor_.data = dptr_;
    dltensor_.ctx = DLContext{static_cast<DLDeviceType>(dev_mask), dev_id};
    dltensor_.ndim = shape_.ndim();
    dltensor_.dtype = DTypeTransform(type_flag_);
    dltensor_.shape = shape_.data();
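    // (annotation) this aliases the TShape stored in this TBlob; see the
    // note below on what happens when shape_ is reallocated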
    dltensor_.strides = nullptr;
    dltensor_.byte_offset = 0;
  }

It seems that the TShape object shape_ has been reallocated: shape_ is stored in a TBlob instance, but the TBlob is mutable in NDArray.
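
If so, one way to decouple the exported tensor from the TBlob would be to have the DLManagedTensor own a copy of the shape (together with whatever keeps the underlying buffer alive). The sketch below only illustrates that idea and is not the code of the eventual fix; ExportCtx and ExportWithOwnedShape are hypothetical names.

#include <dlpack/dlpack.h>
#include <cstdint>
#include <vector>

// State owned by the exported DLManagedTensor.
struct ExportCtx {
  std::vector<int64_t> shape;   // owned copy of the shape
  // ... plus a handle that keeps the producer's buffer alive
};

DLManagedTensor* ExportWithOwnedShape(const DLTensor& src) {
  auto* ctx = new ExportCtx{std::vector<int64_t>(src.shape, src.shape + src.ndim)};
  auto* managed = new DLManagedTensor();
  managed->dl_tensor = src;
  managed->dl_tensor.shape = ctx->shape.data();  // point into the owned copy
  managed->manager_ctx = ctx;
  managed->deleter = [](DLManagedTensor* self) {
    delete static_cast<ExportCtx*>(self->manager_ctx);
    delete self;
  };
  return managed;
}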

@jermainewang
Contributor Author

That's strange. I thought the mutability was only for the data pointer, while the shape array should not change.

@wkcn wkcn mentioned this issue Dec 20, 2018
@wkcn
Member

wkcn commented Dec 20, 2018

I have fixed the bug in PR #13698
