[Bugfix] MX utest traversal memory corruption #312

jermainewang · 2018-12-16T06:59:12Z

Description

Before release, we found that MX utest sometimes failed on traversal related APIs. The crash happens non-deterministically, so at that time we suspected this to be a memory corruption. Did some investigation and found that it was related to the reference counting when an MX NDArray is converted to a DLPack and then gone out of scope. Following steps could reproduce this bug quite reliably:

Use dgllib/dgl-ci-mxnet-cpu docker image.
Create a file t.py with codes:

import os
os.environ['DGLBACKEND'] = 'mxnet'
import mxnet as mx
import numpy as np
import dgl

def foo():
    #return dgl.utils.toindex([0, 5]).todgltensor()  # this line is how bug is introduced in DGL.
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dgl.ndarray.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.asnumpy()

Run the t.py using for i in {0..100}; do echo $i && python3 t.py || break ; done

The error should happen when asnumpy() is called since the memory is corrupted.

As comparison, using Pytorch does not inflict such error:

import torch as th
from torch.utils import dlpack
import numpy as np
import dgl

def foo():
    x = th.tensor([0, 5], dtype=th.int64)
    dl = dlpack.to_dlpack(x)
    return dgl.ndarray.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.asnumpy()

Suspect reason: MX NDArray somehow destroyed the underlying DLPack tensor even though our NDArray has owned the underlying data. Need help from MX team who implemented this.

Temporary workaround: Avoid use temporary Index object. See the fixes for more details.

Might related to #291 .

@zheng-da

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature]])
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented
To the my best knowledge, examples are either not affected by this change,
or have been fixed to be compatible with this change
Related issue is referred in this PR

zheng-da · 2018-12-16T07:50:30Z

it seems the code below works fine. not sure what the difference is between mx.nd.from_dlpack and dgl.ndarray.from_dlpack.

import os
os.environ['DGLBACKEND'] = 'mxnet'
import mxnet as mx
import numpy as np

def foo():
    #return dgl.utils.toindex([0, 5]).todgltensor()  # this line is how bug is introduced in DGL.
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return mx.nd.from_dlpack(dl)
    
for i in range(10):
    y = foo()
    y.asnumpy()

jermainewang · 2018-12-16T17:32:06Z

Tried MX -> DLPack -> Torch, it can reliably crash by segfault:

import mxnet as mx
from torch.utils import dlpack

def foo():
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dlpack.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.numpy()

Error:

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x1fef5a) [0x7f4a09186f5a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31383b6) [0x7f4a0c0c03b6]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f4a259324b0]
[bt] (3) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::TypeDefault::tensorFromBlob(void*, c10::ArrayRef<long>, c10::ArrayRef<long>, std::function<void (void*)> const&) const+0x61) [0x7f4996c4c741]
[bt] (4) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::fromDLPack(DLManagedTensor const*)+0x29f) [0x7f4996871e2f]
[bt] (5) /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so(THPModule_fromDLPack(_object*, _object*)+0x41) [0x7f49e2e2f341]
[bt] (6) python3(PyEval_EvalFrameEx+0x4d06) [0x53b486]
[bt] (7) python3(PyEval_EvalFrameEx+0x4b14) [0x53b294]
[bt] (8) python3() [0x53fc97]
[bt] (9) python3(PyEval_EvalCode+0x1f) [0x5409bf]

Torch version 1.0.0.

jermainewang · 2018-12-17T03:59:44Z

@zheng-da Do you think we should approve this patch or wait more time for MX team's investigation?

zheng-da · 2018-12-17T04:56:15Z

it seems the conversion from dlpack to mxnet is different. let's merge the PR to temporarily fix the bug first.

wkcn · 2018-12-17T08:18:17Z

In MXNet, the strides of DLPack is nullptr, because "DLPack allow strides field to be nullptr, for mxnet do do not need to store strides, as we only support compact tensor for now."
https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L409

jermainewang and others added 3 commits December 15, 2018 17:38

WIP

3539c3c

temp fix mx traversal memory crash bug

201daf9

Merge branch 'master' into mx-debug

a87d352

jermainewang mentioned this pull request Dec 16, 2018

Converting MX array to DLPack crashes when MX array goes out-of-scope apache/mxnet#13658

Closed

Merge branch 'master' into mx-debug

2a4b071

zheng-da merged commit 0d0f443 into dmlc:master Dec 17, 2018

jermainewang deleted the mx-debug branch December 18, 2018 05:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix] MX utest traversal memory corruption #312

[Bugfix] MX utest traversal memory corruption #312

jermainewang commented Dec 16, 2018

zheng-da commented Dec 16, 2018

jermainewang commented Dec 16, 2018

jermainewang commented Dec 17, 2018

zheng-da commented Dec 17, 2018

wkcn commented Dec 17, 2018

[Bugfix] MX utest traversal memory corruption #312

[Bugfix] MX utest traversal memory corruption #312

Conversation

jermainewang commented Dec 16, 2018

Description

Checklist

zheng-da commented Dec 16, 2018

jermainewang commented Dec 16, 2018

jermainewang commented Dec 17, 2018

zheng-da commented Dec 17, 2018

wkcn commented Dec 17, 2018