Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] MX utest traversal memory corruption #312

Merged
merged 4 commits into from
Dec 17, 2018

Conversation

jermainewang
Copy link
Member

Description

Before release, we found that MX utest sometimes failed on traversal related APIs. The crash happens non-deterministically, so at that time we suspected this to be a memory corruption. Did some investigation and found that it was related to the reference counting when an MX NDArray is converted to a DLPack and then gone out of scope. Following steps could reproduce this bug quite reliably:

  1. Use dgllib/dgl-ci-mxnet-cpu docker image.
  2. Create a file t.py with codes:
import os
os.environ['DGLBACKEND'] = 'mxnet'
import mxnet as mx
import numpy as np
import dgl

def foo():
    #return dgl.utils.toindex([0, 5]).todgltensor()  # this line is how bug is introduced in DGL.
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dgl.ndarray.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.asnumpy()
  1. Run the t.py using for i in {0..100}; do echo $i && python3 t.py || break ; done

The error should happen when asnumpy() is called since the memory is corrupted.

As comparison, using Pytorch does not inflict such error:

import torch as th
from torch.utils import dlpack
import numpy as np
import dgl

def foo():
    x = th.tensor([0, 5], dtype=th.int64)
    dl = dlpack.to_dlpack(x)
    return dgl.ndarray.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.asnumpy()

Suspect reason: MX NDArray somehow destroyed the underlying DLPack tensor even though our NDArray has owned the underlying data. Need help from MX team who implemented this.

Temporary workaround: Avoid use temporary Index object. See the fixes for more details.

Might related to #291 .

@zheng-da

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature]])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the my best knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR

@zheng-da
Copy link
Collaborator

it seems the code below works fine. not sure what the difference is between mx.nd.from_dlpack and dgl.ndarray.from_dlpack.

import os
os.environ['DGLBACKEND'] = 'mxnet'
import mxnet as mx
import numpy as np

def foo():
    #return dgl.utils.toindex([0, 5]).todgltensor()  # this line is how bug is introduced in DGL.
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return mx.nd.from_dlpack(dl)
    
for i in range(10):
    y = foo()
    y.asnumpy()

@jermainewang
Copy link
Member Author

Tried MX -> DLPack -> Torch, it can reliably crash by segfault:

import mxnet as mx
from torch.utils import dlpack

def foo():
    x = mx.nd.array([0, 5], dtype='int64')
    dl = x.to_dlpack_for_read()
    return dlpack.from_dlpack(dl)

for i in range(10):
    y = foo()
    y.numpy()

Error:

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x1fef5a) [0x7f4a09186f5a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31383b6) [0x7f4a0c0c03b6]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f4a259324b0]
[bt] (3) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::TypeDefault::tensorFromBlob(void*, c10::ArrayRef<long>, c10::ArrayRef<long>, std::function<void (void*)> const&) const+0x61) [0x7f4996c4c741]
[bt] (4) /usr/local/lib/python3.5/dist-packages/torch/lib/libcaffe2.so(at::fromDLPack(DLManagedTensor const*)+0x29f) [0x7f4996871e2f]
[bt] (5) /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so(THPModule_fromDLPack(_object*, _object*)+0x41) [0x7f49e2e2f341]
[bt] (6) python3(PyEval_EvalFrameEx+0x4d06) [0x53b486]
[bt] (7) python3(PyEval_EvalFrameEx+0x4b14) [0x53b294]
[bt] (8) python3() [0x53fc97]
[bt] (9) python3(PyEval_EvalCode+0x1f) [0x5409bf]

Torch version 1.0.0.

@jermainewang
Copy link
Member Author

@zheng-da Do you think we should approve this patch or wait more time for MX team's investigation?

@zheng-da
Copy link
Collaborator

it seems the conversion from dlpack to mxnet is different. let's merge the PR to temporarily fix the bug first.

@zheng-da zheng-da merged commit 0d0f443 into dmlc:master Dec 17, 2018
@wkcn
Copy link
Member

wkcn commented Dec 17, 2018

In MXNet, the strides of DLPack is nullptr, because "DLPack allow strides field to be nullptr, for mxnet do do not need to store strides, as we only support compact tensor for now."
https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L409

@jermainewang jermainewang deleted the mx-debug branch December 18, 2018 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants