[Bugfix] MX utest traversal memory corruption #312
It seems the code below works fine. Not sure what the difference is.

    import os
    os.environ['DGLBACKEND'] = 'mxnet'
    import mxnet as mx
    import numpy as np

    def foo():
        # return dgl.utils.toindex([0, 5]).todgltensor()  # this line is how the bug is introduced in DGL
        x = mx.nd.array([0, 5], dtype='int64')
        dl = x.to_dlpack_for_read()
        return mx.nd.from_dlpack(dl)

    for i in range(10):
        y = foo()
        y.asnumpy()
Tried MX -> DLPack -> Torch; it reliably crashes with a segfault:

    import mxnet as mx
    from torch.utils import dlpack

    def foo():
        x = mx.nd.array([0, 5], dtype='int64')
        dl = x.to_dlpack_for_read()
        return dlpack.from_dlpack(dl)

    for i in range(10):
        y = foo()
        y.numpy()

Error:

Torch version 1.0.0.
@zheng-da Do you think we should approve this patch, or wait longer for the MX team's investigation?

It seems the conversion from DLPack to MXNet is different. Let's merge the PR to temporarily fix the bug first.

In MXNet, the
Description
Before release, we found that the MX unit tests sometimes failed on traversal-related APIs. The crash happens non-deterministically, so at the time we suspected memory corruption. After some investigation, we found that it is related to reference counting when an MX NDArray is converted to a DLPack tensor and then goes out of scope. The following steps reproduce the bug quite reliably:
- Use the dgllib/dgl-ci-mxnet-cpu docker image.
- Put the reproducing code in t.py.
- Run t.py using `for i in {0..100}; do echo $i && python3 t.py || break ; done`

The error should happen when asnumpy() is called, since the memory is corrupted.

For comparison, using PyTorch does not produce this error.
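Because the crash is non-deterministic, the reproduction relies on rerunning the script until it fails. A minimal Python sketch of the same retry loop as the shell one-liner in the steps above is shown here; the helper name `first_failing_run` and its `max_runs` parameter are hypothetical and are not part of DGL or MXNet.

```python
import subprocess
import sys


def first_failing_run(argv, max_runs=101):
    """Run `argv` repeatedly and stop at the first non-zero exit.

    Returns (run_index, returncode) for the first failing run, or None
    if every run succeeded. Hypothetical helper for illustration only.
    """
    for i in range(max_runs):
        completed = subprocess.run(argv)
        if completed.returncode != 0:
            # On POSIX, a negative return code -N means the child was
            # terminated by signal N (a segfault is SIGSEGV, usually 11).
            return i, completed.returncode
    return None


if __name__ == "__main__":
    # Assumes t.py sits in the current directory, as in the steps above.
    print(first_failing_run([sys.executable, "t.py"]))
```

Stopping at the first failure mirrors the `|| break` in the shell loop and keeps the failing run's exit status available for inspection.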
Suspected reason: the MX NDArray somehow destroys the underlying DLPack tensor even though our NDArray owns the underlying data. We need help from the MX team members who implemented this.
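To make the suspected failure mode concrete, here is a toy pure-Python model of DLPack-style ownership. It is only a sketch under stated assumptions: `Buffer`, `ManagedTensor`, `Producer`, and the `retain_on_export` flag are hypothetical names, and nothing here reflects MXNet's actual implementation. The one thing it borrows from DLPack is the idea that an exported tensor carries a deleter callback, and that the producer is expected to keep the data alive until the consumer calls it.

```python
class Buffer:
    """Stand-in for memory owned by a framework tensor (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)
        self.freed = False

    def free(self):
        self.freed = True
        self.values = None


class ManagedTensor:
    """Toy analogue of an exported tensor: a view plus a deleter callback."""

    def __init__(self, buf, deleter):
        self.buf = buf
        self.deleter = deleter


class Producer:
    """Toy tensor that frees its buffer when its reference count hits zero."""

    def __init__(self, values, retain_on_export):
        self.buf = Buffer(values)
        self.refcount = 1
        self.retain_on_export = retain_on_export

    def _release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.buf.free()

    def to_dlpack(self):
        if self.retain_on_export:
            # Correct behaviour: the export holds its own reference.
            self.refcount += 1
            return ManagedTensor(self.buf, self._release)
        # Assumed buggy behaviour: the export does not keep the data alive.
        return ManagedTensor(self.buf, lambda: None)

    def drop(self):
        """Model the producer going out of scope."""
        self._release()


def read_after_producer_drop(retain_on_export):
    producer = Producer([0, 5], retain_on_export)
    managed = producer.to_dlpack()
    producer.drop()  # the temporary dies, like `x` when foo() returns
    if managed.buf.freed:
        return None  # real code would be reading freed memory here
    result = list(managed.buf.values)
    managed.deleter()  # consumer is done; the buffer may now be released
    return result
```

In this model, `read_after_producer_drop(True)` still sees the data, while `read_after_producer_drop(False)` finds the buffer already freed, which is the situation that would surface as corruption in a later read.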
Temporary workaround: avoid using a temporary Index object. See the fixes for more details.
Might be related to #291.
@zheng-da