This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-779]Add DLPack Transformation API #12047

Merged
merged 31 commits into apache:master on Sep 22, 2018

Conversation

wkcn
Copy link
Member

@wkcn wkcn commented Aug 6, 2018

Description

Add DLPack Transformation API

DLPack is important for sharing tensors and operators among Deep Learning Frameworks.
Issue: #11964

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add MXNDArrayToDLPack, MXNDArrayFromDLPack, MXNDArrayCallDLPackDeleter and MXNDArrayCallDLPackCapsuleDeleter in c_api
  • Add ToDLPack() and FromDLPack() for the NDArray class
  • Add the constructor TBlob(const DLTensor &dltensor)
  • Add the function DLDataTypeTransform in tensor_blob.h for the transformation from DLDataType to MShadow Type Flag
  • Add NDArray(const TBlob &data, int dev_id, const std::function<void()>& deleter) function
  • Add mx.nd.NDArray.to_dlpack_for_read, mx.nd.NDArray.to_dlpack_for_write, mx.nd.to_dlpack_for_read, mx.nd.to_dlpack_for_write and mx.nd.from_dlpack (see the usage sketch after this list)
  • Add unit-test for DLPack Transformation API
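
A minimal usage sketch of the added Python API, based on the names listed above (the printed output is what one would expect, not quoted from the PR):

import mxnet as mx

a = mx.nd.array([1, 2, 3])

# Export: returns a PyCapsule wrapping a DLManagedTensor.
pack = a.to_dlpack_for_read()    # or a.to_dlpack_for_write()

# Import: builds an NDArray that shares the same memory.
b = mx.nd.from_dlpack(pack)
print(b.asnumpy())               # expected: [1. 2. 3.]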

Comments

Extra Test with PyTorch

@wkcn wkcn changed the title from "Add DLPack Transformation API" to "[MXNET-779]Add DLPack Transformation API" Aug 6, 2018
Member

@tqchen tqchen left a comment

This is a good change; I will make a few additional comments below.

@@ -31,10 +31,10 @@

class NDArrayBase(object):
    """Base data structure for ndarray"""
-    __slots__ = ["handle", "writable"]
+    __slots__ = ["handle", "writable", "dlpack_handle"]
Member

do not put dlpack_handle inside it, instead directly return a pycapsule that contains the handle


    def __dealloc__(self):
        CALL(MXNDArrayFree(self.chandle))
        if self.cdlpack_handle:
            CALL(MXNDArrayCallDLPackDeleter(self.cdlpack_handle))
Member

do not put dlpack_handle inside the object

struct NDArrayDLManager {
  NDArray handle;          // ref NDArray
  DLManagedTensor tensor;
  TShape strides;          // store variable strides
Member

DLPack allows the strides field to be nullptr. For MXNet we do not need to store strides, since we only support compact tensors for now.

Member

During conversion to an MXNet tensor, check the strides field and see if it is compact.

Member Author

I will fix it.
It seems that PyTorch's torch.utils.dlpack.from_dlpack doesn't allow strides to be nullptr.

if (!is_none()) {
  dlmanager->tensor.dl_tensor = data().dltensor();
  // assign value for dl_tensor.strides
  if (!dlmanager->tensor.dl_tensor.strides) {
Member

strides == nullptr

@tqchen
Member

tqchen commented Aug 6, 2018

see my additional comments in #11964

@wkcn
Member Author

wkcn commented Aug 7, 2018

@tqchen Thank you!
I have changed it as you suggested.

* \param out_dlpack pointer holder to get pointer of DLManagedTensor
* \return 0 when success, -1 when failure happens
*/
MXNET_DLL int MXNDArrayToDLPackForRead(NDArrayHandle handle,
Member

From an API point of view, we can just expose ToDLPack, and in the Python API explicitly call wait_for_read and wait_for_write.

Member Author

I will modify it.

* \param dlpack_capsule the pointer of a PyCapsule storing DLManagedTensor
* \return 0 when success, -1 when failure happens
*/
MXNET_DLL void MXNDArrayCallDLPackCapsuleDeleter(PyObjectHandle dlpack_capsule);
Member

I am less certain why we need the deleter function here; can it be handled directly on the Python/Cython side?

Member Author
@wkcn wkcn Aug 9, 2018

I tried to implement a deleter function in Python; however, the deleter function may be released by the Python GC before it is called. See the test code: it raises a segmentation fault.
The Python frontend of MXNet uses both Python (ctypes) and Cython, and it may be impossible to implement the deleter function in ctypes.
So the deleter function should be implemented in C++.

Member

Take a look at the destructor in apache/tvm#1573

Member

There is some subtlety here, but they can nevertheless be implemented.

Member Author

I tried to write a Python function as the destructor, but it couldn't pass CI.
Please see the MXNet CI result:
all Windows test_dlpack cases failed because the destructor written in Python is released before it is called.

PyTorch implements the destructor using the Python C API in C++, and CuPy implements it in Cython, i.e. the code is compiled as C++.
However, MXNet uses both ctypes and Cython. I couldn't find a better way to implement the destructor except writing it in the MXNet C++ API.

Member Author
@wkcn wkcn Aug 11, 2018

Yes. I knew the trick and tried it in my previous PR, but it failed in the Windows test.
Related CI

It seems that the TVM CI doesn't have a Windows test, so its CI passes.
The reason is that the destructor is released by the Python GC before it is called,
and the GC release order differs between Linux and Windows.

On Linux, the destructor is called first and then released, so it works.
On Windows, however, the destructor is released before it is called, so it doesn't work.

Member

This is strange, as the destructor itself sits in the global scope and should be destructed after the dltensors (which have a local scope).

Member Author

Yes.
In the test code, it works on Linux but fails on Windows.

Member
@tqchen tqchen Aug 11, 2018

I see the following problems in the particular gist you pasted:

  • The destructor needs to be declared in the global scope (instead of being constructed while passing it as the argument)
  • The C string needs to outlive the capsule (construct a global string)
  • The function needs to outlive the capsule (construct the C func and put it under global scope/module)
import ctypes

# The destructor, the wrapped C function and the capsule name all live at
# global (module) scope so that they outlive the capsule itself.
cfunc = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

def dfunc(dltensor):
    pycaps = ctypes.cast(dltensor, ctypes.py_object)

c_destructor = cfunc(dfunc)
c_str_dltensor = ctypes.c_char_p(b"dltensor")

def test():
    a = ctypes.pythonapi.PyCapsule_New(1, c_str_dltensor, c_destructor)

test()

Member Author
@wkcn wkcn Aug 11, 2018

Thank you!
I found that it works on both Windows and Linux.
I have updated the PR.

* \brief constructor that construct TBlob from DLTensor
* \param DLTensor Object
*/
explicit TBlob(const DLTensor &dltensor) : dptr_(dltensor.data),
Member

Need to add compactness check

Member

Specifically, TBlob only supports compact tensors; we need to check that strides == nullptr or that the strides reflect a compact layout.
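
For illustration only, a minimal Python sketch of that compactness rule (a hypothetical helper, not the PR code): strides are compact when they match the row-major strides recomputed from the shape, and a nullptr/None strides field is treated as compact.

def is_compact(shape, strides):
    # None corresponds to strides == nullptr in the DLTensor.
    if strides is None:
        return True
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True

assert is_compact((2, 3, 4), None)
assert is_compact((2, 3, 4), (12, 4, 1))
assert not is_compact((2, 3, 4), (16, 4, 1))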

Member Author

Thanks. I will move the strides check from ndarray.cpp to tensor_blob.h.

# pylint: disable= no-member

-    def __init__(self, handle, writable=True):
+    def __init__(self, handle, writable=True, dlpack=None):
Member

dlpack should not be a member; the PyCapsule manages itself.

Member Author
@wkcn wkcn Aug 9, 2018

The dlpack in NDArray is the PyCapsule returned by to_dlpack_for_read/write.
When calling the FromDLPack function, the NDArray needs to manage the release of NDArrayDLManager::handle in ndarray.cc.
e.g.

import mxnet as mx

a = mx.nd.array([1, 2, 3])  # denote the TBlob storing the data as x
pack = a.to_dlpack_for_write()
b = mx.nd.from_dlpack(pack)
del a, pack
# a and the PyCapsule pack have been released;
# b needs to manage the release of TBlob x.

NDArray doesn't have a deleter function, so I made dlpack a member of NDArray.
When the reference count of dlpack in NDArray drops to zero, the TBlob is released.
Is there any other way to keep the reference?

Member

A better way is to keep the NDArray's shared_ptr inside the manager_ctx itself; you can take a look at TVM's NDArray to DLManagedTensor impl.

Member Author
@wkcn wkcn Aug 10, 2018

NDArray in MXNet and TVM are different. NDArray in TVM has the functions IncRef and DecRef to change the reference count, whereas NDArray in MXNet uses NDArray::ptr_ (std::shared_ptr) to manage the reference count, and NDArray::ptr_ is a private member of NDArray.
The PR is similar to the PyTorch DLPack implementation Code.
I added an NDArrayDLManager to manage the reference count of the NDArray. Code line 315 in src/ndarray/ndarray.cc

Setting dlpack as a member of the NDArray (Python class) is to avoid releasing the data storage, e.g. when the original NDArray and the PyCapsule (DLPack) are released, the new NDArray (generated by from_dlpack) still exists.

Member

you can create a new NDArray() that copies the original NDArray (which increases the refcount) and put that as the context

Member

In your case, when a gets deleted, b still holds an NDArrayDLManager, which is allocated by new, and that object still holds an NDArray (which holds a shared_ptr), so the original resource won't be released.

Member

You need to be careful to use shape from the same NDArray in your NDArrayDLManager

Member Author
@wkcn wkcn Aug 11, 2018

Thanks.
But I think the new NDArray object b couldn't hold the same shared_ptr as the original NDArray a. The new NDArray b only gets the data pointer, rather than a shared_ptr, from DLPack.

In the other case,

import torch
from torch.utils import dlpack
import mxnet as mx

a = torch.tensor([1, 2, 3])
pack = dlpack.to_dlpack(a)
b = mx.nd.from_dlpack(pack)
del a, pack

When dlpack.to_dlpack is called, PyTorch allocates an ATenDLMTensor, which increases the refcount of the Torch Tensor (code).
After the variables a and pack are released, the ATenDLMTensor still exists.
I think the deleter should be called by the new NDArray b when b is released. Refer to PyTorch FromDLPack. However, NDArray doesn't have an explicit deleter parameter.

In my PR, from_dlpack copies the dlpack object.
When the old dlpack pack is released, it doesn't call the deleter.
The new dlpack b.dlpack becomes a member of the new NDArray b as NDArray(handle=handle, dlpack=dlpack_copy).
When the new NDArray b is released, the new dlpack b.dlpack is released, which then calls the deleter. The deleter releases the NDArrayDLManager or ATenDLMTensor, and the refcount of the old NDArray a decreases by 1.

Member
@tqchen tqchen Aug 11, 2018

If you copy the NDArray, the copies hold the same shared_ptr to the data; note that a shared_ptr can be copied, and its ref counter is automatically managed.

Member Author
@wkcn wkcn Aug 12, 2018

I made a copy of the NDArray a member of NDArrayDLManager, and the copy increases the refcount.
I'm confused about how to modify the PR.
Suppose we create a new NDArray (Python) from DLPack, then delete the old NDArray (Python) and the PyCapsule (DLPack).

Which object will call the deleter function?

In my case, when a gets deleted, how does b hold the NDArrayDLManager? It seems that b only gets the raw data pointer from DLManagedTensor::dl_tensor.data; that pointer is not a shared pointer.
And how does b store the pointer to the NDArrayDLManager in the MXNet NDArray?
@tqchen

@tqchen
Member

tqchen commented Aug 8, 2018

Since it is a C API change @piiswrong can you also take a look?

@nswamy nswamy added Backend Issues related to the backend of MXNet pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Aug 8, 2018
@szha szha requested review from piiswrong and removed request for szha August 9, 2018 18:58
check_call(_LIB.MXNDArrayWaitToWrite(data.handle))
dlpack = DLPackHandle()
check_call(_LIB.MXNDArrayToDLPack(data.handle, ctypes.byref(dlpack)))
return ctypes.pythonapi.PyCapsule_New(dlpack, b'dltensor', pycapsule_dlpack_deleter)
Member

You cannot directly pass in a string here; instead, use a global string object. PyCapsule requires the name string to outlive the capsule.
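
For illustration, a minimal ctypes sketch of that constraint (the helper name is hypothetical, not the PR code): the capsule only stores a char* to its name, so the bytes object backing it must live at module scope rather than being created inline at the call site.

import ctypes

ctypes.pythonapi.PyCapsule_New.restype = ctypes.py_object

# Module-level name object: it must outlive every capsule created with it.
_c_str_dltensor = ctypes.c_char_p(b'dltensor')

def _to_capsule(handle, destructor):
    # Passing b'dltensor' inline here would let the temporary bytes object be
    # collected while the capsule still points at its buffer.
    return ctypes.pythonapi.PyCapsule_New(handle, _c_str_dltensor, destructor)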

Member Author

Solved it. Thank you!



typedef struct {
char py_object[16];
Member

This seems dangerous unless PyCapsule is already standardized and won't change.

Member Author

Yes, it's dangerous.
I want to call the function PyCapsule_GetPointer in c_api.cc;
however, MXNet doesn't include the Python.h header file.

Member Author

I have removed it in the latest PR.

@wkcn
Member Author

wkcn commented Aug 22, 2018

@tqchen
Hi! What should I do next for this PR?
For the release of DLPack,
here are the implementations in TVM, PyTorch and CuPy.

TVM src/runtime/ndarray.cc#L148

NDArray NDArray::FromDLPack(DLManagedTensor* tensor) {
  NDArray::Container* data = new NDArray::Container();
  data->deleter = Internal::DLPackDeleter;
  data->manager_ctx = tensor;
  data->dl_tensor = tensor->dl_tensor;
  return NDArray(data);
}

PyTorch aten/src/ATen/DLConvertor.cpp#L170

Tensor fromDLPack(const DLManagedTensor* src) {
  Backend backend = getATenBackend(src->dl_tensor.ctx);
  ScalarType stype = toScalarType(src->dl_tensor.dtype);
  auto deleter = [src](void * self) {
    src->deleter(const_cast<DLManagedTensor*>(src));
  };
  return getType(backend, stype).tensorFromBlob(
      src->dl_tensor.data,
      IntList(src->dl_tensor.shape, src->dl_tensor.ndim),
      IntList(src->dl_tensor.strides, src->dl_tensor.ndim),
      deleter);
}

CuPy cupy/core/dlpack.pyx#L231

mem = DLPackMemory(dltensor)
mem_ptr = memory.MemoryPointer(mem, mem.dlm_tensor.dl_tensor.byte_offset)
cupy_array = ndarray(shape_vec, cp_dtype, mem_ptr)

In all of them, the NDArray saves the deleter of the dlpack or DLManagedTensor.

So I added a new member dlpack to mx.nd.NDArray (Python) to store the dlpack.
When the NDArray is released, the dlpack will call the deleter.
Does NDArray (C++) manage the release of the dltensor?

@tqchen
Member

tqchen commented Aug 22, 2018

My point is that the DLManagedTensor should manage its own deletion, and you do not have to add a dlpack field to mx.nd.NDArray. Even when the frontend NDArray is released, the internal data will be kept alive until the PyCapsule gets released and the DLManagedTensor's deleter gets called.

@wkcn
Member Author

wkcn commented Aug 22, 2018

@tqchen Hi! I have modified the PR; the DLManagedTensor now manages itself.
However, there is a problem with that version of the PR (DLManagedTensor managing itself).
In this case,

import mxnet as mx
pack = mx.nd.array([1,2,3]).to_dlpack_for_write()
b = mx.nd.from_dlpack(pack)
del pack # DLManagedTensor's deleter gets called when `pack` is released.
print(b) # b != [1,2,3]

Does the NDArray Chunk (C++) need to manage the release of the dlpack?

I added a deleter to TBlob and NDArray::Chunk in the latest PR (add deleter for TBlob and Chunk in NDArray), which solves the problem.

@tqchen
Member

tqchen commented Aug 22, 2018

pack is only supposed to be consumed once; when from_dlpack is called, the pack itself should be marked as used, and its deleter should no longer be called when the capsule gets destroyed.
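
For reference, a minimal ctypes sketch of that consume-once convention (illustrative names, not necessarily the exact PR code): the consumer renames the capsule, e.g. from 'dltensor' to 'used_dltensor', so the capsule destructor can tell that ownership was transferred and skip the DLManagedTensor deleter.

import ctypes

_c_str_dltensor = ctypes.c_char_p(b'dltensor')
_c_str_used_dltensor = ctypes.c_char_p(b'used_dltensor')

def _mark_consumed(capsule):
    # from_dlpack has taken ownership of the DLManagedTensor: rename the
    # capsule so its destructor no longer triggers the deleter.
    ctypes.pythonapi.PyCapsule_SetName(ctypes.py_object(capsule),
                                       _c_str_used_dltensor)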

@wkcn
Member Author

wkcn commented Aug 22, 2018

@tqchen Yes. The latest PR implements this in ndarray.py,
and the new NDArray created by from_dlpack will call the deleter.

@wkcn
Member Author

wkcn commented Sep 9, 2018

@tqchen @piiswrong
Hi!
Is there any problem? Could you please merge the PR?
Thank you!

@@ -89,7 +92,8 @@ class TBlob {
  template<typename DType>
  TBlob(DType *dptr, const TShape &shape, int dev_mask, int dev_id = -1)
      : dptr_(dptr), shape_(shape),
-       type_flag_(mshadow::DataType<DType>::kFlag) {
+       type_flag_(mshadow::DataType<DType>::kFlag),
+       deleter_(nullptr) {
Member

We cannot introduce deleter to the TBlob

@@ -831,7 +855,7 @@ class NDArray {

  Chunk(const NDArrayStorageType storage_type_, const TBlob &data,
        const std::vector<TBlob> &aux_data, int dev_id)
-      : static_data(true), delay_alloc(false), storage_type(storage_type_) {
+      : static_data(true), delay_alloc(false), storage_type(storage_type_), deleter_(nullptr) {
Member

We cannot plug a deleter into TBlob and Chunk; this seems like it will cause more problems than it solves.

std::shared_ptr already supports a customized deleter, and maybe that can be used to solve this issue without having to hide a customized deleter in NDArray.

@tqchen
Member

tqchen commented Sep 9, 2018

@wkcn sorry, I was away on vacation recently; please see my comments.

@wkcn
Member Author

wkcn commented Sep 10, 2018

@tqchen Thank you! I will try to change it.

@wkcn
Member Author

wkcn commented Sep 11, 2018

@tqchen Hi! I have updated the code per your comments.
I added a new constructor NDArray(const TBlob &data, int dev_id, const std::function<void()>& deleter) to the NDArray class, because the performance of std::make_shared and std::shared_ptr::shared_ptr differs.

@tqchen
Member

tqchen commented Sep 21, 2018

I think it is good to go from my end. Thanks @wkcn. @piiswrong please take another look; if there are no issues I will merge this in two days.

@tqchen tqchen added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Sep 21, 2018
@tqchen
Member

tqchen commented Sep 21, 2018

@szha can you also take a round of review and if there is no problem, merge this in?

@wkcn
Member Author

wkcn commented Sep 21, 2018

@tqchen Thank you so much!

@tqchen tqchen merged commit 9f14a68 into apache:master Sep 22, 2018
@tqchen
Member

tqchen commented Sep 22, 2018

This is now merged

Labels
Backend Issues related to the backend of MXNet pr-awaiting-merge Review and CI is complete. Ready to Merge