This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

TBlob bug about dltensor #15931

Closed
hzfan opened this issue Aug 16, 2019 · 5 comments
Labels
Backend (Issues related to the backend of MXNet), Bug, Operator

Comments

@hzfan
Contributor

hzfan commented Aug 16, 2019

Description

TBlob does not disable or overload the default copy constructor and assignment operator, so the compiler-generated ones are used. This results in a shallow copy of dltensor_ (a field of type DLTensor in TBlob), leaving the copy holding a dangling pointer into freed memory.
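For context, here is a simplified view of the two types involved (an illustration only, with fields trimmed; not the actual MXNet/DLPack definitions):

// Simplified illustration, not the real headers.
struct DLTensor {
  void* data;
  int ndim;
  int64_t* shape;        // raw pointer; DLTensor does not own this memory
  // dtype, strides, byte_offset, ctx omitted
};

class TBlob {
 public:
  void* dptr_;
  mxnet::TShape shape_;  // owns the dimension buffer
  int type_flag_;
 private:
  DLTensor dltensor_;    // dltensor_.shape points into this->shape_.data()
};

Since no copy constructor or assignment operator is declared, the compiler-generated ones copy dltensor_.shape verbatim, so the copy keeps pointing at the source object's shape_ buffer.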

Environment info (Required)

Python 3.7.3
Built from source (master at 5a4c01b)

Minimum reproducible example

To reproduce this error, I made a minor change to the function NumpyDotForward (in src/operator/numpy/np_dot-inl.h) for illustration.

Here is the function after my modification; I changed one line and added two lines (marked by comments):

template<typename xpu>
inline void NumpyDotForward(const nnvm::NodeAttrs& attrs,
                            const OpContext& ctx,
                            const std::vector<TBlob>& inputs,
                            const std::vector<OpReqType>& req,
                            const std::vector<TBlob>& outputs) {
  using namespace mshadow;
  using namespace mxnet_op;

  CHECK_EQ(inputs.size(), 2U);
  CHECK_EQ(outputs.size(), 1U);

  const TBlob& a = inputs[0];
  const TBlob& b = inputs[1];
  // const TBlob& out = outputs[0];
  TBlob out = outputs[0];  // ** changed by me **
  const mxnet::TShape a_shape = a.shape_;
  const mxnet::TShape b_shape = b.shape_;
  out = out.reshape(out.shape_);  // ** added by me **
  out = TBlob(out.dltensor());  // ** added by me **
  MSHADOW_REAL_TYPE_SWITCH(out.type_flag_, DType, {
    if (b_shape.ndim() < 3) {
      // Case 1, 2, 3, 4, 5: a is N-D array (N >= 1) and b is vector or matrix, sum product
      //        over the last axis of a and the first axis of b
      TensordotIntAxesImpl<xpu>(1, ctx, a, b, out, req[0]);
    } else {
      // Case 3, 5.5: a is N-D array and b is M-D array (M > 2), sum product over the last axis
      //         of a and the 2nd-to-last axis of b
      const Tuple<int> a_axes_summed({a_shape.ndim() - 1});
      const Tuple<int> b_axes_summed({b_shape.ndim() - 2});
      TensordotImpl<xpu>(a_axes_summed, b_axes_summed, ctx, a, b, out, req);
    }
  });
}

Steps to reproduce

  1. replace NumpyDotForward with the above one
  2. build
  3. run the following
from mxnet import np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1], [1, 1]])
np.dot(a, b)

The expected result is

array([[ 6.,  6.],
       [15., 15.]])

But the real result is

array([[0., 0.],
       [0., 0.]])

Possible cause of this problem

TBlob.dltensor_.shape is a pointer. When TBlob b is assigned to TBlob a, the pointer gets shallow copied:

a.dltensor_.shape = b.dltensor_.shape

But b.dltensor_.shape points to b.shape_.data(). So when b is a temporary (like the return value of TBlob::reshape()), b.shape_.data() is destroyed as soon as the temporary goes out of scope, and a.dltensor_.shape is left pointing at invalid memory.
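A minimal sketch of the failure mode, mirroring the two lines added to NumpyDotForward above (illustrative only, not code from the repo):

// Illustrative only: how the default assignment leaves a dangling pointer.
void Demo(TBlob out) {
  // reshape() returns a temporary TBlob whose dltensor_.shape points
  // into that temporary's own shape_ buffer.
  out = out.reshape(out.shape_);
  // The compiler-generated operator= copied dltensor_.shape verbatim.
  // The temporary is destroyed at the end of the statement above, so
  // out.dltensor().shape now points into freed memory, and
  // TBlob(out.dltensor()) reads garbage, hence the all-zero result.
}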

Quick fix (IMO)

  • disable the default assignment operator and copy constructor (declare them private), or
  • overload them and call SetDLTensor to avoid the shallow copy (see the sketch below)
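A hedged sketch of the second option (SetDLTensor is the existing private helper in TBlob; passing it dev_mask()/dev_id() is my assumption about its signature, and this is not the exact patch from #15937):

// Sketch: copy the plain fields, then rebuild dltensor_ so that its
// shape pointer targets *this* object's own shape_ buffer.
TBlob(const TBlob& other)
    : dptr_(other.dptr_),
      shape_(other.shape_),
      type_flag_(other.type_flag_) {
  SetDLTensor(other.dev_mask(), other.dev_id());  // assumed signature
}

TBlob& operator=(const TBlob& other) {
  dptr_ = other.dptr_;
  shape_ = other.shape_;
  type_flag_ = other.type_flag_;
  SetDLTensor(other.dev_mask(), other.dev_id());  // assumed signature
  return *this;
}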

Comments

This bug has nothing to do with np.dot. I just used it for illustration.

Thanks to @yzhliu @reminisce @haojin2 for their help.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@hzfan
Contributor Author

hzfan commented Aug 16, 2019

@mxnet-label-bot add [bug]

@marcoabreu marcoabreu added the Bug label Aug 16, 2019
@zachgk zachgk added Backend Issues related to the backend of MXNet Operator labels Aug 16, 2019
@reminisce
Contributor

I too encountered an illegal-memory-access error, probably resulting from the root cause revealed here (I called TBlob::reshape). I think overriding the assignment operator and copy constructor so that SetDLTensor() is explicitly called is reasonable.

@hzfan
Contributor Author

hzfan commented Aug 18, 2019

I created a fix in #15937 (comment) by overriding the assignment operator and copy constructor so that SetDLTensor() is explicitly called, as @reminisce suggested.
As @yzhliu mentioned, this is not backward compatible, and I'm not sure whether existing uses of TBlob assignment will be affected. Even so, I think overriding them is necessary; otherwise the DLTensor converted from a TBlob, which is used heavily where TVM is incorporated, will be corrupted.

@hzfan
Contributor Author

hzfan commented Aug 19, 2019

Fixed by #15937. Issue closed.

@hzfan hzfan closed this as completed Aug 19, 2019