This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

TBlob bug about dltensor #15931

Closed
hzfan opened this issue Aug 16, 2019 · 5 comments
Labels
Backend (Issues related to the backend of MXNet), Bug, Operator

Comments

@hzfan
Contributor

hzfan commented Aug 16, 2019

Description

TBlob does not disable or overload the default copy constructor and assignment operator, so the compiler-generated ones are used. This results in a shallow copy of dltensor_ (a field of type DLTensor in TBlob), leaving the copy holding a dangling pointer into freed memory.
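For context, here is a simplified view of the two types involved (an illustration only, with fields trimmed; not the actual MXNet/DLPack definitions):

// Simplified illustration, not the real headers.
struct DLTensor {
  void* data;
  int ndim;
  int64_t* shape;        // raw pointer; DLTensor does not own this memory
  // dtype, strides, byte_offset, ctx omitted
};

class TBlob {
 public:
  void* dptr_;
  mxnet::TShape shape_;  // owns the dimension buffer
  int type_flag_;
 private:
  DLTensor dltensor_;    // dltensor_.shape points into this->shape_.data()
};

Since no copy constructor or assignment operator is declared, the compiler-generated ones copy dltensor_.shape verbatim, so the copy keeps pointing at the source object's shape_ buffer.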

Environment info (Required)

Python 3.7.3
Built from source (master at 5a4c01b)

Minimum reproducible example

To reproduce this error, I made a minor change to the function NumpyDotForward (in src/operator/numpy/np_dot-inl.h) for illustration.

Here is the function after my modification; I changed one line and added two lines (marked by comments):

template<typename xpu>
inline void NumpyDotForward(const nnvm::NodeAttrs& attrs,
                            const OpContext& ctx,
                            const std::vector<TBlob>& inputs,
                            const std::vector<OpReqType>& req,
                            const std::vector<TBlob>& outputs) {
  using namespace mshadow;
  using namespace mxnet_op;

  CHECK_EQ(inputs.size(), 2U);
  CHECK_EQ(outputs.size(), 1U);

  const TBlob& a = inputs[0];
  const TBlob& b = inputs[1];
  // const TBlob& out = outputs[0];
  TBlob out = outputs[0];  // ** changed by me **
  const mxnet::TShape a_shape = a.shape_;
  const mxnet::TShape b_shape = b.shape_;
  out = out.reshape(out.shape_);  // ** added by me **
  out = TBlob(out.dltensor());  // ** added by me **
  MSHADOW_REAL_TYPE_SWITCH(out.type_flag_, DType, {
    if (b_shape.ndim() < 3) {
      // Case 1, 2, 3, 4, 5: a is N-D array (N >= 1) and b is vector or matrix, sum product
      //        over the last axis of a and the first axis of b
      TensordotIntAxesImpl<xpu>(1, ctx, a, b, out, req[0]);
    } else {
      // Case 3, 5.5: a is N-D array and b is M-D array (M > 2), sum product over the last axis
      //         of a and the 2nd-to-last axis of b
      const Tuple<int> a_axes_summed({a_shape.ndim() - 1});
      const Tuple<int> b_axes_summed({b_shape.ndim() - 2});
      TensordotImpl<xpu>(a_axes_summed, b_axes_summed, ctx, a, b, out, req);
    }
  });
}

Steps to reproduce

  1. replace NumpyDotForward with the above one
  2. build
  3. run the following
from mxnet import np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1], [1, 1]])
np.dot(a, b)

The expected result is

array([[ 6.,  6.],
       [15., 15.]])

But the real result is

array([[0., 0.],
       [0., 0.]])

Possible cause of this problem

TBlob.dltensor_.shape is a pointer. When TBlob b is assigned to TBlob a, the pointer gets shallow copied:

a.dltensor_.shape = b.dltensor_.shape

But b.dltensor_.shape points to b.shape_.data(). So when b is a temporary (like the return value of TBlob::reshape()), b.shape_.data() is destroyed as soon as the temporary goes out of scope, and a.dltensor_.shape is left pointing at invalid memory.
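A minimal sketch of the failure mode, mirroring the two lines added to NumpyDotForward above (illustrative only, not code from the repo):

// Illustrative only: how the default assignment leaves a dangling pointer.
void Demo(TBlob out) {
  // reshape() returns a temporary TBlob whose dltensor_.shape points
  // into that temporary's own shape_ buffer.
  out = out.reshape(out.shape_);
  // The compiler-generated operator= copied dltensor_.shape verbatim.
  // The temporary is destroyed at the end of the statement above, so
  // out.dltensor().shape now points into freed memory, and
  // TBlob(out.dltensor()) reads garbage, hence the all-zero result.
}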

Quick fix (IMO)

  • disable the default assignment operator and copy constructor (declare them private), or
  • overload them and call SetDLTensor to avoid the shallow copy (see the sketch below)
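A hedged sketch of the second option (SetDLTensor is the existing private helper in TBlob; passing it dev_mask()/dev_id() is my assumption about its signature, and this is not the exact patch from #15937):

// Sketch: copy the plain fields, then rebuild dltensor_ so that its
// shape pointer targets *this* object's own shape_ buffer.
TBlob(const TBlob& other)
    : dptr_(other.dptr_),
      shape_(other.shape_),
      type_flag_(other.type_flag_) {
  SetDLTensor(other.dev_mask(), other.dev_id());  // assumed signature
}

TBlob& operator=(const TBlob& other) {
  dptr_ = other.dptr_;
  shape_ = other.shape_;
  type_flag_ = other.type_flag_;
  SetDLTensor(other.dev_mask(), other.dev_id());  // assumed signature
  return *this;
}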

Comments

This bug has nothing to do with np.dot. I just used it for illustration.

Thanks to @yzhliu @reminisce @haojin2 for their help.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@hzfan
Contributor Author

hzfan commented Aug 16, 2019

@mxnet-label-bot add [bug]

@marcoabreu marcoabreu added the Bug label Aug 16, 2019
@zachgk zachgk added Backend Issues related to the backend of MXNet Operator labels Aug 16, 2019
@reminisce
Contributor

I too encountered an illegal-memory-access error, probably resulting from the root cause revealed here (I called TBlob::reshape). I think overriding the assignment operator and copy constructor so that SetDLTensor() is explicitly called is reasonable.

@hzfan
Contributor Author

hzfan commented Aug 18, 2019

I created a fix in #15937 (comment) by overriding the assignment operator and copy constructor so that SetDLTensor() is explicitly called, as @reminisce suggested.
As @yzhliu mentioned, this is not backward compatible, and I'm not sure whether existing uses of TBlob assignment will be affected. Even so, I think overriding them is necessary; otherwise the DLTensor converted from a TBlob, which is used heavily where TVM is incorporated, will be corrupted.

@hzfan
Contributor Author

hzfan commented Aug 19, 2019

Fixed by #15937. Issue closed.

@hzfan hzfan closed this as completed Aug 19, 2019