broadcast_to() behavior change after new 0D and uninitialized shape support added? #14954
Thanks for reporting this. That was changed by mistake. I will revert this change for broadcast_to.
Thank you for fixing this! Some additional suggestions:
@DickJC123 Could you point me to sockeye's specific failing test? Thanks.
I've been running in the top-level sockeye directory. I'm down to 5 failing tests; I assume you're asking about those 5. They are:
In each case the error is a device_type == 0 (neither cpu nor gpu) on a read-in NDArray. The loader is reading a file that has neither of the two magic numbers (so it falls back to 'legacy' loading) and, I think, gets confused when reading in a shape.
The 5 tests that fail do so because each test is trying to save and restore a 2D NDArray with a 0 size for one of the dimensions. I have not played with numpy compatibility modes for the tests, but in each case the last 3 of the following code lines are causing problems by converting, for example, (0,1) -> (-1,1):

So you might say 'put the 5 tests in numpy_compatibility_mode.' However, I started thinking about the notion of TShape serialization. Ideally, the representation (and what it means) should not depend on the mode of the writer. Why not always save TShapes in the expanded 'numpy_compatibility_mode' representation (unknown = -1, known-0 = 0, known-non-zero = N), regardless of the mode of the writer? The code in ndarray.cc might then be:
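A minimal sketch of what that always-expanded save path could look like, assuming a dmlc::Stream-style writer; the function name and framing are illustrative, not the actual ndarray.cc code:

```cpp
#include <dmlc/io.h>
#include <mxnet/tuple.h>

// Hypothetical sketch only, not MXNet's actual ndarray.cc. Serialize the
// shape exactly as the backend stores it in memory:
//   -1 = unknown, 0 = known zero size, N = known non-zero size,
// so the on-disk meaning never depends on the writer's
// numpy-compatibility mode.
void SaveTShape(dmlc::Stream* strm, const mxnet::TShape& shape) {
  const int32_t ndim = shape.ndim();  // may itself be -1 if ndim is unknown
  strm->Write(&ndim, sizeof(ndim));
  for (int i = 0; i < ndim; ++i) {
    const int64_t dim = shape[i];     // already -1 / 0 / N in memory
    strm->Write(&dim, sizeof(dim));
  }
}
```

This would round-trip a shape like (0,1) unchanged instead of collapsing it to (-1,1).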
@DickJC123 Sorry for the late reply, and thanks for the analysis and suggestion. I agree with what you proposed above. The ndarrays in memory in the backend are indeed as you described, i.e. -1 means unknown, 0 means known zero size, etc. The conversion to the numpy shape representation in NDArray's Save function was somehow never implemented. Would you mind submitting a fix, or if you are busy, I can take care of it? Thanks. A side question: since zero-size tensors cannot be digested by any operators in MXNet without numpy compatibility mode, what is the purpose of saving zero-size tensors in sockeye's tests?
Sorry for not getting to the review yet (I was sidetracked preparing issue 15034). During my review tomorrow, I'll be looking to see that a serialized NDArray (and its shapes) can be interpreted purely from the state of the file, without knowing the 'numpy compatibility mode' of the writer and without the reader being in a particular mode. I'm not sure this can be achieved with the current set of magic numbers. I'm assuming there's value going forward in being able to save NDArrays with both unknown and 0 dims and dim-sizes.
@DickJC123 PR #14998 only reverts the change to broadcast_to's param shape, without addressing the problem of loading zero-size ndarrays from files written with or without numpy compatibility. For that, I think you already provided a feasible solution which keeps backward compatibility, and we can go with that implementation; either you or I can do it. It's just not clear to me why sockeye's unit tests save zero-size tensors, since they cannot be used in any operators without numpy compatibility.
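For concreteness, a hedged sketch of what the backward-compatible load side of that solution might look like; LoadTShape and the legacy_format flag are illustrative names, not MXNet's actual API. Files carrying the new magic number would hold the expanded representation verbatim, while legacy files, which used 0 to mean 'unknown', get remapped on read:

```cpp
#include <dmlc/io.h>
#include <mxnet/tuple.h>

// Hypothetical sketch, not the actual MXNet load path.
bool LoadTShape(dmlc::Stream* strm, bool legacy_format, mxnet::TShape* shape) {
  int32_t ndim = 0;
  if (strm->Read(&ndim, sizeof(ndim)) != sizeof(ndim)) return false;
  if (ndim < 0) {                 // even the number of dimensions is unknown
    *shape = mxnet::TShape();
    return true;
  }
  *shape = mxnet::TShape(ndim, -1);
  for (int i = 0; i < ndim; ++i) {
    int64_t dim = 0;
    if (strm->Read(&dim, sizeof(dim)) != sizeof(dim)) return false;
    // Legacy writers used 0 for "unknown"; the expanded representation
    // reserves 0 for a genuinely zero-size dimension.
    (*shape)[i] = (legacy_format && dim == 0) ? -1 : dim;
  }
  return true;
}
```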
@reminisce I think you're really the one with the best 'big picture' of this new Shape facility and where the definition is going. Perhaps you could attempt the fix for the remaining issues with the Sockeye unittests? I picked the first test that I mentioned above and see that it is indeed trying to save/restore NDArrays with 0 dim-sizes:

It looks like @fhieber has some involvement with Sockeye. Are you still seeing unittest failures, even after the latest PR (not yet merged) from @reminisce?
@DickJC123 I have submitted the PR reverting the changes in the save/load functions. See #15073 for the analysis. Could you please give it a review? Thanks.
Hi @DickJC123, |
The documentation for broadcast_to(..., shape=<output_shape>,...) suggests that '0' can appear as a placeholder that means 'keep the same dimension as the input' for the given dimension. However, the C++ code requires -1 for this. Which is correct? Are we changing the behavior? @reminisce @KellenSunderland
Documentation:
But in the C++ code, we have:
https://github.com/apache/incubator-mxnet/blob/1eba37a8c3cb3a64efa0b52e79a3af7a6e7e5a57/src/operator/tensor/broadcast_reduce_op.h#L380-L401
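To make the discrepancy concrete, here is a self-contained sketch of the rule those lines implement; InferBroadcastShape is an illustrative name and this is not the actual MXNet source. The C++ path keys on -1 as the 'keep the input's dimension' placeholder, while the documentation says 0 plays that role:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch of the inference rule, not broadcast_reduce_op.h.
std::vector<int64_t> InferBroadcastShape(const std::vector<int64_t>& in,
                                         const std::vector<int64_t>& target) {
  std::vector<int64_t> out(target.size());
  for (size_t i = 0; i < target.size(); ++i) {
    // The code treats -1 as "keep in[i]"; the docs claim 0 does this.
    out[i] = (target[i] == -1) ? in[i] : target[i];
  }
  return out;
}

int main() {
  // Under the C++ behavior, shape=(2, -1) broadcasts (1, 3) -> (2, 3);
  // the documented shape=(2, 0) would instead be read as a literal
  // zero-size second dimension.
  const auto out = InferBroadcastShape({1, 3}, {2, -1});
  std::printf("(%lld, %lld)\n", (long long)out[0], (long long)out[1]);
  return 0;
}
```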
For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io
Description
(Brief description of the problem in no more than 2 sentences.)
Environment info (Required)
Package used (Python/R/Scala/Julia):
(I'm using ...)
For Scala user, please provide:
(java -version)
(mvn -version)
(scala -version)
For R user, please provide R sessionInfo():
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
Build config:
(Paste the content of config.mk, or the build command.)
Error Message:
(Paste the complete error message, including stack trace.)
Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
Steps to reproduce
(Paste the commands you ran that produced the error.)
What have you tried to solve it?