This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

broadcast_to() behavior change after new 0D and uninitialized shape support added? #14954

Closed
DickJC123 opened this issue May 15, 2019 · 13 comments

@DickJC123
Contributor

The documentation for broadcast_to(..., shape=<output_shape>, ...) suggests that '0' can appear in the output shape as a placeholder meaning 'keep the input's size' for that dimension. However, the C++ code requires -1 for this. Which is correct? Are we changing the behavior? @reminisce @KellenSunderland

Documentation:

Broadcasting is allowed on axes with size 1, such as from (2,1,3,1) to (2,8,3,9). Elements will be duplicated on the broadcasted axes.  For example:

broadcast_to([[1,2,3]], shape=(2,3)) = [[ 1.,  2.,  3.],
                                        [ 1.,  2.,  3.]]

A dimension you do not want to change can also be given as 0, which means 'copy the original value'. So with shape=(2,0), we will obtain the same result as in the above example.

Defined in src/operator/tensor/broadcast_reduce_op_value.cc:L262
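To make the wildcard behavior under discussion concrete, here is a small Python sketch (hypothetical helper, not MXNet API) of how the target shape could be resolved if both 0 (as documented) and -1 (as the C++ code requires) were accepted to mean 'keep the input's dimension':

```python
def resolve_broadcast_shape(in_shape, target):
    """Resolve wildcard entries in a broadcast_to target shape.

    Treats both 0 (documented behavior) and -1 (current C++ behavior)
    as 'keep the corresponding input dimension'. Hypothetical sketch,
    not the actual MXNet shape-inference code.
    """
    if len(in_shape) != len(target):
        raise ValueError("input and target must have the same rank")
    out = []
    for i, (src, dst) in enumerate(zip(in_shape, target)):
        if dst in (0, -1):
            out.append(src)          # wildcard: copy the input dim
        elif src != dst and src != 1:
            raise ValueError(f"cannot broadcast dim {i}: {src} -> {dst}")
        else:
            out.append(dst)          # broadcast size-1 dim, or keep equal dim
    return tuple(out)

# shape=(2, 0) keeps the input's second dimension, matching shape=(2, 3)
print(resolve_broadcast_shape((1, 3), (2, 0)))   # (2, 3)
print(resolve_broadcast_shape((1, 3), (2, -1)))  # (2, 3)
```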

But in the C++ code, we have:

https://github.com/apache/incubator-mxnet/blob/1eba37a8c3cb3a64efa0b52e79a3af7a6e7e5a57/src/operator/tensor/broadcast_reduce_op.h#L380-L401


@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

@reminisce
Contributor

Thanks for reporting this. That was changed by mistake. I will revert this change for broadcast_to and add a unit test for it.

@reminisce reminisce added the Bug label May 15, 2019
@reminisce reminisce self-assigned this May 15, 2019
@DickJC123
Contributor Author

Thank you for fixing this! Some additional suggestions:

  • If you allow both 0 and -1 as the output-dimension 'wildcard', you should probably update the documentation to reflect this. Going forward, would you suggest people use -1 in the shape, with 0 being deprecated?
  • As you noticed, there's no unit test for this case. A number of the Sockeye unit tests flushed out the issue for me. Even after applying a trial fix for this problem, I still see errors in the Sockeye tests. I will investigate these further, but you may want to try your fix against the Sockeye test suite as well.

@reminisce
Contributor

@DickJC123 Could you point me to sockeye's specific failing tests? Thanks.

@DickJC123
Contributor Author

I've been running `python setup.py test` in the top-level sockeye directory. I saw 22 unit tests fail. After changing line 392 in the above-posted snippet to:

 if (oshape[i] != -1 && oshape[i] != 0) {

I'm down to 5 tests failing. I assume you're asking about these 5? They are:

test/unit/test_data_io.py::test_parallel_data_set FAILED                 [ 39%]
test/unit/test_data_io.py::test_sharded_parallel_sample_iter FAILED      [ 43%]
test/integration/test_seq_copy_int.py::test_seq_copy ... FAILED [ 97%]
test/integration/test_seq_copy_int.py::test_seq_copy ... FAILED [ 97%]
test/integration/test_seq_copy_int.py::test_seq_copy ...FAILED [ 97%]

In each case the error is a device_type == 0 (neither cpu nor gpu) on a read-in NDArray. The test reads a file that has neither of the 2 magic numbers (so it takes the 'legacy' loading path) and, I think, gets confused while reading in a shape.

@DickJC123
Contributor Author

DickJC123 commented May 16, 2019

The 5 tests that fail do so because the test is trying to save and restore a 2D NDArray with a 0 size for one of the dimensions. I have not played with numpy compatibility modes for the tests, but in each case the last 3 of the following code lines are causing problems by converting, for example, (0,1) -> (-1,1):
https://github.com/apache/incubator-mxnet/blob/c5265fbf635965b5124df2d7dcea1b57c6508b73/src/ndarray/ndarray.cc#L1713-L1718

So you might say 'put the 5 tests in numpy_compatibility_mode.' However, I started thinking about the notion of TShape serialization. Ideally, the representation (and what it means) should not depend on the mode of the writer. Why not always save TShapes in the expanded 'numpy_compatibility mode' representation (unknown = -1, known-0 = 0, known-non-zero = N), regardless of the mode of the writer?

The code in ndarray.cc might then be:

void NDArray::Save(dmlc::Stream *strm) const {
...
  // save shape
  auto stored_shape = shape_;
  if (!Imperative::Get()->is_np_comp()) {
    common::ConvertToNumpyShape(&stored_shape);
  }
  stored_shape.Save(strm);
...
}

bool NDArray::Load(dmlc::Stream *strm) {
...
  // load shape
  mxnet::TShape shape;
  if (!shape.Load(strm)) return false;
  if (!Imperative::Get()->is_np_comp()) {
    common::ConvertToLegacyShape(&shape);
  }
...
} 
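The conversion the Save/Load sketch above relies on can be modeled in a few lines of Python. This is a sketch of the *semantics* of common::ConvertToNumpyShape / common::ConvertToLegacyShape as described in this thread (legacy shapes encode an unknown dim as 0; the numpy-compatible representation uses -1 for unknown and keeps 0 as a genuine zero-size dim), not the actual implementation:

```python
def to_numpy_shape(shape):
    # legacy: 0 means unknown -> numpy-compatible: -1 means unknown
    return tuple(-1 if d == 0 else d for d in shape)

def to_legacy_shape(shape):
    # numpy-compatible: -1 means unknown -> legacy: 0 means unknown.
    # Note the loss of information: a genuine zero-size dim (0) and an
    # unknown dim (-1) both collapse to 0 in the legacy representation.
    return tuple(0 if d == -1 else d for d in shape)

# This is the (0,1) -> (-1,1) conversion that broke the sockeye tests
# when applied to a shape with a genuine zero-size dimension:
print(to_numpy_shape((0, 1)))   # (-1, 1)
print(to_legacy_shape((-1, 1))) # (0, 1)
```

The information loss in to_legacy_shape is exactly why always serializing in the expanded numpy-compatible representation, regardless of the writer's mode, makes the file self-describing.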

@reminisce
Contributor

reminisce commented May 19, 2019

@DickJC123 Sorry for the late reply, and thanks for the analysis and suggestion. I agree with what you proposed. The ndarrays in memory in the backend are indeed represented as you described, i.e. -1 means unknown, 0 means a known zero size, etc. The conversion to numpy shape in NDArray's Save function was somehow never implemented. Would you mind submitting a fix? If you are busy, I can take care of it. Thanks.

A side question: since zero-size tensors cannot be digested by any operator in MXNet without numpy compatibility mode, what is the purpose of saving zero-size tensors in sockeye's tests?

@DickJC123
Contributor Author

Sorry for not getting to the review yet (I was sidetracked preparing issue 15034). During my review tomorrow, I'll be checking that a serialized NDArray (and its shapes) can be interpreted purely from the contents of the file, without knowing the 'numpy compatibility mode' of the writer and without the reader being in a particular mode. I'm not sure this can be achieved with the current set of magic numbers.

I'm assuming there's value in being able to save NDArrays with both unknown and zero-size dimensions going forward.

@reminisce
Contributor

@DickJC123 PR #14998 only reverts the change to broadcast_to's shape parameter without addressing the problem of loading zero-size ndarrays from files with or without numpy compatibility. For that, I think you have already provided a feasible solution that keeps backward compatibility, and we can go with that implementation; either you or I can do it. It's just not clear to me why zero-size tensors are saved in sockeye's unit tests, since they cannot be used in any operator without numpy compatibility.

@DickJC123
Contributor Author

@reminisce I think you're really the one with the best 'big picture' of this new Shape facility and where the definition is going. Perhaps you could attempt the fix for the remaining issues with the Sockeye unit tests? I picked the first test I mentioned above and see that it is indeed trying to save/restore NDArrays with 0 dim-sizes:

https://github.com/awslabs/sockeye/blob/bb588ecbe874ae29ede33af2709e251910778bb3/test/unit/test_data_io.py#L255-L297

It looks like @fhieber has some involvement with Sockeye. @fhieber, are you still seeing unit test failures, even after the latest (not yet merged) PR from @reminisce?

@reminisce
Contributor

@DickJC123 I have submitted the PR reverting the changes in save/load functions. See #15073 for analysis. Could you please give a review? Thanks.

@fhieber
Contributor

fhieber commented May 27, 2019

Hi @DickJC123,
sorry for seeing this issue only now.
I just checked with the latest nightly build of mxnet: all unit tests pass for the latest sockeye. The issue I observed with broadcast_to (usage in Sockeye here: https://github.com/awslabs/sockeye/blob/master/sockeye/layers.py#L378) has been addressed.

@reminisce
Contributor

Fixed by
#15073
#14998
