broadcast_to() behavior change after new 0D and uninitialized shape support added? #14954
Thanks for reporting this. That was changed by mistake. I will revert this change for broadcast_to.
Thank you for fixing this! Some additional suggestions:
@DickJC123 Could you point me to sockeye's specific failing test? Thanks.
I've been running in the top-level sockeye directory. I'm down to 5 failing tests; I assume you're asking about those 5. They are:
In each case the error is a device_type == 0 (neither cpu nor gpu) on a read-in NDArray. The loader is reading a file that has neither of the two magic numbers (so it falls back to 'legacy' loading) and, I think, gets confused when reading in a shape.
The 5 tests that fail do so because each test is trying to save and restore a 2D NDArray with a 0 size for one of the dimensions. I have not played with numpy compatibility modes for the tests, but in each case the last 3 of the following code lines are causing problems by converting, for example, (0,1) -> (-1,1):

So you might say 'put the 5 tests in numpy_compatibility_mode.' However, I started thinking about the notion of TShape serialization. Ideally, the representation (and what it means) should not depend on the mode of the writer. Why not always save TShapes in the expanded 'numpy_compatibility_mode' representation (unknown = -1, known-0 = 0, known-non-zero = N), regardless of the mode of the writer? The code in ndarray.cc might then be:
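A minimal sketch of what that always-expanded save path could look like, assuming a dmlc::Stream-style writer; the function name and framing are illustrative, not the actual ndarray.cc code:

```cpp
#include <dmlc/io.h>
#include <mxnet/tuple.h>

// Hypothetical sketch only, not MXNet's actual ndarray.cc. Serialize the
// shape exactly as the backend stores it in memory:
//   -1 = unknown, 0 = known zero size, N = known non-zero size,
// so the on-disk meaning never depends on the writer's
// numpy-compatibility mode.
void SaveTShape(dmlc::Stream* strm, const mxnet::TShape& shape) {
  const int32_t ndim = shape.ndim();  // may itself be -1 if ndim is unknown
  strm->Write(&ndim, sizeof(ndim));
  for (int i = 0; i < ndim; ++i) {
    const int64_t dim = shape[i];     // already -1 / 0 / N in memory
    strm->Write(&dim, sizeof(dim));
  }
}
```

This would round-trip a shape like (0,1) unchanged instead of collapsing it to (-1,1).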
@DickJC123 Sorry for the late reply, and thanks for the analysis and suggestion. I agree with what you proposed above. The ndarrays in memory in the backend are indeed as you described, i.e. -1 means unknown, 0 means known zero size, etc. The conversion to the numpy shape representation in NDArray's Save function was somehow never implemented. Would you mind submitting a fix, or if you are busy, I can take care of it? Thanks. A side question: since zero-size tensors cannot be digested by any operators in MXNet without numpy compatibility mode, what is the purpose of saving zero-size tensors in sockeye's tests?
Sorry for not getting to the review yet (I was sidetracked preparing issue 15034). During my review tomorrow, I'll be looking to see that a serialized NDArray (and its shapes) can be interpreted purely from the state of the file, without knowing the 'numpy compatibility mode' of the writer and without the reader being in a particular mode. I'm not sure this can be achieved with the current set of magic numbers. I'm assuming there's value going forward in being able to save NDArrays with both unknown and 0 dims and dim-sizes.
@DickJC123 PR #14998 only reverts the change to broadcast_to's param shape, without addressing the problem of loading zero-size ndarrays from files written with or without numpy compatibility. For that, I think you already provided a feasible solution which keeps backward compatibility, and we can go with that implementation; either you or I can do it. It's just not clear to me why sockeye's unit tests save zero-size tensors, since they cannot be used in any operators without numpy compatibility.
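For concreteness, a hedged sketch of what the backward-compatible load side of that solution might look like; LoadTShape and the legacy_format flag are illustrative names, not MXNet's actual API. Files carrying the new magic number would hold the expanded representation verbatim, while legacy files, which used 0 to mean 'unknown', get remapped on read:

```cpp
#include <dmlc/io.h>
#include <mxnet/tuple.h>

// Hypothetical sketch, not the actual MXNet load path.
bool LoadTShape(dmlc::Stream* strm, bool legacy_format, mxnet::TShape* shape) {
  int32_t ndim = 0;
  if (strm->Read(&ndim, sizeof(ndim)) != sizeof(ndim)) return false;
  if (ndim < 0) {                 // even the number of dimensions is unknown
    *shape = mxnet::TShape();
    return true;
  }
  *shape = mxnet::TShape(ndim, -1);
  for (int i = 0; i < ndim; ++i) {
    int64_t dim = 0;
    if (strm->Read(&dim, sizeof(dim)) != sizeof(dim)) return false;
    // Legacy writers used 0 for "unknown"; the expanded representation
    // reserves 0 for a genuinely zero-size dimension.
    (*shape)[i] = (legacy_format && dim == 0) ? -1 : dim;
  }
  return true;
}
```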
@reminisce I think you're really the one with the best 'big picture' of this new Shape facility and where the definition is going. Perhaps you could attempt the fix for the remaining issues with the Sockeye unittests? I picked the first test that I mentioned above and see that it is indeed trying to save/restore NDArrays with 0 dim-sizes:

It looks like @fhieber has some involvement with Sockeye. Are you still seeing unittest failures, even after the latest PR (not yet merged) from @reminisce?
@DickJC123 I have submitted the PR reverting the changes in the save/load functions. See #15073 for the analysis. Could you please give it a review? Thanks.
Hi @DickJC123, |
The documentation for broadcast_to(..., shape=<output_shape>,...) suggests that '0' can appear as a placeholder that means 'keep the same dimension as the input' for the given dimension. However, the C++ code requires -1 for this. Which is correct? Are we changing the behavior? @reminisce @KellenSunderland
Documentation:
But in the C++ code, we have:
https://github.com/apache/incubator-mxnet/blob/1eba37a8c3cb3a64efa0b52e79a3af7a6e7e5a57/src/operator/tensor/broadcast_reduce_op.h#L380-L401
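To make the discrepancy concrete, here is a self-contained sketch of the rule those lines implement; InferBroadcastShape is an illustrative name and this is not the actual MXNet source. The C++ path keys on -1 as the 'keep the input's dimension' placeholder, while the documentation says 0 plays that role:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch of the inference rule, not broadcast_reduce_op.h.
std::vector<int64_t> InferBroadcastShape(const std::vector<int64_t>& in,
                                         const std::vector<int64_t>& target) {
  std::vector<int64_t> out(target.size());
  for (size_t i = 0; i < target.size(); ++i) {
    // The code treats -1 as "keep in[i]"; the docs claim 0 does this.
    out[i] = (target[i] == -1) ? in[i] : target[i];
  }
  return out;
}

int main() {
  // Under the C++ behavior, shape=(2, -1) broadcasts (1, 3) -> (2, 3);
  // the documented shape=(2, 0) would instead be read as a literal
  // zero-size second dimension.
  const auto out = InferBroadcastShape({1, 3}, {2, -1});
  std::printf("(%lld, %lld)\n", (long long)out[0], (long long)out[1]);
  return 0;
}
```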
For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io
Description
(Brief description of the problem in no more than 2 sentences.)
Environment info (Required)
Package used (Python/R/Scala/Julia):
(I'm using ...)
For Scala user, please provide:
(java -version)
(mvn -version)
(scala -version)
For R user, please provide R sessionInfo():
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
Build config:
(Paste the content of config.mk, or the build command.)
Error Message:
(Paste the complete error message, including stack trace.)
Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
Steps to reproduce
(Paste the commands you ran that produced the error.)
What have you tried to solve it?