[MXNET-882] Support for N-d arrays added to diag op. #12430

jasonyu1996 · 2018-09-01T10:23:50Z

Description

Github Issue #12327

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-12430/3/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

The diag operator now supports N-d arrays where N > 2. In this case it behaves in the same way as the diagonal operator provided by numpy (https://www.numpy.org/devdocs/reference/generated/numpy.diagonal.html). Axes of the sub-arrays can be specified with axis1 and axis2.

Comments

This change is an extension to the existing interface, and therefore completely back compatible.

src/operator/tensor/diag_op-inl.h

samskalicky · 2018-09-05T18:27:35Z

src/operator/tensor/diag_op-inl.h

  if (ishape.ndim() == 1) {
    auto s = ishape[0] + std::abs(k);
    return TShape({s, s});
  }

-  auto h = ishape[0];
-  auto w = ishape[1];
+  auto h = ishape[axis1];


related to the previous comment, shouldnt we check that the provided axis1 and axis2 values are within the acceptable range [0:ndim-1] so that users dont get nasty segfaults on incorrect inputs?

I see there is a check later on line 85/86, so maybe consider re-organizing the code so that this is looked up after the check

I think you are right. These checks should be done first before using axis1 and axis2. Also I find it necessary to add some tests testing cases with negative axes.

src/operator/tensor/diag_op-inl.h

jasonyu1996 · 2018-09-06T02:34:29Z

@samskalicky Thanks for your review! I have accordingly made modifications to the code.

jasonyu1996 · 2018-09-06T06:22:01Z

I don't understand what has happened. It looks quite unrelated to my modifications. It seems that several recent updates have failed at the same place.

======================================================================
ERROR: test_tvm_bridge.test_tvm_bridge
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/test_tvm_bridge.py", line 61, in test_tvm_bridge
    check(tgt, dtype)
  File "/work/mxnet/tests/python/gpu/test_tvm_bridge.py", line 46, in check
    f = tvm.build(s, [x, y, zz, scale])
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/build_module.py", line 501, in build
    mhost = codegen.build_module(fhost, str(target_host))
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/codegen.py", line 20, in build_module
    return _Build(lowered_func, target)
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/base.py", line 66, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
TVMError: [04:14:55] /tmp/tvm/src/codegen/codegen.cc:27: Check failed: bf != nullptr Target llvm is not enabled

samskalicky · 2018-09-06T16:52:17Z

@jasonyu1996 regarding this error, I wonder if its transient. Can you try pushing an empty change and restart the CI pipeline? Its possible it will pass a 2nd time.

If that still doesnt work, can you try to debug it? Sorry, the message doesnt look very meaningful. Does that test fail if you run it yourself on your branch? If you run the same test in the master branch does it fail the same way? You might have to try re-applying some of your changes and see when it starts breaking to debug.

There are some instructions here to setup debugging (if you're using a Mac): https://cwiki.apache.org/confluence/display/MXNET/MXNet+Developer+Setup+on+Mac but im also working on improving this and putting together a debug guide so please post any steps or processes that you do (whether they worked or not) and i'll be sure to include those.

src/operator/tensor/diag_op-inl.h

jasonyu1996 · 2018-09-07T02:22:59Z

@samskalicky Thanks! I also think the error would be transient. Possibly I should wait for a while until this problem gets solved. It has apparently not, because so far updates since noon of 6th (UTC) have all failed, even including one that changes only a markdown file, which has failed at exactly the same place as ours.

UPD: Somebody has opened an issue regarding this problem: #12473

samskalicky · 2018-09-07T21:00:21Z

@jasonyu1996 There is a PR to fix this issue dmlc/nnvm#525 and an associated change on mxnet #12479. I know that this issue is blocking everyone else so hopefully this gets through soon and we can get your changes merged too.

Thanks for all of your explanations on the code (especially the code you didnt write)! I hope you dont mind if I ask some more while we wait for the nnvm fix.

Also, can you explain the backward pass with the diagonal operator? What exactly is happening there, the diagonal comes in and the original input matrix gets updated? What is the ML meaning in the diag case; what error is there to correct in this calculation?

src/operator/tensor/diag_op-inl.h

jasonyu1996 · 2018-09-08T02:46:27Z

@samskalicky Thanks! As what the diag operator does is basically copying some elements from the input directly to output, the backward process is quite straightforward. What it does is simply copying the gradients from the output back to the corresponding elements of the input. This is achieved with a template parameter indicating whether it is forward or backward to control the direction of the assignment in the copy process.

samskalicky

LGTM

samskalicky · 2018-09-10T16:26:40Z

@anirudh2290 Please review/merge

jasonyu1996 · 2018-09-11T14:32:39Z

@anirudh2290 @samskalicky Thanks!

src/operator/tensor/diag_op-inl.h

apeforest

@jasonyu1996 Thanks for your contribution. It looks good overall. I have a few questions and suggested some small changes. Thanks!

src/operator/tensor/diag_op-inl.h

jasonyu1996 · 2018-09-12T02:22:20Z

@apeforest Thanks for your meticulous review! I have made the change according to your suggestions.

src/operator/tensor/diag_op-inl.h

jasonyu1996 · 2018-09-13T02:06:28Z

@samskalicky @apeforest Thanks again!

jasonyu1996 · 2018-09-13T13:42:00Z

@samskalicky @apeforest It seems that dim_t is poisonous in some cases. I have changed to following examples. k changed to int following the old diag implementation, axis1 and axis2 changed to int32_t following swapaxes and concat. I notice this is actually not very consistent in different operator implementations, and believe that if one day it is necessary to have a tensor of that many dimensions, these operators might have to go through some refactoring work.

jasonyu1996 · 2018-09-17T11:07:01Z

@apeforest @samskalicky Am I expected to do something more to get this PR merged?

samskalicky · 2018-09-17T16:08:22Z

Sorry @jasonyu1996, i'll take a look at this today and give you a status update. Thanks for your patience!

apeforest

LGTM. Many thanks for your contribution!

lupesko · 2018-09-17T22:48:34Z

@sandeep-krishnamurthy @nswamy @anirudh2290 - looks like this one is ready to go, can you please review/merge? Thanks!

sandeep-krishnamurthy

LGTM. Thanks.

* Support for N-d arrays added to diag op. * Doc for diag updated. Sanity fix. Unit test for N-d diag op added. * Unwanted print in diag test removed. * Bad negative axis support in diag fixed. * Bad negative axis support in diag fixed. * Index overflow in diag fixed. Exemplars for Nd diag added. * A bug in diag op fixed. * Types of axis number and dim size changed to dim_t in diag. * A bug in params of diag fixed. * Poisonous dim_t removed from diag parameters. * Diag impl details documented in comments.

Support for N-d arrays added to diag op.

ce4e902

jasonyu1996 requested a review from anirudh2290 as a code owner September 1, 2018 10:23

jasonyu1996 added 2 commits September 1, 2018 19:43

Doc for diag updated. Sanity fix. Unit test for N-d diag op added.

98adc98

Unwanted print in diag test removed.

b25080a

jasonyu1996 mentioned this pull request Sep 5, 2018

[Feature Request] support of diag for N-d arrays #12327

Closed

samskalicky reviewed Sep 5, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved

samskalicky reviewed Sep 5, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved

Bad negative axis support in diag fixed.

1004a49

samskalicky reviewed Sep 6, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved

samskalicky reviewed Sep 6, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved

samskalicky reviewed Sep 6, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved

samskalicky reviewed Sep 7, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Show resolved Hide resolved

jasonyu1996 added 2 commits September 8, 2018 11:16

Bad negative axis support in diag fixed.

128f89e

Merge branch 'diag' of https://github.com/jasonyu1996/mxnet into diag

a44149c

samskalicky approved these changes Sep 10, 2018

View reviewed changes

Merge branch 'master' into diag

44ce541

apeforest reviewed Sep 11, 2018

View reviewed changes

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved

apeforest suggested changes Sep 11, 2018

View reviewed changes

Index overflow in diag fixed. Exemplars for Nd diag added.

785d95d

A bug in diag op fixed.

e672aed

samskalicky reviewed Sep 12, 2018

View reviewed changes