Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-882] Support for N-d arrays added to diag op. #12430

Merged
merged 15 commits into from
Sep 18, 2018

Conversation

jasonyu1996
Copy link
Contributor

@jasonyu1996 jasonyu1996 commented Sep 1, 2018

Description

Github Issue #12327

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-12430/3/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Comments

  • This change is an extension to the existing interface, and therefore completely back compatible.

if (ishape.ndim() == 1) {
auto s = ishape[0] + std::abs(k);
return TShape({s, s});
}

auto h = ishape[0];
auto w = ishape[1];
auto h = ishape[axis1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to the previous comment, shouldnt we check that the provided axis1 and axis2 values are within the acceptable range [0:ndim-1] so that users dont get nasty segfaults on incorrect inputs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there is a check later on line 85/86, so maybe consider re-organizing the code so that this is looked up after the check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. These checks should be done first before using axis1 and axis2. Also I find it necessary to add some tests testing cases with negative axes.

@jasonyu1996
Copy link
Contributor Author

@samskalicky Thanks for your review! I have accordingly made modifications to the code.

@jasonyu1996
Copy link
Contributor Author

jasonyu1996 commented Sep 6, 2018

I don't understand what has happened. It looks quite unrelated to my modifications. It seems that several recent updates have failed at the same place.

======================================================================
ERROR: test_tvm_bridge.test_tvm_bridge
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/test_tvm_bridge.py", line 61, in test_tvm_bridge
    check(tgt, dtype)
  File "/work/mxnet/tests/python/gpu/test_tvm_bridge.py", line 46, in check
    f = tvm.build(s, [x, y, zz, scale])
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/build_module.py", line 501, in build
    mhost = codegen.build_module(fhost, str(target_host))
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/codegen.py", line 20, in build_module
    return _Build(lowered_func, target)
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/base.py", line 66, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
TVMError: [04:14:55] /tmp/tvm/src/codegen/codegen.cc:27: Check failed: bf != nullptr Target llvm is not enabled

@samskalicky
Copy link
Contributor

samskalicky commented Sep 6, 2018

@jasonyu1996 regarding this error, I wonder if its transient. Can you try pushing an empty change and restart the CI pipeline? Its possible it will pass a 2nd time.

If that still doesnt work, can you try to debug it? Sorry, the message doesnt look very meaningful. Does that test fail if you run it yourself on your branch? If you run the same test in the master branch does it fail the same way? You might have to try re-applying some of your changes and see when it starts breaking to debug.

There are some instructions here to setup debugging (if you're using a Mac): https://cwiki.apache.org/confluence/display/MXNET/MXNet+Developer+Setup+on+Mac but im also working on improving this and putting together a debug guide so please post any steps or processes that you do (whether they worked or not) and i'll be sure to include those.

@jasonyu1996
Copy link
Contributor Author

jasonyu1996 commented Sep 7, 2018

@samskalicky Thanks! I also think the error would be transient. Possibly I should wait for a while until this problem gets solved. It has apparently not, because so far updates since noon of 6th (UTC) have all failed, even including one that changes only a markdown file, which has failed at exactly the same place as ours.

UPD: Somebody has opened an issue regarding this problem: #12473

@samskalicky
Copy link
Contributor

samskalicky commented Sep 7, 2018

@jasonyu1996 There is a PR to fix this issue dmlc/nnvm#525 and an associated change on mxnet #12479. I know that this issue is blocking everyone else so hopefully this gets through soon and we can get your changes merged too.

Thanks for all of your explanations on the code (especially the code you didnt write)! I hope you dont mind if I ask some more while we wait for the nnvm fix.

Also, can you explain the backward pass with the diagonal operator? What exactly is happening there, the diagonal comes in and the original input matrix gets updated? What is the ML meaning in the diag case; what error is there to correct in this calculation?

@jasonyu1996
Copy link
Contributor Author

@samskalicky Thanks! As what the diag operator does is basically copying some elements from the input directly to output, the backward process is quite straightforward. What it does is simply copying the gradients from the output back to the corresponding elements of the input. This is achieved with a template parameter indicating whether it is forward or backward to control the direction of the assignment in the copy process.

Copy link
Contributor

@samskalicky samskalicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@samskalicky
Copy link
Contributor

@anirudh2290 Please review/merge

@jasonyu1996
Copy link
Contributor Author

@anirudh2290 @samskalicky Thanks!

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasonyu1996 Thanks for your contribution. It looks good overall. I have a few questions and suggested some small changes. Thanks!

src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
src/operator/tensor/diag_op-inl.h Outdated Show resolved Hide resolved
@jasonyu1996
Copy link
Contributor Author

@apeforest Thanks for your meticulous review! I have made the change according to your suggestions.

@jasonyu1996
Copy link
Contributor Author

@samskalicky @apeforest Thanks again!

@jasonyu1996
Copy link
Contributor Author

@samskalicky @apeforest It seems that dim_t is poisonous in some cases. I have changed to following examples. k changed to int following the old diag implementation, axis1 and axis2 changed to int32_t following swapaxes and concat. I notice this is actually not very consistent in different operator implementations, and believe that if one day it is necessary to have a tensor of that many dimensions, these operators might have to go through some refactoring work.

@jasonyu1996
Copy link
Contributor Author

@apeforest @samskalicky Am I expected to do something more to get this PR merged?

@samskalicky
Copy link
Contributor

Sorry @jasonyu1996, i'll take a look at this today and give you a status update. Thanks for your patience!

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Many thanks for your contribution!

@lupesko
Copy link
Contributor

lupesko commented Sep 17, 2018

@sandeep-krishnamurthy @nswamy @anirudh2290 - looks like this one is ready to go, can you please review/merge? Thanks!

Copy link
Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

@sandeep-krishnamurthy sandeep-krishnamurthy merged commit 1e95bcd into apache:master Sep 18, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
* Support for N-d arrays added to diag op.

* Doc for diag updated. Sanity fix. Unit test for N-d diag op added.

* Unwanted print in diag test removed.

* Bad negative axis support in diag fixed.

* Bad negative axis support in diag fixed.

* Index overflow in diag fixed. Exemplars for Nd diag added.

* A bug in diag op fixed.

* Types of axis number and dim size changed to dim_t in diag.

* A bug in params of diag fixed.

* Poisonous dim_t removed from diag parameters.

* Diag impl details documented in comments.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants