This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-798] Fix the dtype cast from non float32 in Gradient computation #12290

Merged: 14 commits, Sep 14, 2018

Conversation

@apeforest (Contributor) commented Aug 22, 2018

Description

This PR fixes issues #9067 and #8799, where gradient computation for operators with multiple outputs fails in ndarray if the dtype is not float32.

The root cause is that a _zeros operator was added for the other, unused ("don't care") output. The _zeros operator uses the float32 dtype by default, which conflicts with the ndarray dtype when it is not float32. My solution is to create a new _zeros_without_dtype operator that does not assume any default dtype and use it to replace the _zeros operator in the computation graph. This resolves the dtype conflict and should be backward compatible.
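
For intuition, here is a minimal user-level sketch of the underlying conflict (illustrative only; these are ordinary ndarray calls, not the actual graph-construction path):

import mxnet as mx

# MXNet does not promote dtypes automatically, so a float32 zeros array
# mixed with float64 data fails type inference -- the same conflict the
# auto-inserted _zeros node triggers inside the gradient graph.
z32 = mx.nd.zeros((4, 4))                   # dtype defaults to float32
x64 = mx.nd.ones((4, 4), dtype='float64')
try:
    (x64 + z32).wait_to_read()
except mx.base.MXNetError as e:
    print('dtype conflict:', str(e).splitlines()[0])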

A unit test is added to test this fix.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Changed the way the type is inferred for the auto-derived zero operator in nnvm::Graph
  • Added a unit test for operators with multiple outputs.

Comments

  • This seems to be a general problem for all multi-output operators when computing gradients in imperative mode. A simple example, copied from the original issue, is below:
  • Although the change is small, the impact could be large, so a thorough review is solicited.
import mxnet as mx
from mxnet import autograd


data = mx.nd.arange(16, dtype='float64').reshape((4, 4))
data.attach_grad()

with autograd.record():
    # split produces two outputs; only y[0] participates in backward
    y = mx.nd.split(data, axis=0, num_outputs=2)
y[0].backward()
# Before this fix, backward() failed here: the unused output y[1] was given a
# float32 _zeros head gradient, which conflicts with the float64 graph.
print(data.grad)

@apeforest (Contributor, Author) commented Aug 22, 2018

@eric-haibin-lin @piiswrong @haojin2 I would appreciate your review.



if __name__ == "__main__":
    test_infer_multiout_op()
Contributor:
I think this should go to something like test_operator.py instead of creating a separate file for it? And, please see https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L7017-L7018 for how to use nosetests.

Contributor (Author):
This is not intended to test the functionality of the operator, but rather a general type-casting issue affecting all multi-output operators. I'm inclined to add it to the infer-type tests, but would like to hear more suggestions.

Contributor (Author):
Changed the test to run via nose runmodule.
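
For reference, the footer pattern used at the linked lines is the standard one across MXNet's Python unit tests:

# Standard MXNet unit-test footer: allows the file to run standalone under nose.
if __name__ == '__main__':
    import nose
    nose.runmodule()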

@apeforest apeforest changed the title [MXNET-798] Fix the dtype cast from non float32 in Gradient computation [MXNET-798][WIP] Fix the dtype cast from non float32 in Gradient computation Aug 22, 2018
@apeforest (Contributor, Author):
Changed to [WIP] to fix some platform-dependent unit test failures.

@@ -254,7 +254,8 @@ nnvm::Graph InferAttr(nnvm::Graph &&ret,
       dispatch_mode = &dispatch_modes[nid];
       if (dispatch_modes[nid] == DispatchMode::kUndefined) forward_known = false;
     }
-    auto finfer = finfer_shape.get(inode.source->op(), fdefault);
+    auto finfer = (inode.source->op() == Op::Get("_zeros")) ? fdefault :
+                  finfer_shape.get(inode.source->op(), fdefault);
Member:

Are you sure about this? This affects all _zeros ops, not just the case you mentioned.

Contributor (Author):

You are right, this was breaking some unit tests (however, because the unit tests on the master branch are broken on macOS, I wasn't able to verify before check-in). I have changed the PR to WIP.

@apeforest apeforest changed the title [MXNET-798][WIP] Fix the dtype cast from non float32 in Gradient computation [MXNET-798] Fix the dtype cast from non float32 in Gradient computation Sep 12, 2018
@apeforest (Contributor, Author):
@eric-haibin-lin Please review this new implementation. Thanks for your suggestion!

@eric-haibin-lin (Member):
What's up with the build?

@apeforest (Contributor, Author):
@eric-haibin-lin Not sure exactly. An earlier build passed (dcc5f78). After I renamed some variables, the build on ARM7 failed. I can submit an empty change to trigger the build again.

@eric-haibin-lin (Member) left a review comment:

lgtm

with autograd.record():
    test64 = test_func(data64)
test64.backward()
assert_almost_equal(data64.grad.asnumpy().all(), data32.grad.asnumpy().all())
Member:
Can you set rtol and atol to some bigger values than the defaults here?

Contributor (Author):
Why increase rtol and atol if the unit test passes with the defaults?

Member:
This can be flaky. You are comparing a float32 numpy array against a float64 numpy array, and the default atol and rtol are small.
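
A small numpy illustration of why tight tolerances are risky here (hypothetical values, not taken from the test itself):

import numpy as np

# The same computation rounds differently in float32 and float64; the
# relative gap can approach float32 precision (~6e-8), so any tolerance
# tighter than that will fail intermittently.
a64 = np.arange(1, 17, dtype=np.float64) / 3.0
a32 = np.arange(1, 17, dtype=np.float32) / np.float32(3.0)
print(np.allclose(a32, a64, rtol=1e-9))             # False: tolerance too tight
print(np.allclose(a32, a64, rtol=1e-5, atol=1e-7))  # True: relaxed tolerances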

@anirudh2290 (Member):
Also, maybe we should add zeros to the APIs that may be good to break for 2.0 (#9686).

@apeforest (Contributor, Author):
@anirudh2290 The _zeros_without_dtype operator is a private operator used only in building the nnvm graph. It is not meant to be exposed to users.
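
To show the net effect, a minimal sketch assuming the fix is in place (user code never names _zeros_without_dtype directly; the op only appears in the auto-generated gradient graph):

import mxnet as mx
from mxnet import autograd

# The head gradient for the unused split output is now created by the private
# _zeros_without_dtype op, whose dtype is inferred from the graph, so
# non-float32 inputs work end to end.
for dt in ('float32', 'float64'):
    x = mx.nd.arange(16, dtype=dt).reshape((4, 4))
    x.attach_grad()
    with autograd.record():
        y = mx.nd.split(x, axis=0, num_outputs=2)
    y[0].backward()
    assert x.grad.dtype == x.dtype  # gradient keeps the input dtype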

@anirudh2290 (Member):
@apeforest What I meant is that we can change the dtype default to -1 for the zeros operator in 2.0.

@apeforest (Contributor, Author):
@anirudh2290 Thanks for the clarification. I have increased the atol and rtol values in the unit test. As for changing the dtype default to -1 for zeros, I think it is not related to this PR and may cause backward compatibility issues with old models. Therefore, I would prefer doing that in a separate PR. Please let me know what you think. Thanks.

@anirudh2290 (Member):
Not suggesting to do it in this PR. Just wanted to document it in the APIs to break for 2.0, and we can do it before the 2.0 release.

@anirudh2290 anirudh2290 merged commit 8209906 into apache:master Sep 14, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
…on (apache#12290)

* Fix the dtype mismatch in derived _zeros node

* Add unittest for infer dtype

* Add one more unit test

* Add nose runmodule

* Add a zero operator with no default dtype

* Rename variables

* fix a bug: rename operator for gpu

* Increase atol and rtol to avoid flakiness
@apeforest apeforest deleted the bugfix/dtype-cast branch January 7, 2020 22:49