
[MKLDNN] Independent gradients requests check with respect to weights and bias of convolution #15497

Merged
merged 13 commits into apache:master on Jul 16, 2019

Conversation

zixuanweeei
Contributor

@zixuanweeei zixuanweeei commented Jul 9, 2019

Description

As described in #15464, MXNet with MKL-DNN gives a wrong gradient of a convolution with respect to its bias unless the gradient with respect to its weights is also requested. In the MKL-DNN implementation of convolution, only the request for the weights gradient is checked; the request for the bias gradient should be checked independently.
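For context, the failing case can be reproduced with a script along the following lines (a minimal sketch in the style of the script posted further down this thread; shapes and values are illustrative). With only the bias gradient requested, the MKL-DNN path returned a wrong value for grad['b']; the expected value here is 9.0, the sum of the 3x3 output gradient of ones:

import mxnet as mx

sym = mx.sym.Convolution(
    mx.sym.Variable('in'),
    mx.sym.Variable('w'),
    mx.sym.Variable('b'),
    kernel=(1, 1),
    num_filter=1
)
args = {
    'in': mx.nd.ones([1, 1, 3, 3]),
    'w': mx.nd.ones([1, 1, 1, 1]),
    'b': mx.nd.ones([1]),
}
grad = {
    'in': mx.nd.zeros([1, 1, 3, 3]),
    'w': mx.nd.zeros([1, 1, 1, 1]),
    'b': mx.nd.zeros([1]),
}
# Request only the bias gradient; the weights gradient stays 'null'.
req = {'in': 'null', 'w': 'null', 'b': 'write'}

ex = sym.bind(mx.cpu(), args, args_grad=grad, grad_req=req)
ex.forward(is_train=True)
ex.backward(out_grads=mx.nd.ones([1, 1, 3, 3]))
mx.nd.waitall()
print(grad['b'])  # expected: [9.], the sum of the output gradient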

Changes

  • Independently check the requests for gradients with respect to the weights and bias of convolution.
  • Convolution can now give the gradient with respect to either the weights or the bias alone.

Comments

  • No comments.

@zixuanweeei
Contributor Author

@pengzhao-intel @ciyongch @TaoLv Please help review this PR. Thanks. 😃

@pengzhao-intel
Copy link
Contributor

Could you try to add a UT for this case?

@zixuanweeei
Contributor Author

zixuanweeei commented Jul 9, 2019

Could you try to add a UT for this case?

Sure. The existing UTs pass with this PR in a local test. I will add UTs to check the correctness of the results for the various combinations of gradient requests.

Member

@TaoLv TaoLv left a comment


Thanks for the fix. Just one minor comment. The CI seems stuck; please try to re-trigger it.

MKLDNNStream::Get()->RegisterPrim(convBwdWeight.GetBwdWeights());
CommitOutput(in_grad[conv::kBias], in_grad_bias);
} else {
Member

I suggest checking req[conv::kWeight] here.

Contributor Author

Sure, I see. Without this check there is an unnecessary primitive registration. Thanks.

@TaoLv
Member

TaoLv commented Jul 10, 2019

@matteosal Could you help verify this PR with the test case in your project?

}
CommitOutput(in_grad[conv::kWeight], in_grad_weight);
if (req[conv::kWeight]) CommitOutput(in_grad[conv::kWeight], in_grad_weight);
Contributor

What's the behavior of req[conv::kBias]?

Contributor Author

@zixuanweeei zixuanweeei Jul 11, 2019


It has the same behavior as req[conv::kWeight]. Both hold the operation request type (OpReqType) passed to Forward and Backward. We can use it to control how the result memory is handled, e.g. adding or copying the result into the gradient output memory, or doing nothing at all.
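For illustration only (a rough summary of the above, with the mapping assumed from how grad_req is used in the scripts in this thread): the Python-level grad_req strings correspond to OpReqType values and control what is done with each computed gradient.

# Illustrative mapping between Python-level grad_req strings and OpReqType:
#   'null'  -> kNullOp   : do not compute / do not write the gradient
#   'write' -> kWriteTo  : overwrite the gradient buffer
#   'add'   -> kAddTo    : accumulate into the existing gradient buffer
req = {'in': 'write', 'w': 'add', 'b': 'null'}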

Contributor

@pengzhao-intel pengzhao-intel left a comment


I suggest reorganizing the code logic to avoid the multiple if/if-else structures, which hurt readability.

@pengzhao-intel pengzhao-intel changed the title Independent gradients requests check with respect to weights and bias of convolution [MKLDNN] Independent gradients requests check with respect to weights and bias of convolution Jul 11, 2019
@matteosal
Contributor

The example from #15464 is fixed here, but I see a failure with this one, where the weights gradient is requested in isolation (the opposite of #15464):

import mxnet as mx

sym = mx.sym.Convolution(
	mx.sym.Variable('in'), 
	mx.sym.Variable('w'), 
	mx.sym.Variable('b'),
	kernel=(1, 1), 
	num_filter=1
)
args = {
	'in': mx.nd.ones([1, 1, 3, 3]),
	'w': mx.nd.ones([1, 1, 1, 1]),
	'b': mx.nd.ones([1]),
}
grad = {
	'in': mx.nd.zeros([1, 1, 3, 3]),
	'w': mx.nd.zeros([1, 1, 1, 1]),
	'b': mx.nd.zeros([1]),
}
req = {'in': 'null', 'w': 'write', 'b': 'null'}
outgrad = mx.nd.ones([1, 1, 3, 3])

ex = sym.bind(mx.cpu(), args, args_grad=grad, grad_req=req)

ex.forward(True);
ex.backward(out_grads=outgrad);
mx.ndarray.waitall()

This is what gets printed to the command line:

Traceback (most recent call last):
  File "script2.py", line 27, in <module>
    mx.ndarray.waitall()
  File "/home/matteo/Git/mxnet/python/mxnet/ndarray/ndarray.py", line 166, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/matteo/Git/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: std::exception

It doesn't fail on master.

@zixuanweeei
Contributor Author

zixuanweeei commented Jul 12, 2019

@matteosal I tested your example with commit 9ca0428, and it ran successfully without any exception. Could you test it again with that commit?

@matteosal
Contributor

@matteosal I tested your example with commit 9ca0428, and it ran successfully without any exception. Could you test it again with that commit?

Oops, sorry, I missed that commit. Yes, it works with that commit.

@zixuanweeei
Contributor Author

@matteosal Thanks, that's great. I am working on a unit test for this feature. Then we can merge it into master after further review and verification.

@zixuanweeei
Contributor Author

@pengzhao-intel @TaoLv Please re-review on this PR. It should be noted that the new unit test function will be unevaluated in context of GPU because of the possible precision degradation resulted from the autotuned cudnn convolution. From a local test, the autotuned convolution has no more than 1.0% mismatches compared to a non-autotuned one, when any of the gradient request is set to be null (atol=1e-3, rtol=1e-3).

Contributor

@pengzhao-intel pengzhao-intel left a comment


LGTM

@pengzhao-intel
Contributor

pengzhao-intel commented Jul 14, 2019

the possible precision degradation from the autotuned cuDNN convolution.

How much degradation on GPU? Could we set a lower bar for GPU?

@zixuanweeei
Contributor Author

I will run some tests to see whether a lower bar works.

@pengzhao-intel
Contributor

pengzhao-intel commented Jul 15, 2019

@TaoLv @ciyongch please review as well.

@karan6181
Contributor

@mxnet-label-bot add [MKLDNN, pr-awaiting-review]

@marcoabreu marcoabreu added the MKLDNN and pr-awaiting-review labels Jul 15, 2019
Contributor

@ciyongch ciyongch left a comment


Overall looks good :)
One more question: does the GPU precision drop happen in both forward and backward?

for var_name in var_names:
    if var_name == "b" and no_bias:
        continue
    if grad_req2[var_name] == "null":
Contributor

We don't have such a case?

Contributor Author

Yup. Requesting only the gradient with respect to the bias is a rare corner case.

@zixuanweeei
Contributor Author

@ciyongch The equality assertion error occurred on the outputs of the convolution forward pass when any of the gradient requests (x, w, b) was null. I am not sure whether the backward pass also loses numerical precision.

Contributor

@ciyongch ciyongch left a comment


Then it's fine to keep the same tolerance for both forward and backward outputs.

@pengzhao-intel
Contributor

Thanks for your contribution. Merging now :)

@pengzhao-intel pengzhao-intel merged commit 1b725c3 into apache:master Jul 16, 2019
juliusshufan pushed a commit to juliusshufan/incubator-mxnet that referenced this pull request Aug 8, 2019
… and bias of convolution (apache#15497)

* Independent req[kBias] and req[kWeight] check

* Add UT for independent conv gradient requests

* Update conv independent grad UT with no_bias enabled

* Check req[kWeight] for avoiding unnecessary prim registration

* Check `OpReqTpye` in CommitOutput automatically

* Lock cudnn autotune for accurate conv output

* Ignore independent gradients test on GPU

* Trigger CI

* Sets a low bar for autotuned cudnn convolution
TaoLv pushed a commit that referenced this pull request Aug 13, 2019
…o weights… (#15805)

* [MKLDNN] Independent gradients requests check with respect to weights and bias of convolution (#15497)

* Independent req[kBias] and req[kWeight] check

* Add UT for independent conv gradient requests

* Update conv independent grad UT with no_bias enabled

* Check req[kWeight] for avoiding unnecessary prim registration

* Check `OpReqTpye` in CommitOutput automatically

* Lock cudnn autotune for accurate conv output

* Ignore independent gradients test on GPU

* Trigger CI

* Sets a low bar for autotuned cudnn convolution

* [Flaky test] Skip test_operator_gpu.test_convolution_independent_gradients (#15631)

* Skip test_convolution_independent_gradirents

* Add an issue link

* Fix inconsistent context of input array and binding op

* Trigger CI

* Retrigger CI
Labels: MKLDNN, pr-awaiting-review