This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[fix] missing input log higher order. #15331

Merged
13 commits merged on Nov 19, 2019
16 changes: 8 additions & 8 deletions src/operator/tensor/elemwise_unary_op_basic.cc
@@ -1090,9 +1090,9 @@ MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_log,
unary_bwd<mshadow_op::log_grad>)
.set_attr<nnvm::FGradient>("FGradient",
[](const nnvm::NodePtr& n, const std::vector<nnvm::NodeEntry>& ograds) {
- // ograds[0]: dL/dxgrad
+ // ograds[0]: dL/dygrad
  // inputs[0]: dL/dy
- // inputs[1]: x
+ // inputs[1]: x (ElemwiseGradUseIn)
// f(x) = y = log(x)
// f'(x) = 1/x
// f''(x) = -1 * (f'(x) * f'(x))
@@ -1117,15 +1117,15 @@ MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_log10,
unary_bwd<mshadow_op::log10_grad>)
.set_attr<nnvm::FGradient>("FGradient",
[](const nnvm::NodePtr& n, const std::vector<nnvm::NodeEntry>& ograds) {
- // ograds[0]: dL/dxgrad
+ // ograds[0]: dL/dygrad

Contributor:
I think this is dL/dx_grad. The head gradient is the gradient with respect to the previous output, right? The previous output is x_grad, or dL/dx, so this thing is dL/(dL/dx), or dL/dx_grad for lack of a better notation.


Contributor Author:
I guess it should be dL/dy_grad, as we are computing/returning dL/dx_grad.
E.g.

y = f(x_grad)
L = g(y)  # x_grad formed part of the network and affected the loss

During backprop, by the chain rule,
dL/dx_grad = dL/dy * dy/dx_grad

In the comments, we have called the dL/dy of the above example dL/dy_grad.

That is why we have
https://github.com/apache/incubator-mxnet/blob/5b95fb3ee3581ba20fe1def336621d68a811e17f/src/operator/tensor/elemwise_unary_op_basic.cc#L1111-L1112

with these multiplications performing

dL/dx_grad = dL/dy * dy/dx_grad
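
A concrete sketch of this chain rule for _backward_log, in the notation used above (the input/output layout is taken from the diff; the formulas assume f(x) = log(x)): the backward node takes dL/dy and x, produces x_grad = dL/dy * (1/x), and its FGradient receives ograds[0] = dL/dx_grad, so it has to return

$$
\frac{\partial L}{\partial (dL/dy)} = \text{ograds}[0]\cdot\frac{\partial\,\text{x\_grad}}{\partial (dL/dy)} = \frac{\text{ograds}[0]}{x},
\qquad
\frac{\partial L}{\partial x} = \text{ograds}[0]\cdot\frac{\partial\,\text{x\_grad}}{\partial x} = \text{ograds}[0]\cdot\frac{dL}{dy}\cdot\left(-\frac{1}{x^{2}}\right)
$$

which matches the f'(x) and f''(x) * dL/dy terms noted in the code comments.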


Contributor @larroy, Jul 25, 2019:
I think the notation is complicating things in excess, as it gets pretty hairy. It's the head gradient of the previous (output) node, which has the shape of x, and is x_grad. So it has to be related to x, not y.

I think in Lagrange notation it would be $$F_{L_x}$$ (the derivative of some head function with respect to the derivative of the first loss w.r.t. x, i.e. x_grad).


Contributor Author:
Oh, I get it now. If I understand it correctly then, crudely, ograds[0] is how much x_grad affects L, and then we compute how x_grad changes with x. Makes sense now.

Thank you very much. Will reflect it in this and other PRs.


Contributor:
@kshitij12345 I think what you write makes sense. I'm also unsure about notations; maybe you can come up with a better one. If not, maybe we leave the comment out so we can merge the PR, as the code seems to do what's needed.


Contributor Author:
Sure. Thanks again.

  // inputs[0]: dL/dy
- // inputs[1]: x
+ // inputs[1]: x (ElemwiseGradUseIn)

Contributor:
nice comment, helps.

// f(x) = y = log10(x)
// f'(x) = 1 / (log(10) * x)
// f''(x) = -1 * (f'(x) * 1/x)
auto dydx_mul_dldy = nnvm::NodeEntry{n}; // f'(x) * head_grads
  auto dydx = MakeNode("elemwise_div", n->attrs.name + "_dydx",
-     {n->inputs[0]}, nullptr, &n);
+     {dydx_mul_dldy, n->inputs[0]}, nullptr, &n);
auto dlogx = MakeNode("reciprocal", n->attrs.name + "_dlogx",
{n->inputs[1]}, nullptr, &n);
auto d2ydx2_mid = MakeNode("elemwise_mul", n->attrs.name + "_d2ydx2_mid",
@@ -1146,15 +1146,15 @@ MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_log2,
unary_bwd<mshadow_op::log2_grad>)
.set_attr<nnvm::FGradient>("FGradient",
[](const nnvm::NodePtr& n, const std::vector<nnvm::NodeEntry>& ograds) {
- // ograds[0]: dL/dxgrad
+ // ograds[0]: dL/dygrad
  // inputs[0]: dL/dy
- // inputs[1]: x
+ // inputs[1]: x (ElemwiseGradUseIn)
// f(x) = y = log2(x)
// f'(x) = 1 / (log(2) * x)
// f''(x) = -1 * (f'(x) * 1/x)
auto dydx_mul_dldy = nnvm::NodeEntry{n}; // f'(x) * head_grads
  auto dydx = MakeNode("elemwise_div", n->attrs.name + "_dydx",
-     {n->inputs[0]}, nullptr, &n);
+     {dydx_mul_dldy, n->inputs[0]}, nullptr, &n);
auto dlogx = MakeNode("reciprocal", n->attrs.name + "_dlogx",
{n->inputs[1]}, nullptr, &n);
auto d2ydx2_mid = MakeNode("elemwise_mul", n->attrs.name + "_d2ydx2_mid",
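
As a quick end-to-end check of the fixed higher-order gradient, the sketch below uses mxnet.autograd.grad with create_graph=True to differentiate the first-order gradient of log10 a second time and compares the result with the analytic f''(x) = -1 / (log(10) * x^2). This is an illustrative snippet assuming an MXNet build that includes this fix; the input values and names (x, x_grad, expected) are chosen here for illustration and are not taken from the PR's test suite.

```python
import math

from mxnet import autograd, nd

x = nd.array([0.5, 1.0, 2.0, 4.0])
x.attach_grad()

with autograd.record():
    y = nd.log10(x)
    # First-order gradient, kept in the graph (create_graph=True) so that it
    # can be differentiated once more.
    x_grad = autograd.grad(y, x, head_grads=nd.ones_like(y),
                           create_graph=True, retain_graph=True)[0]

# Backpropagate through x_grad; with an all-ones head gradient (ograds[0] == 1),
# x.grad then holds d(x_grad)/dx = f''(x).
x_grad.backward()

expected = -1.0 / (math.log(10) * x.asnumpy() ** 2)
print(x.grad.asnumpy())
print(expected)
```

The same pattern applies to log and log2 by swapping the operator and the expected second derivative (-1/x^2 and -1/(log(2) * x^2), respectively).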