[bug] fix higher grad log #15120
Conversation
* bug: the head_grads were not preserved in higher order.
* add test to validate the fix of the same.
@apeforest @larroy please review
      auto ggx = MakeNode("negative", n->attrs.name + "_backward_grad_grad",
                          {nnvm::NodeEntry{ggx_mid}}, nullptr, &n);

      std::vector<nnvm::NodeEntry> ret;

      ret.emplace_back(MakeNode("elemwise_mul", n->attrs.name + "_backward_grad_grad",
    -                           {ograds[0], gx}, nullptr, &n));
    +                           {ograds[0], nnvm::NodeEntry{g_lx}}, nullptr, &n));
I am having trouble with head_grads.grad, which is being returned as all zeros (I guess it is somehow not being updated), while I expect it to be the output of this line. Please help.
Hi. What do you mean by head_grads.grad? NodeEntry doesn't have a grad field; could you clarify? Are you referring to the Python code below? The gradient is always 0 when attach_grad() is called; the value is updated only after running backward on an output, or when using autograd.grad.
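A minimal sketch of that behaviour, assuming the usual MXNet NDArray/autograd API:

    from mxnet import nd, autograd

    x = nd.array([1.0, 2.0, 3.0])
    x.attach_grad()
    print(x.grad)    # all zeros right after attach_grad()

    with autograd.record():
        y = nd.log(x)
    y.backward()     # the value is written only after backward ...
    print(x.grad)    # ... now holds dy/dx = 1/x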
Still looking into this. The first output should be the gradient of y_grad; however, head_grads.grad does not get the value. I suspect the returned value from this function is dropped in the gradient calculation in imperative.cc. I will look more into this. Stay tuned.
Sorry for the confusion. I forgot to add the line from the test file. Sure, waiting to hear what you find.
Thanks for your contributions @kshitij12345
Hi @kshitij12345, thanks for looking into this. I think we need to clarify what exactly we have in the first parameter of FGradient, "node". We were a bit puzzled, together with @apeforest, when looking at your PR. I validated the results with the tests, but I think I tried only one log (I don't remember which base), and the result seemed correct to me; I guess I missed this problem.

Why do you say that node is ograd*f'(x)? The node argument, as I understand it, is the node to calculate the gradient for; in this case we are calculating the gradient of the backward of the log. So are you saying that, by the chain rule, the node is ograd(of log) * d(log(x))/dx = ograd * reciprocal? It would be great if we could add this to the documentation, either to the FGradient typedef or to new_op; otherwise I always have to dig through the code to refresh this. I think it is poorly documented and tricky.
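For reference, one way to write out the chain-rule reading being discussed, in the thread's own notation (a sketch, assuming y = log(x) for concreteness):

    % n (_backward_log) computes the fused product of the head gradient
    % and f'(x), so "node" here is indeed ograd * reciprocal(x):
    \[
      \texttt{gx} \;=\; \frac{\partial L}{\partial x}
      \;=\; \frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial x}
      \;=\; \mathrm{ograd}\cdot\frac{1}{x}
    \]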
    # Validate the gradients.
    assert_almost_equal(expected_grad_grad, x.grad.asnumpy())
    assert_almost_equal(expected_heads_grad, head_grads.grad.asnumpy())
Now I understand your question. I don't think anything is updating head_grads.grad here (this is done when running backward). Why do you want to set the head gradients manually? To verify your fix?
Can you try

    y_grad_grad = autograd.grad(y_grad, x, ..., create_graph=False, ...)[0]

and, in the validation,

    assert_almost_equal(expected_heads_grad, y_grad_grad.asnumpy())
Yeah, to verify the fix. I expected y_grad.backward(head_grad_grads) to update head_grads.grad, similar to the PyTorch script from the description. Thanks for the suggestion; I will surely try that.
y_grad.backward(head_grad_grads) indicates that head_grad_grads are the head gradients passed from "upstream". Calling (output variable).backward updates all the independent input variables (on which those outputs depend) which have an attached gradient. In this case head_grad_grads is not an input to the graph, so your observation that the grad doesn't get updated is expected:
https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/ndarray/ndarray.py#L2188
https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L270
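A small sketch of that point, assuming the standard MXNet autograd API:

    from mxnet import nd, autograd

    x = nd.array([1.0, 2.0])
    x.attach_grad()
    head = nd.array([0.5, 0.5])
    head.attach_grad()    # attaching a grad does not make `head` a graph input

    with autograd.record():
        y = nd.log(x)     # `head` never appears inside the recorded graph
    y.backward(head)      # it only supplies the incoming (head) gradient

    print(x.grad)         # written: head * 1/x
    print(head.grad)      # stays zeros, as described above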
We are checking gradients for head_grads (not head_grad_grads), which is used to compute x_grad, so I believe we should accumulate some gradient in head_grads.
The default behaviour in PyTorch vs MXNet is different with respect to accumulation of gradients (PyTorch: add; MXNet: write). Having said that, I still don't understand why you expect gradient accumulation in head_grads.
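A quick sketch of the "write" vs "add" difference, assuming MXNet's grad_req parameter to attach_grad:

    from mxnet import nd, autograd

    x = nd.array([2.0])
    x.attach_grad()                  # default grad_req='write'
    for _ in range(2):
        with autograd.record():
            y = x * x
        y.backward()
    print(x.grad)                    # [4.]: overwritten on each pass

    x.attach_grad(grad_req='add')    # PyTorch-style accumulation
    for _ in range(2):
        with autograd.record():
            y = x * x
        y.backward()
    print(x.grad)                    # [8.]: accumulated across passes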
Oh. I just expected it to have gradients (by accumulation or writing), as its value is used while computing x_grad. But from your and @apeforest's explanations, I kind of understand the behaviour better now. Thank you for digging in and explaining.
    - auto gx = nnvm::NodeEntry{n};
    + auto gx_mul_head_grads = nnvm::NodeEntry{n};  // f'(x) * head_grads
    + auto head_grads = nnvm::NodeEntry{n->inputs[0]};
    + auto g_lx = MakeNode("reciprocal", n->attrs.name + "_backward_log_grad",
Can we add a comment about the inputs and about what g_lx is? It would help reason about the code. Are the inputs of n (backward_log):
- 0: input gradient
- 1: x
?
So g_lx is a node holding 1/x, i.e. the derivative of the log, right? Can we rename it to g_logx?
Sure thing.
Following the unary_bwd which leads here: it fuses the head gradient with f'(x), thus the node is the multiplication of the two (ograd * f'(x)). Also, I had observed some weird behaviour for functions wrapped that way.
Yes, it is. True, there should be some documentation regarding this.
@kshitij12345 I have some questions about the equation. My understanding from the chain rule is x_grad = dL/dy * dy/dx. What is the meaning of dL/d y_grad? Are we treating y_grad as another input variable here? Many thanks for your clarification.
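One possible reading of dL/d y_grad, consistent with the expected values in the test below (an interpretation sketch, not an authoritative definition):

    % Treating y_grad as an input of the backward graph,
    %   x_grad = y_grad * f'(x),
    % so the gradient flowing back into y_grad is
    \[
      \frac{\partial\,\texttt{x\_grad}}{\partial\,\texttt{y\_grad}} \;=\; f'(x),
    \]
    % which, scaled by the second pass's head gradient, gives
    % expected_heads_grad = grad_x * head_grad_grads.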
I did some more probing. I think the reason that head_grads.grad is all zeros is that the variable head_grads was not specified during the second backward pass. I updated the test accordingly, but got an assertion error: during the computation-graph traversal, it complains that the variable head_grads is unreachable from the output, which I think is reasonable. This again comes back to my question above: what is the mathematical meaning of head_grads.grad, and why do we need this value?
@apeforest see my reply above: head_grads is not in the graph, so it's not updated during backward.
@larroy Yes, I agree with your reply. Also, I don't understand the meaning of (or the need for) returning dL/dy_grad. @kshitij12345 Please comment. Thanks.
    y_grad = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
    y_grad.backward()
    assert_almost_equal(expect_grad_grad.asnumpy(), x.grad.asnumpy())
    y_grad = autograd.grad(y, x, head_grads=head_grads,
This variable is actually dL/dx; maybe rename it to x_grad for better readability?
Oh yes. Will do that.
    x.attach_grad()

    # Manual head_grads.
    head_grads = nd.random.normal(shape=x.shape)
Rename this to y_grad, as it is dL/dy?
Sure.
    # Manual head_grads.
    head_grads = nd.random.normal(shape=x.shape)
    head_grad_grads = nd.random.normal(shape=x.shape)
I still don't understand what this variable is mathematically...
head_grads is just the input node in the graph for x_grad. head_grad_grads is just to check the validity of the chain rule/backprop.
Thanks for the clarification.
As per backward accumulation of gradients and the chain rule, we always have an incoming gradient (also called head gradient or output gradient). So the second backward pass should calculate x_grad_grad = head_grad_grads * d(x_grad)/dx. I'm thinking that maybe the problem is that we should not reuse the head gradient from the first gradient in the second gradient. Shouldn't the two head gradients be independent variables? Let me know what you think.
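Spelling out that second pass with the test's variable names (a sketch, under the assumption that the two head gradients are independent variables, w1 = head_grads and w2 = head_grad_grads):

    % First backward pass:  x_grad = w_1 \cdot f'(x)
    % Second backward pass:
    \[
      \texttt{x.grad} \;=\; w_2 \cdot \frac{\partial\,\texttt{x\_grad}}{\partial x}
      \;=\; w_2 \cdot w_1 \cdot f''(x),
    \]
    % which matches expected_grad_grad = grad_grad_x * head_grad_grads * head_grads.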
I was following on the basis of a graph that I had in my mind. Even I am not sure about the mathematical meaning of it. Also, in the usual scenario it would essentially be an intermediate node, like in the picture. However, since we are returning it, we might as well test for it. And in the case of the current test, I believe the graph for the second order should be as pictured. I hope it makes sense. Thank you.
    # Compute expected values.
    expected_grad_grad = grad_grad_x.asnumpy() * head_grad_grads.asnumpy() * \
                         head_grads.asnumpy()
    expected_heads_grad = grad_x.asnumpy()
Should be grad_x.asnumpy() * head_grad_grads.asnumpy()
@kshitij12345 The computation graph for the second backward pass makes sense to me. As you can see, there is only one output from the graph, that is x_grad_grad. It is not clear to me where the output for y_grad.grad comes from.
Actually I haven't shown the other graph, for y_grad.grad. If you focus only on that, the code below is what I have in my head (which passes the assertions with the current code):

    from mxnet import nd, autograd
    from mxnet.test_utils import assert_almost_equal

    def check_second_order_unary(x, op, grad_op, grad_grad_op):
        x = nd.array(x)
        grad_x = grad_op(x)
        grad_grad_x = grad_grad_op(x)
        x.attach_grad()

        # Manual head_grads.
        y_grad = nd.random.normal(shape=x.shape)
        head_grad_grads = nd.random.normal(shape=x.shape)
        y_grad.attach_grad()

        # Perform compute.
        with autograd.record():
            y = op(x)
            x_grad_mid = autograd.grad(y, x, head_grads=nd.ones_like(x),
                                       create_graph=True, retain_graph=True)[0]
            x_grad = x_grad_mid * y_grad  # Note
        x_grad.backward(head_grad_grads)

        # Compute expected values.
        expected_grad = grad_x.asnumpy() * y_grad.asnumpy()  # Note
        expected_grad_grad = grad_grad_x.asnumpy() * head_grad_grads.asnumpy() * \
                             y_grad.asnumpy()
        expected_heads_grad = grad_x.asnumpy() * head_grad_grads.asnumpy()

        # Validate the gradients.
        assert_almost_equal(expected_grad, x_grad.asnumpy())  # Note
        assert_almost_equal(expected_grad_grad, x.grad.asnumpy())
        assert_almost_equal(expected_heads_grad, y_grad.grad.asnumpy())
1.

    x_grad_mid = autograd.grad(y, x, head_grads=nd.ones_like(y),
                               create_graph=True, retain_graph=True)[0]
    x_grad = x_grad_mid * y_grad  # Note this part.
    assert_almost_equal(expected_grad, x_grad.asnumpy())  # Passes

2.

    x_grad = autograd.grad(y, x, head_grads=y_grad,
                           create_graph=True, retain_graph=True)[0]
    assert_almost_equal(expected_grad, x_grad.asnumpy())  # Passes

So I expect 1. and 2. to have similar behaviour, as both of them use y_grad. As in the first case y_grad.grad gets updated, my point being that I would expect it to be updated in the second case as well.
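For what it's worth, a hypothetical invocation of the helper above for log (the lambdas are assumptions standing in for the test file's grad_op/grad_grad_op):

    import numpy as np
    from mxnet import nd

    check_second_order_unary(
        np.random.rand(4) + 0.5,       # keep x positive for log
        nd.log,                        # op
        lambda x: 1 / x,               # grad_op:      f'(x)
        lambda x: -1 / (x ** 2))       # grad_grad_op: f''(x)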
@kshitij12345 I think it's because of the design of the Python backward API in MXNet. When you specify head_grads explicitly, it is treated as a constant rather than as an input of the recorded graph. As in your case 2: if you perform another backward on x_grad, y_grad is not part of the graph, so nothing is written to y_grad.grad. As in your case 1: you implicitly made y_grad an input variable when calling backward on x_grad, and that is why you get values in y_grad.grad.

I tried two variants. Case 1.1: I again don't get any values for y_grad, because the output only contains one gradient variable. Case 1.2: I explicitly set y_grad as an input variable, and I then get the expected result, as in your case 1.

At this point, I am not sure if this is a bug, because the backward API is designed differently from PyTorch's. If y_grad is not specified as part of the input variables to take the gradient with respect to, it will not get values assigned, even with a gradient attached. Thanks a lot for your careful drawing and insightful discussion.
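A sketch contrasting the two cases just described, assuming the MXNet autograd API (names follow the thread):

    from mxnet import nd, autograd

    x = nd.array([1.0, 2.0]); x.attach_grad()
    y_grad = nd.ones_like(x); y_grad.attach_grad()

    # Case 1: y_grad enters the recorded graph as a real input.
    with autograd.record():
        y = nd.log(x)
        x_grad = autograd.grad(y, x, create_graph=True,
                               retain_graph=True)[0] * y_grad
    x_grad.backward()
    print(y_grad.grad)   # written: y_grad is an input of the graph

    y_grad.grad[:] = 0   # reset for clarity

    # Case 2: y_grad is only a head gradient, i.e. a constant to autograd.
    with autograd.record():
        y = nd.log(x)
        x_grad = autograd.grad(y, x, head_grads=y_grad,
                               create_graph=True, retain_graph=True)[0]
    x_grad.backward()
    print(y_grad.grad)   # stays zeros: y_grad never entered the graph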
As a follow-up, I just dumped out the computation graph in case 2. Indeed, the node that would calculate y_grad.grad is not even in the final symbolic graph, because there is no input dependency.
Oh, thank you very much for explaining what is happening. I get it now. Also, it would be really great if you could tell me how to get the dump of the graph; that would be really helpful. Thank you again for your time and efforts.
@apeforest can you paste your dump of the graph, and share where you dumped it? I was working on a utility to dump the graph.
@kshitij12345 There is a utility function for this, but in short it's not an easy-to-use debugging utility; @larroy is working to improve it and will create a PR soon. Stay tuned :)
I don't have a clear answer on how to test this value now. Definitely not through a Python unit test, because the value does not get output as an NDArray. Besides, as I commented earlier, even if that computation node was in the nnvm graph, it may have been deleted later by graph optimization because it is a dangling node with no output.
* remove assertion for y_grad gradient.
* rename variables.
* fix and update computation.
@apeforest, thank you for explaining how to get the dump of the graph. Waiting for the PR which simplifies that. Also, could you tell me whether it is possible to print a Node and its value (for the array) via C++, and if yes, how?
We don't yet have a utility function to do that. What do you mean by the value of the node? A node only represents a computation in the graph. The values are passed in and out when the graph is traversed and each node invokes the registered FCompute/FGradient functions. I think we can only track values in the corresponding NDArray outputs. Please let me know if I misunderstood your question.
Oops, sorry for the confusion. I meant the value of an NDArray corresponding to a computation. Thank you.
Right now it is very difficult to print the value of an NDArray. I will work on a utility to dump the graph this week: first the graph itself, and if possible the NDArray values.
There is no available utility to print out values from an NDArray. I used to write a for loop to iterate over the elements.
    y_grad = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
    y_grad.backward()
    assert_almost_equal(expect_grad_grad.asnumpy(), x.grad.asnumpy())
    x_grad = autograd.grad(y, x, head_grads=y_grad,
Can you explicitly specify the arguments by name, i.e.

    - x_grad = autograd.grad(y, x, head_grads=y_grad,
    + x_grad = autograd.grad(heads=y, variables=x, head_grads=y_grad, create_graph=True, retain_graph=True)

I think this will make it easier to understand.
@kshitij12345 Could you please rebase and retrigger CI again? Thanks!
…to fix-higher-grad-log-bug
* explicitly pass arguments with name.
LGTM. Let's wait for the CI to pass.
@kshitij12345 still one GPU test failed. I looked at the log and don't find it related to your change. Could you please rebase and trigger CI one more time? The master branch was broken last weekend. Thanks again.
…to fix-higher-grad-log-bug
@apeforest Note that the pending job has succeeded, but for some reason it isn't updated here.
@kshitij12345 There was some issue with CI recently. Could you please re-trigger it one more time? Sorry for the inconvenience.
Co-Authored-By: Lin Yuan <[email protected]>
This is much clearer, thanks. Nice variable naming.
    - auto gx = nnvm::NodeEntry{n, 0, 0};
    - auto g_lx = MakeNode("reciprocal", n->attrs.name + "_backward_log_grad",
    + auto dydx_mul_dldy = nnvm::NodeEntry{n};  // f'(x) * head_grads
    + auto dydx = MakeNode("elemwise_div", n->attrs.name + "_dydx",
Doesn't elemwise_div require two inputs?
Oh, it does. However, I guess this part is skipped in the test computation graph, and hence we don't see the error. Will fix it. However, we should somehow find a way to test for this.
There is a bug in the implementation of the higher order gradient of log:
https://github.com/apache/incubator-mxnet/blob/7b343d1fcde73b61322985580080333d9eee9e82/src/operator/tensor/elemwise_unary_op_basic.cc#L1077-L1079

We multiply gx * gx, where gx = ograd * f'(x), getting ograd^2 * f'(x)^2; however, we want only ograd * f'(x)^2, which can be achieved in a similar fashion to the implementation of _backward_log10/2.

I have validated the expected results for the grad on x, which fails with the current code. Have confirmed the behaviour with PyTorch as well.
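A minimal numeric sanity check of the formulas above (plain NumPy, nothing MXNet-specific; for y = log(x), f'(x) = 1/x and f''(x) = -1/x^2):

    import numpy as np

    x = np.random.rand(5) + 0.5
    ograd = np.random.rand(5)

    gx = ograd / x                 # ograd * f'(x), the first backward output
    buggy = gx * gx                # ograd^2 * f'(x)^2 (what the old code computed)
    wanted = -ograd / (x * x)      # ograd * f''(x) = -ograd * f'(x)^2

    # Finite-difference check that d(gx)/dx equals `wanted`, not `buggy`:
    eps = 1e-6
    fd = (ograd / (x + eps) - ograd / (x - eps)) / (2 * eps)
    assert np.allclose(fd, wanted, atol=1e-4)
    assert not np.allclose(fd, buggy)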