sym.Sqrt gradient inf ? #2261

Closed

juliandewit opened this issue May 27, 2016 · 8 comments

Comments

@juliandewit

juliandewit commented May 27, 2016

Hello, I am experimenting a bit with composing layers.
My network is now a bit unstable: using the monitor, I get "inf" for the gradient of a sqrt symbol.

Looking at the gradient pass of the sqrt symbol, I see:

```cpp
struct square_root_grad {
  template<typename DType>
  MSHADOW_XINLINE static DType Map(DType a) {
    // Elementwise gradient kernel: returns 0.5 / a.
    return DType(DType(0.5f) / a);
  }
};
```

I'm not 100% sure how everything works, but doesn't this pose a problem if the incoming gradient (a) == 0?

It would explain my issues.
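For intuition, here is a minimal NumPy sketch (illustrative values, not from this run) of how a `0.5 / a` kernel behaves as `a` approaches zero:

```python
import numpy as np

# Same formula as square_root_grad::Map, applied elementwise.
a = np.array([1.0, 0.5, 1e-4, 0.0])
print(0.5 / a)
# -> [0.5  1.0  5000.0  inf], plus a divide-by-zero RuntimeWarning for the last entry
```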

@juliandewit
Author

juliandewit commented May 27, 2016

Hmm, it's even more strange.
I set the minibatch size to 1.

The gradient comes from a reshape symbol with shape (1,1),
so there is only one value, which is 0.499....
As I understand it, this flows into the sqrt, and then the gradient (also shape (1,1)) is 99.931.

See the log below:

```
INFO:root:Batch:       1 reshape3_backward_data         0.499396
INFO:root:Batch:       1 sqrt1_backward_data            99.931
```

Am I not understanding something here?
I would expect 0.5 / 0.499 ≈ 1.

Doing the reshape before the sqrt gave the same result.
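For reference, a quick NumPy check of what these numbers imply (later in the thread it is confirmed that the kernel's `a` is the sqrt op's *output*, not the incoming gradient):

```python
import numpy as np

incoming_grad = 0.499396   # reshape3_backward_data from the log
observed_grad = 99.931     # sqrt1_backward_data from the log

# If a were the incoming gradient, the kernel would give ~1, as expected here:
print(0.5 / incoming_grad)                     # ~1.0012

# If instead the kernel divides by the op's output (chain rule:
# in_grad = incoming_grad * 0.5 / output), the output consistent with the log is:
print(incoming_grad * 0.5 / observed_grad)     # ~0.0025, a sqrt output near zero
```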

@sxjscience
Member

@juliandewit Could you provide a code example? That way I can reproduce the error and debug locally.

@juliandewit
Author

The project as-is is too complex, but I'll try to come up with something more isolated.

@sxjscience
Member

Great!

@juliandewit
Author

juliandewit commented May 27, 2016

I have an example that exhibits another problem, but it might explain my issues.
It most probably does not work, but I'm looking at the first monitor output and it confuses me.

Minibatch size = 1, so all shapes are (1,1):

- Y = 0.071667
- output = almost zero
- The square and sqrt outputs look correct.
- The MakeLoss output is '0', which seems strange to me, but could be right...
- The MakeLoss gradient is '1', which also seems strange, but could be right.
- However, the sqrt gradient is '6.974...', which I just do not understand.

Am I doing something wrong? Am I using the software incorrectly?

custom_loss.txt

monitor1.txt

Thanks in advance..

@sxjscience
Member

I've checked the code. The 0.0716871 here is the value of the loss function, i.e. the output of the sqrt op. The gradient kernel divides by that output, so the gradient is 0.5 / 0.0716871 = 6.974.

Also, I've found a bug in the MakeLoss op at https://github.com/dmlc/mxnet/blob/master/src/operator/make_loss-inl.h#L55 that causes the MakeLoss output to be zero. Since directly writing A = B will not copy the value (dmlc/mshadow#50), we need to use 1.0f * data instead. I'll make a PR for this.
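A quick check of that arithmetic in Python, using the values from the monitor log above:

```python
sqrt_output = 0.0716871  # the loss value, i.e. the sqrt op's output
head_grad = 1.0          # MakeLoss backpropagates a gradient of 1
print(head_grad * 0.5 / sqrt_output)  # ~6.9748, matching the monitored '6.974...'
```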

@piiswrong
Contributor

The gradient of sqrt at 0 is inf, so it's not a stable loss function.
Use sqrt(1 + relu(x)) to get a stable loss.
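A minimal NumPy sketch (my own illustration, not from the thread) comparing the raw sqrt gradient with the stabilized version near zero:

```python
import numpy as np

x = np.array([1.0, 1e-4, 1e-8])

# d/dx sqrt(x) = 0.5 / sqrt(x): unbounded as x -> 0.
print(0.5 / np.sqrt(x))              # [0.5, 50.0, 5000.0]

# d/dx sqrt(1 + relu(x)) = 0.5 / sqrt(1 + relu(x)) for x > 0: bounded by 0.5.
relu = np.maximum(x, 0.0)
print(0.5 / np.sqrt(1.0 + relu))     # all ~0.5, stable near zero
```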

@juliandewit
Author

juliandewit commented May 28, 2016

Thanks.
OK, conclusion:
the gradient of sqrt is working as expected;
it's only unstable when the sqrt output is 0,
so I will have to stabilize it.
I'll also pull the MakeLoss fix to be sure.
Closed.
