Nan loss #1068
I am getting nan loss from the first epoch.

import numpy as np
from torch import nn
import torch
from skorch import NeuralNetRegressor

X = torch.arange(0, 1_000, 1, dtype=torch.float32).reshape(-1, 1)
m = 2
y = m * X

class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        self.linear = torch.nn.Linear(inputSize, outputSize)

    def forward(self, x):
        return self.linear(x)

net = NeuralNetRegressor(
    linearRegression,
    module__inputSize=1,
    module__outputSize=1,
    max_epochs=10,
)
net.fit(X, y)
Even if I do a near-perfect initialization, the loss comes out to be nan. Is this a bug in PyTorch and not an error from my side?

import numpy as np
from torch import nn
import torch
from skorch import NeuralNetRegressor

X = torch.arange(0, 1_000, 1, dtype=torch.float32).reshape(-1, 1)
m = 1
y = m * X

class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        self.linear = nn.Linear(inputSize, outputSize)  # (torch.ones(size=(inputSize, outputSize), requires_grad=True))
        with torch.no_grad():  # Prevents gradient tracking
            self.linear.weight.data = torch.tensor([[0.999999]])
            self.linear.bias.data = torch.tensor([0.0])

    def forward(self, x):
        return self.linear(x)

net = NeuralNetRegressor(
    linearRegression,
    module__inputSize=1,
    module__outputSize=1,
    max_epochs=3,
    criterion=nn.MSELoss(),
    train_split=None,
    lr=3e-4,
)
net.fit(X, y)
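For reference, the same divergence can be reproduced without skorch at all, which fits the explanation in the answer below: it comes from the optimization dynamics, not from a library bug. The following is a minimal plain-PyTorch sketch, not part of the original thread; it uses full-batch SGD for simplicity, whereas skorch trains on mini-batches.

import torch
from torch import nn

# Same data and near-perfect initialization as in the snippet above.
X = torch.arange(0, 1_000, 1, dtype=torch.float32).reshape(-1, 1)
y = 1 * X

model = nn.Linear(1, 1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[0.999999]]))
    model.bias.copy_(torch.tensor([0.0]))

optimizer = torch.optim.SGD(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

for step in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    # Because the inputs go up to 999, the MSE gradients are huge and each
    # update with lr=3e-4 overshoots: the printed loss grows every step,
    # overflows to inf, and eventually turns into nan.
    print(step, loss.item())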
Answer from BenjaminBossan (Oct 24, 2024):

The reason why you're observing this is floating point arithmetic. Even though mathematically the net's output should correspond exactly to y, in practice there are small differences: e.g. if the target is 127.0, the prediction is 126.9999. Given these small differences, the loss is non-zero, so the parameters are changed a little bit. Normally this should reduce the error, but because the learning rate is too big, the error actually increases, making the difference bigger and bigger after each update. If you change the learning rate to something smaller, like 1e-6 or 1e-7, you won't see this diverging behavior. Note that regression uses MSE loss by default, which can be very sensitive to outliers…
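To illustrate the suggested fix, here is a sketch (not from the thread) of the snippet from the follow-up question with only the learning rate lowered; 1e-7 is just one pick from the 1e-6 to 1e-7 range suggested in the answer. With this change the loss stays finite and decreases instead of diverging to nan.

import torch
from torch import nn
from skorch import NeuralNetRegressor

X = torch.arange(0, 1_000, 1, dtype=torch.float32).reshape(-1, 1)
y = 1 * X

class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        self.linear = nn.Linear(inputSize, outputSize)
        with torch.no_grad():
            self.linear.weight.data = torch.tensor([[0.999999]])
            self.linear.bias.data = torch.tensor([0.0])

    def forward(self, x):
        return self.linear(x)

net = NeuralNetRegressor(
    linearRegression,
    module__inputSize=1,
    module__outputSize=1,
    max_epochs=3,
    criterion=nn.MSELoss(),
    train_split=None,
    lr=1e-7,  # small enough that each update no longer overshoots
)
net.fit(X, y)  # the train loss now stays finite and decreases instead of becoming nan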