Problems with automatic_optimization=False #4295

Closed
catalys1 opened this issue Oct 21, 2020 · 5 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on), logger (Related to the Loggers)

Comments

@catalys1
Contributor

🐛 Bug

When automatic_optimization = False and terminate_on_nan = True, an exception is raised when checking for nan values. This happens because None is passed as the loss value to self.detect_nan_tensors. The underlying problem seems to be that the AttributeDict returned from train_step has loss=None. The code on master has already changed from what I'm seeing in 1.0.3, so I don't know whether this has already been fixed.
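To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not Lightning's code: check_finite and check_finite_guarded are hypothetical stand-ins, and a float stands in for the loss tensor. It only shows that a finiteness check handed None fails with a type error rather than a NaN error, and that a None guard avoids the crash.

```python
import math
from typing import Optional


def check_finite(loss: float) -> None:
    """Hypothetical stand-in for a NaN check; raises if loss is nan or inf."""
    # math.isfinite(None) raises TypeError, which mirrors the reported crash.
    if not math.isfinite(loss):
        raise ValueError("The loss returned in training_step is nan or inf.")


def check_finite_guarded(loss: Optional[float]) -> bool:
    """Skip the check when no loss was returned; report whether it ran."""
    if loss is None:
        return False
    check_finite(loss)
    return True
```

With the guard in place, a step that returns no loss is skipped, while a real nan loss still raises.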

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1qQmP6BwQk--rBXC7W45y0mn6QK39IPcc

Expected behavior

Don't crash when automatic_optimization = False and terminate_on_nan = True

@catalys1 catalys1 added the bug and help wanted labels Oct 21, 2020
@github-actions
Contributor

Hi! Thanks for your contribution! Great first issue!

@catalys1
Contributor Author

I discovered this because the loss was showing up as nan in the progress bar, and I was trying to figure out why. I dug some more, and it looks like this is itself a bug. I've inspected the loss and network parameters over several steps, and there are no nans. So there seems to be a problem somewhere in the logging: with automatic_optimization=False, nan is logged as the loss in the progress bar.

@catalys1 catalys1 changed the title from "Problem with automatic_optimization=False and terminate_on_nan=True" to "Problems with automatic_optimization=False" Oct 21, 2020
@edenlightning edenlightning added the logger and duplicate labels Oct 22, 2020
@SeanNaren SeanNaren reopened this Nov 1, 2020
@SeanNaren SeanNaren removed the duplicate label Nov 1, 2020
@SeanNaren
Contributor

Thanks @catalys1, you are correct. However, recent changes should have resolved this issue, since the nan check now only runs when using automatic optimization:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L779-L789
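The gating described here can be sketched roughly as follows. This is only an illustration of the control flow, not the code behind the link: finish_step and its return strings are hypothetical, and a float stands in for the loss tensor.

```python
import math
from typing import Optional


def finish_step(loss: Optional[float],
                automatic_optimization: bool,
                terminate_on_nan: bool) -> str:
    """Hypothetical post-step hook: decide whether the NaN check runs."""
    # With manual optimization the loop may have no loss to inspect,
    # so the check is skipped entirely rather than fed None.
    if not automatic_optimization or not terminate_on_nan:
        return "skipped"
    if loss is None or not math.isfinite(loss):
        raise ValueError("loss is missing, nan, or inf")
    return "checked"
```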

In #4204 we'll update the docs to make it clearer that you should report values within the training step :)

@GregorySenay

Hi @catalys1,

In training_step, you may be able to work around the nan issue by appending to running_loss directly:

self.trainer.train_loop.running_loss.append(loss)

In my case, there is no more nan when automatic_optimization=False.
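One plausible explanation for why this workaround helps, which is an assumption and not confirmed in this thread, is that the progress bar shows the mean of a running-loss window and falls back to nan when nothing has been appended to it. The sketch below models that behavior in plain Python; RunningWindow and progress_bar_loss are hypothetical names, not Lightning's classes.

```python
from collections import deque
from typing import Deque, Optional


class RunningWindow:
    """Hypothetical running-loss accumulator over a fixed-size window."""

    def __init__(self, window_length: int = 20) -> None:
        self.values: Deque[float] = deque(maxlen=window_length)

    def append(self, value: float) -> None:
        self.values.append(value)

    def mean(self) -> Optional[float]:
        # An empty window has no mean to report.
        if not self.values:
            return None
        return sum(self.values) / len(self.values)


def progress_bar_loss(acc: RunningWindow) -> str:
    """Format the displayed loss, falling back to nan for an empty window."""
    mean = acc.mean()
    shown = float("nan") if mean is None else mean
    return f"{shown:.3g}"


acc = RunningWindow()
print(progress_bar_loss(acc))  # nothing appended yet
acc.append(0.25)
acc.append(0.75)
print(progress_bar_loss(acc))  # mean of the appended losses
```

Under this assumption, a manual-optimization loop that never appends to the window would display nan even though the actual loss values are fine, which matches what was observed above.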

@Maddy12

Maddy12 commented Mar 10, 2021

I am having the same issue, but I am also trying to add other logged scalars to the progress bar, and they are not showing up at all.

As for printing the loss, @GregorySenay's comment worked for me!

self.trainer.train_loop.running_loss.append(loss)


5 participants