make metric-comparison in ModelCheckpoint robust to NaN #1008

Closed
AljoSt opened this issue Mar 2, 2020 · 4 comments · Fixed by #1097
Labels
feature Is an improvement or enhancement help wanted Open to be worked on
Comments

@AljoSt

AljoSt commented Mar 2, 2020

🐛 Bug

When the metric monitored by ModelCheckpoint becomes NaN in one epoch and then returns to a number in the following epoch, the model is not saved, because comparisons with NaN always return False.

Here is a screenshot from my training:

Screenshot from 2020-03-02 14-31-46
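
The comparison behaviour described above can be reproduced in plain Python. This is only an illustrative sketch: `naive_is_better` is a hypothetical stand-in for a "lower is better" check and is not copied from the ModelCheckpoint source.

```python
nan = float("nan")

# Under IEEE 754 semantics, every ordered comparison involving NaN is False.
nan_less = nan < 0.5
nan_greater = nan > 0.5
nan_equal = nan == nan
print(nan_less, nan_greater, nan_equal)


def naive_is_better(current, best):
    """Hypothetical 'lower is better' check with no NaN handling."""
    return current < best


# Assume NaN has been stored as the best value after a bad epoch.
best = nan

# Later epochs produce real numbers, yet none of them compares as better.
later_losses = [0.7, 0.6]
saved = [naive_is_better(loss, best) for loss in later_losses]
print(saved)
```

Once NaN is the reference value, no later metric can win the comparison, which matches the symptom reported here.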

Expected behavior

I think it would make sense to not save the model if the metric is NaN.
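
One possible way to express this expected behaviour is sketched below. The function name `should_save` and its `mode` parameter are assumptions made for illustration; this is not the actual ModelCheckpoint logic, nor the fix that was eventually merged.

```python
import math


def should_save(current, best, mode="min"):
    """Hypothetical NaN-aware comparison for a checkpoint callback.

    current: metric value from the latest epoch
    best:    best metric seen so far, or None if nothing was saved yet
    mode:    "min" if lower is better, "max" if higher is better
    """
    if math.isnan(current):
        # Never checkpoint on a NaN metric.
        return False
    if best is None or math.isnan(best):
        # Any real number beats "no best yet" or a stale NaN.
        return True
    return current < best if mode == "min" else current > best


# Toy run: the NaN epoch is skipped and later improvements are still saved.
best = None
decisions = []
for loss in [0.9, float("nan"), 0.7, 0.8]:
    save = should_save(loss, best)
    decisions.append(save)
    if save:
        best = loss
print(decisions, best)
```

Handling NaN on both sides of the comparison means a NaN epoch is ignored and a NaN that was already stored as the best value does not block later saves.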

Environment

  • PyTorch Version (e.g., 1.0): 1.2.0
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
@AljoSt AljoSt added bug Something isn't working help wanted Open to be worked on labels Mar 2, 2020
@github-actions
Contributor

github-actions bot commented Mar 2, 2020

Hi! thanks for your contribution!, great first issue!

@awaelchli
Contributor

perhaps the training should stop altogether. if the loss is nan, it will propagate to the weights and the next steps/epochs are useless anyway. Lightning could stop the training/test/val loop and warn the user immediately.
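
A minimal sketch of this suggestion, using plain Python floats instead of tensors or any Lightning API. The names `NonFiniteLossError`, `check_loss`, and `train` are hypothetical, and the list of losses stands in for a real training loop.

```python
import math
import warnings


class NonFiniteLossError(RuntimeError):
    """Hypothetical error raised when the loss is NaN or infinite."""


def check_loss(loss, step):
    """Raise if the loss is not a finite number; otherwise return it."""
    if not math.isfinite(loss):
        raise NonFiniteLossError(f"loss is {loss} at step {step}; stopping training")
    return loss


def train(losses):
    """Toy loop over precomputed losses; returns the number of completed steps."""
    completed = 0
    for step, loss in enumerate(losses):
        try:
            check_loss(loss, step)
        except NonFiniteLossError as err:
            # Warn the user immediately and stop the loop.
            warnings.warn(str(err))
            break
        completed += 1
    return completed


steps_done = train([0.9, 0.8, float("nan"), 0.7])
print(steps_done)
```

Using `math.isfinite` also catches infinite losses, which usually signal the same kind of divergence as NaN.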

@AljoSt
Author

AljoSt commented Mar 3, 2020

I don't know how, but my model returned to non-NaN values (see screenshot). I suspect that apex may have played a role in this...

@williamFalcon
Contributor

@awaelchli i like that idea!
mind submitting a PR?

@williamFalcon williamFalcon added feature Is an improvement or enhancement and removed bug Something isn't working labels Mar 6, 2020
@williamFalcon williamFalcon added this to the 0.7.1 milestone Mar 6, 2020
@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021