make metric-comparison in ModelCheckpoint robust to NaN #1008

AljoSt · 2020-03-02T14:43:40Z

🐛 Bug

When the metric monitored by ModelCheckpoint becomes NaN in one epoch and then returns to a finite number in a following epoch, the model is no longer saved, because ordering comparisons against NaN always return False.
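To illustrate the comparison behaviour described above, here is a minimal sketch in plain Python. `NaiveBestTracker` is a hypothetical stand-in, not Lightning's actual ModelCheckpoint code, and it assumes the NaN value ends up stored as the reference "best" value (here simply because it arrives first):

```python
class NaiveBestTracker:
    """Hypothetical, simplified 'save if better' logic (mode='min')."""

    def __init__(self):
        self.best = None  # no reference value yet

    def should_save(self, value):
        # The first value is accepted unconditionally, even if it is NaN.
        if self.best is None or value < self.best:
            self.best = value
            return True
        return False


nan = float("nan")

# Ordering and equality comparisons against NaN are all False.
print(0.5 < nan, 0.5 > nan, nan == nan)  # False False False

tracker = NaiveBestTracker()
decisions = [tracker.should_save(v) for v in [nan, 0.8, 0.5]]
print(decisions)  # [True, False, False]
```

Once NaN is the stored reference, no later value can compare as better, so nothing is saved again.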

Here is a screenshot from my training (image not included):

Expected behavior

I think it would make sense to not save the model if the metric is NaN.
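One way the expected behaviour could look, as a rough sketch only: check for NaN explicitly before comparing, and never let NaN become the stored reference. `NanAwareBestTracker` and its `mode` argument are hypothetical names for illustration, not part of Lightning's API:

```python
import math


class NanAwareBestTracker:
    """Hypothetical 'save if better' logic that skips NaN metrics."""

    def __init__(self, mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        self.best = None  # best finite-or-inf value seen so far

    def should_save(self, value):
        value = float(value)
        # Never save on NaN, and never store NaN as the reference value.
        if math.isnan(value):
            return False
        if self.best is None:
            improved = True
        elif self.mode == "min":
            improved = value < self.best
        else:
            improved = value > self.best
        if improved:
            self.best = value
        return improved


tracker = NanAwareBestTracker(mode="min")
decisions = [tracker.should_save(v) for v in [0.9, float("nan"), 0.5, 0.7]]
print(decisions)  # [True, False, True, False]
```

The NaN epoch is skipped, and the following epoch with 0.5 is still recognised as an improvement over 0.9.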

Environment

PyTorch Version (e.g., 1.0): 1.2.0
OS (e.g., Linux): Ubuntu 18.04
How you installed PyTorch (conda, pip, source): pip
Python version: 3.7
CUDA/cuDNN version: 10.1


github-actions · 2020-03-02T14:44:15Z

Hi! Thanks for your contribution! Great first issue!

awaelchli · 2020-03-02T22:56:52Z

Perhaps the training should stop altogether. If the loss is NaN, it will propagate to the weights, and the next steps/epochs are useless anyway. Lightning could stop the training/test/val loop and warn the user immediately.
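A rough sketch of that idea in plain Python, assuming per-step losses are available as floats. `run_epochs` is a hypothetical helper for illustration; it is not how Lightning implements this, and it only inspects the loss, not the weights:

```python
import math
import warnings


def run_epochs(losses_per_epoch):
    """Walk through per-step losses and stop at the first non-finite one.

    Returns the number of epochs that completed without a NaN/inf loss.
    """
    completed = 0
    for epoch, step_losses in enumerate(losses_per_epoch):
        for step, loss in enumerate(step_losses):
            # math.isfinite is False for NaN and also for +/-inf.
            if not math.isfinite(loss):
                warnings.warn(
                    f"Non-finite loss {loss} at epoch {epoch}, step {step}; "
                    "stopping training."
                )
                return completed
        completed += 1
    return completed


# The second epoch hits NaN, so only the first epoch counts as completed.
print(run_epochs([[0.9, 0.8], [0.7, float("nan")], [0.6]]))  # 1
```

Stopping immediately trades away any chance of recovery (as reported later in this thread) for a clear, early signal to the user.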

AljoSt · 2020-03-03T14:52:59Z

I don't know how, but my model returned to non-NaN values (see screenshot). I suspect that apex may have played a role in this...

williamFalcon · 2020-03-06T23:56:45Z

@awaelchli I like that idea!
Mind submitting a PR?

AljoSt added the "bug" (Something isn't working) and "help wanted" (Open to be worked on) labels Mar 2, 2020

williamFalcon added the "feature" (Is an improvement or enhancement) label and removed the "bug" (Something isn't working) label Mar 6, 2020

williamFalcon added this to the 0.7.1 milestone Mar 6, 2020

awaelchli mentioned this issue Mar 8, 2020

nan detection and intervention #1097

Merged

5 tasks

williamFalcon closed this as completed in #1097 Mar 19, 2020

ehsanmok mentioned this issue Jul 17, 2020

nan metric breaking ModelCheckpoint #2636

Closed

Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021
