🐛 Bug
When the metric that is used in `ModelCheckpoint` reaches `NaN` in one epoch and then returns to a number in the following epoch, the model will not be saved, as comparisons to `NaN` always return `False`.
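For reference, a minimal Python snippet illustrating the comparison problem (the variable names are only illustrative, not the actual `ModelCheckpoint` bookkeeping):

```python
import math

best_score = float("nan")  # the tracked "best" value after a diverged epoch
current_score = 0.42       # the metric has recovered in the next epoch

# Every ordered comparison involving NaN evaluates to False, so a
# "did the metric improve?" check can never succeed again.
print(current_score < best_score)  # False
print(current_score > best_score)  # False
print(math.isnan(best_score))      # True -- an explicit check is needed
```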
Here is a screenshot from my training:
Expected behavior
I think it would make sense to not save the model if the metric is `NaN`.
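A possible direction, sketched here as a standalone helper (hypothetical, not the actual `ModelCheckpoint` code), is to handle `NaN` explicitly instead of relying on the ordered comparison:

```python
import math

def is_improvement(current: float, best: float, mode: str = "min") -> bool:
    """Decide whether `current` should replace `best` as the checkpointed value.

    A NaN `current` is never an improvement, and a NaN `best` (e.g. left over
    from a diverged epoch) is replaced by any finite value.
    """
    if math.isnan(current):
        return False
    if math.isnan(best):
        return True
    return current < best if mode == "min" else current > best
```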
Environment
PyTorch Version (e.g., 1.0): 1.2.0
OS (e.g., Linux): Ubuntu 18.04
How you installed PyTorch (conda, pip, source): pip
Python version: 3.7
CUDA/cuDNN version: 10.1
Perhaps the training should stop altogether. If the loss is NaN, it will propagate to the weights, and the subsequent steps/epochs are useless anyway. Lightning could stop the training/val/test loop and warn the user immediately.
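As a user-side workaround along those lines, a custom callback could flag the run as soon as the monitored value stops being finite. This is only a sketch: the callback and the metric name are made up, and hook signatures as well as `trainer.should_stop` behaviour vary between Lightning versions.

```python
import math

import pytorch_lightning as pl


class StopOnNaN(pl.Callback):
    """Hypothetical callback: stop training once the monitored metric is NaN."""

    def __init__(self, monitor: str = "train_loss"):
        self.monitor = monitor

    def on_train_epoch_end(self, trainer, pl_module):
        value = trainer.callback_metrics.get(self.monitor)
        if value is not None and math.isnan(float(value)):
            pl_module.print(f"Metric '{self.monitor}' is NaN; stopping training.")
            trainer.should_stop = True
```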