Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nan detection and intervention #1097

Merged
merged 24 commits into from
Mar 19, 2020
Merged

nan detection and intervention #1097

merged 24 commits into from
Mar 19, 2020

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Mar 8, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1008.

  • Detection of non-finite loss or parameter values
  • The check for nan or inf is done before optimization
  • When such values are detected, the training stops immediately.

Open question: what happens to the print_nan_grads Trainer argument? is it still useful?

  • For now, I removed the print_nan_grads argument and print them whenever nan is detected. Since training stops anyway, it wouldn't make sense to have this arg.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli changed the title nan detection and intervention nan detection and intervention [WIP] Mar 8, 2020
@Borda
Copy link
Member

Borda commented Mar 11, 2020

hey there, we have added GPU CI test, so could we kindly ask to rebase/merge master which will trigger these tests so we do not need to test it manually... Thx for your understanding 🤖

@pep8speaks
Copy link

pep8speaks commented Mar 17, 2020

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-18 12:37:26 UTC

@codecov
Copy link

codecov bot commented Mar 18, 2020

Codecov Report

Merging #1097 into master will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #1097   +/-   ##
======================================
  Coverage      89%     89%           
======================================
  Files          62      62           
  Lines        3164    3173    +9     
======================================
+ Hits         2810    2820   +10     
+ Misses        354     353    -1     

@awaelchli awaelchli changed the title nan detection and intervention [WIP] nan detection and intervention Mar 18, 2020
@awaelchli
Copy link
Contributor Author

I think this is close to finished, just need review and core developers to comment/decide about the open question in the description above. Thanks.

pytorch_lightning/trainer/trainer.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/training_tricks.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/training_tricks.py Outdated Show resolved Hide resolved
tests/test_cpu_models.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/training_tricks.py Show resolved Hide resolved
pytorch_lightning/trainer/training_tricks.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/training_tricks.py Outdated Show resolved Hide resolved
Adrian Wälchli and others added 5 commits March 18, 2020 13:29
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
@Borda Borda added bug Something isn't working feature Is an improvement or enhancement labels Mar 18, 2020
@Borda Borda added this to the 0.7.2 milestone Mar 18, 2020
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀 Thx!

@Borda Borda added the ready PRs ready to be merged label Mar 18, 2020
@williamFalcon williamFalcon merged commit 732eaee into Lightning-AI:master Mar 19, 2020
@awaelchli awaelchli deleted the nan_detection branch April 1, 2020 02:46
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Apr 3, 2020
* check for nan values

* test nan detection on loss

* sys.exit

* whitespace

* detect nan and inf values in loss and params

* update

* added documentation

* moved detect nan to training loop, remove flag for print

* blank line

* test

* rename

* deprecate print_nan_grads

* deprecated print_nan_grads

* remove unused imports

* update changelog

* fix line too long

* correct deprecated version

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* raise exception instead of sysexit

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* raise exception instead of sysexit

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* fix test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature Is an improvement or enhancement ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make metric-comparison in ModelCheckpoint robust to NaN
4 participants