nan detection and intervention #1097
Hey there, we have added a GPU CI test. Could we kindly ask you to rebase onto or merge master? That will trigger these tests, so we do not need to run them manually. Thanks for your understanding 🤖
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-03-18 12:37:26 UTC
Codecov Report
@@          Coverage Diff          @@
##          master   #1097   +/-   ##
=====================================
  Coverage   89%      89%
=====================================
  Files      62       62
  Lines      3164     3173     +9
=====================================
+ Hits       2810     2820     +10
+ Misses     354      353      -1
I think this is close to finished. It just needs review, and the core developers need to comment on or decide the open question in the description above. Thanks.
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
LGTM 🚀 Thx!
* check for nan values
* test nan detection on loss
* sys.exit
* whitespace
* detect nan and inf values in loss and params
* update
* added documentation
* moved detect nan to training loop, remove flag for print
* blank line
* test
* rename
* deprecate print_nan_grads
* deprecated print_nan_grads
* remove unused imports
* update changelog
* fix line too long
* correct deprecated version
  Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* raise exception instead of sysexit
  Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* raise exception instead of sysexit
  Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/training_tricks.py
  Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/training_tricks.py
  Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* fix test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
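The commit log above describes checking the loss and the parameters for nan and inf values and raising an exception instead of calling sys.exit. A minimal, framework-free sketch of that idea follows. It is not the code from this PR: the function name `detect_nan_or_inf`, the choice of `ValueError`, and the use of plain Python floats are all assumptions made for illustration. On real tensors one would more likely use something like `torch.isfinite`.

```python
import math
from typing import Iterable, Mapping


def detect_nan_or_inf(loss: float, params: Mapping[str, Iterable[float]]) -> None:
    """Raise ValueError if the loss or any parameter value is NaN or +/-inf.

    Hypothetical helper: the name, signature, and exception type are
    illustrative assumptions, not the API added by the PR.
    """
    # math.isfinite returns False for NaN, +inf and -inf.
    if not math.isfinite(loss):
        raise ValueError(f"The loss is {loss}; stopping training.")

    # Check every parameter and report the first one that is not finite.
    for name, values in params.items():
        if not all(math.isfinite(v) for v in values):
            raise ValueError(
                f"Parameter '{name}' contains NaN or inf values; stopping training."
            )
```

Raising an exception rather than exiting the process lets the caller decide how to react, for example by logging the offending parameter name before stopping.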
Before submitting
What does this PR do?
Fixes #1008.
Open question: what happens to the print_nan_grads Trainer argument? Is it still useful?
[...] print_nan_grads argument and print them whenever nan is detected. Since training stops anyway, it wouldn't make sense to have this arg.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃