Graceful shutdown on python interpreter exit #1631
Hello @justusschock! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-05-26 20:52:30 UTC
This pull request is now in conflict... :(
Unfortunately this seems to be impossible for SIGKILL: https://mail.python.org/pipermail/python-list/2003-June/206903.html But for SIGTERM, SIGSEGV and SIGINT it should work. If a cluster decides to kill a process, would it use SIGKILL or SIGTERM? Edit: for SLURM, they try a SIGTERM first, followed by a SIGKILL if necessary: https://slurm.schedmd.com/scancel.html
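To make the SIGKILL point concrete, here is a minimal sketch (not code from this PR) that probes which signals Python will accept a handler for. The helper name `can_install_handler` is made up for illustration, and the probe assumes it runs in the main thread on a POSIX system.

```python
import signal


def can_install_handler(signum):
    """Return True if Python accepts a handler for `signum` (hypothetical helper)."""

    def _noop(received_signum, frame):
        pass

    try:
        previous = signal.signal(signum, _noop)
    except (OSError, ValueError, RuntimeError):
        # OSError: the OS refuses the signal (e.g. SIGKILL);
        # ValueError: not in the main thread or invalid signal number.
        return False
    # Put the original handler back so the probe has no lasting effect.
    if previous is not None:
        signal.signal(signum, previous)
    return True


results = {
    name: can_install_handler(getattr(signal, name))
    for name in ("SIGTERM", "SIGINT", "SIGKILL")
    if hasattr(signal, name)
}
print(results)
```

Since SLURM sends SIGTERM before SIGKILL, a SIGTERM handler gets a window to clean up, but nothing can be done once SIGKILL arrives.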
Have you tested this in DDP mode to see if it resolves #965?
@jeremyjordan not explicitly, but I thought that our GPU tests run on DDP?
@Borda any idea why the tests are permanently auto-cancelled here?
Somehow the tests are taking a really long time when they are already finished....
I had something similar when I implemented
What did you do in the end? Delete it?
@justusschock can we rebase master and restart these checks?
/rebase
@@ -299,6 +303,15 @@ def has_arg(self, *args):
        """Warning: this is just empty shell for code implemented in other class."""

    def train(self):
        # add signal handlers for process kills
        def _signal_kill_handler(*args):
won't this interfere with the HPC auto-save signal?
I don't know. I think it shouldn't, because this will be triggered only on python interpreter exit (with SIGTERM).
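One way to reduce the risk of clobbering an existing handler such as the HPC auto-save one is to chain to whatever was installed before. This is only a sketch under that assumption, not what the PR does; `chain_signal_handler` and the `cleanup` callback are hypothetical names.

```python
import signal


def chain_signal_handler(signum, cleanup):
    """Run `cleanup` on `signum`, then defer to the previously installed handler.

    Hypothetical helper: it assumes the earlier handler (e.g. an HPC
    auto-save hook) was registered from Python with `signal.signal`.
    """
    previous = signal.getsignal(signum)

    def _handler(received_signum, frame):
        cleanup()
        # SIG_DFL / SIG_IGN / None are not callable, so only real Python
        # handlers are forwarded to.
        if callable(previous):
            previous(received_signum, frame)

    signal.signal(signum, _handler)
    return _handler
```

Note that if the previous handler was the default action, this sketch does not terminate the process by itself; the cleanup callback would have to exit explicitly.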
@justusschock do you think we can finish it this week?
@Borda @awaelchli these tests are timing out. Did we change something?
Could it be that the shutdown code added here has side effects on the test suite? Maybe the tests are waiting on processes that are stuck and never finish?
@williamFalcon they timed out straight from the beginning :D @awaelchli That's what I'm afraid of. If so, we can't really test this, and we also can't use our CI anymore with this feature...
Yes, that is the case here, but it is very hard to reproduce locally. I had something similar when I implemented simple
Co-Authored-By: Jirka Borovec <[email protected]>
Codecov Report
@@           Coverage Diff           @@
##           master   #1631   +/-   ##
=======================================
  Coverage      88%     88%
=======================================
  Files          74      74
  Lines        4650    4663    +13
=======================================
+ Hits         4070    4082    +12
- Misses        580     581     +1
This is great :)
When building the docs I see this now:
Any idea if this is serious or can it be ignored in docs?
It is the same as #1999
* Graceful shutdown on python interpreter exit
* Update CHANGELOG.md
* Update training_loop.py
* Update training_loop.py
* Update CHANGELOG.md

Co-Authored-By: Jirka Borovec <[email protected]>

* pep8, move to constant
* Update training_loop.py
* Update training_loop.py
* Update training_loop.py
* pep8, move to constant
* pep8
* timeout

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Jirka <[email protected]>
Before submitting
What does this PR do?
With this PR, we can register functions to be run on interpreter exit.
Partial fix to #1228
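For readers unfamiliar with the mechanism, registering functions to run on interpreter exit can be sketched with the standard `atexit` module. This is an illustration only, not the code merged here; `register_shutdown_hook`, `run_shutdown_hooks` and the `_shutdown_hooks` list are hypothetical names.

```python
import atexit

_shutdown_hooks = []  # hypothetical registry, not part of the PR


def register_shutdown_hook(func):
    """Queue `func` to run once when the interpreter exits."""
    _shutdown_hooks.append(func)
    return func


def run_shutdown_hooks():
    """Run and clear all queued hooks; calling it again is a no-op."""
    while _shutdown_hooks:
        hook = _shutdown_hooks.pop(0)
        try:
            hook()
        except Exception as exc:
            # Keep going so one failing hook does not block the rest.
            print(f"shutdown hook {hook!r} failed: {exc}")


# atexit calls this on normal interpreter termination.
atexit.register(run_shutdown_hooks)
```

`atexit` callbacks do not run when the process is killed by a signal that Python does not handle, which is why a SIGTERM handler that triggers a normal exit is needed alongside them.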
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃