
Graceful shutdown on python interpreter exit #1631

Merged · 12 commits merged from at_exit into master on May 29, 2020

Conversation

justusschock
Member

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

With this PR we can register functions to be run on interpreter exit.
Partial fix for #1228.
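
For illustration only, a minimal sketch of the general mechanism this PR builds on, Python's atexit hooks (the cleanup function below is hypothetical, not the code added in this PR):

```python
import atexit

def cleanup():
    # hypothetical teardown work, e.g. closing loggers or joining worker processes
    print("running cleanup on interpreter exit")

# registered callbacks run when the interpreter exits normally
atexit.register(cleanup)
```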

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Apr 27, 2020

Hello @justusschock! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-26 20:52:30 UTC

CHANGELOG.md (resolved)
pytorch_lightning/trainer/training_loop.py (outdated, resolved)
pytorch_lightning/trainer/training_loop.py (outdated, resolved)
@mergify mergify bot requested a review from a team April 27, 2020 09:23
@justusschock justusschock mentioned this pull request Apr 27, 2020
@mergify
Contributor

mergify bot commented Apr 27, 2020

This pull request is now in conflict... :(

@justusschock
Member Author

justusschock commented Apr 27, 2020

Unfortunately this seems to be impossible for SIGKILL: https://mail.python.org/pipermail/python-list/2003-June/206903.html

But for SIGTERM, SIGSEGV, and SIGINT it should work. If a cluster decides to kill a process, would it use SIGKILL or SIGTERM?

Edit: for SLURM, they try SIGTERM first, followed by SIGKILL if necessary: https://slurm.schedmd.com/scancel.html
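
As a hedged sketch of the approach discussed above (handler names are illustrative, not the PR's actual implementation): SIGTERM and SIGINT can be converted into a normal interpreter exit so that registered atexit hooks still run, while SIGKILL can never be intercepted:

```python
import atexit
import signal
import sys

def teardown():
    # hypothetical graceful-shutdown work
    print("graceful shutdown")

def _exit_on_signal(signum, frame):
    # converting the signal into a normal exit lets atexit callbacks run
    sys.exit(0)

atexit.register(teardown)
for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, _exit_on_signal)
# SIGKILL cannot be caught, blocked, or ignored, so no Python-level cleanup can run for it.
```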

@mergify
Contributor

mergify bot commented Apr 27, 2020

This pull request is now in conflict... :(

@jeremyjordan
Contributor

Have you tested this in DDP mode to see if it resolves #965?

@justusschock
Member Author

@jeremyjordan not explicitly, but I thought that our GPU tests run on DDP?

@justusschock
Member Author

@Borda Any idea why tests are permanently auto-cancelled here?

@justusschock
Member Author

Somehow the tests are reported as taking a really long time even though they have already finished...

@Borda
Member

Borda commented Apr 28, 2020

Any idea why tests are permanently auto-cancelled here?

I had something similar when I implemented __del__ in the logger: it started hanging, as if pytest were not able to properly release resources, but running the same tests locally was fine...
I contacted support about this issue but didn't get much advice...

@justusschock
Member Author

What did you do in the end? Delete it?

@mergify mergify bot requested a review from a team April 29, 2020 00:22
@Borda Borda added the feature (Is an improvement or enhancement) label Apr 30, 2020
@Borda Borda added this to the 0.7.6 milestone Apr 30, 2020
@williamFalcon
Contributor

@justusschock can we rebase master and restart these checks?

@Borda
Member

Borda commented May 2, 2020

/rebase

@@ -299,6 +303,15 @@ def has_arg(self, *args):
        """Warning: this is just empty shell for code implemented in other class."""

    def train(self):
        # add signal handlers for process kills
        def _signal_kill_handler(*args):
Contributor

won't this interfere with the HPC auto-save signal?

Member Author

I don't know for sure, but I think it shouldn't, because this will only be triggered on Python interpreter exit (with SIGTERM).
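
To illustrate why the two should not clash, a minimal sketch, assuming the HPC auto-save hook listens on a different signal such as SIGUSR1 (an assumption, not stated in this thread; function names are hypothetical):

```python
import signal

def hpc_auto_save_handler(signum, frame):
    # e.g. save a checkpoint and requeue the job
    ...

def signal_kill_handler(signum, frame):
    # graceful teardown before the process exits
    ...

# Handlers are registered per signal number, so a teardown handler on SIGTERM
# does not replace or interfere with an auto-save handler on another signal.
signal.signal(signal.SIGUSR1, hpc_auto_save_handler)
signal.signal(signal.SIGTERM, signal_kill_handler)
```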

@mergify mergify bot requested a review from a team May 10, 2020 21:07
@Borda
Member

Borda commented May 11, 2020

@justusschock do you think we can finish it this week?

@mergify
Contributor

mergify bot commented May 12, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team May 17, 2020 12:38
@williamFalcon
Contributor

@Borda @awaelchli these tests are timing out. Did we change something?

@awaelchli
Contributor

awaelchli commented May 17, 2020

Could it be that the shutdown code added here has side effects on the test suite? Maybe tests are waiting on some processes to finish that are stuck?

@justusschock
Member Author

@williamFalcon they timed out right from the beginning :D

@awaelchli That's what I'm afraid of. That way we can't really test this, and we also cannot use our CI anymore with this feature...

@Borda
Member

Borda commented May 18, 2020

Could it be that the shutdown code added here has side effects on the test suite? Maybe tests are waiting on some processes to finish that are stuck?

Yes, that is the case, but it is very hard to reproduce locally. I had something similar when I implemented a simple __del__ in the logger and it started hanging on CI while passing locally.

@Borda Borda modified the milestones: 0.7.7, 0.8.0 May 26, 2020
@codecov

codecov bot commented May 26, 2020

Codecov Report

Merging #1631 into master will increase coverage by 0%.
The diff coverage is 94%.

@@          Coverage Diff           @@
##           master   #1631   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          74      74           
  Lines        4650    4663   +13     
======================================
+ Hits         4070    4082   +12     
- Misses        580     581    +1     

Contributor

@awaelchli awaelchli left a comment

This is great :)

@Borda Borda merged commit ceecf1c into master May 29, 2020
@Borda Borda deleted the at_exit branch May 29, 2020 14:20
@awaelchli
Contributor

When building the docs I see this now:

Error in atexit._run_exitfuncs:
TypeError: run_training_teardown() missing 1 required positional argument: 'self'

Any idea if this is serious or can it be ignored in docs?

@Borda
Member

Borda commented May 30, 2020

It is the same as #1999
This happens when you call run_training_teardown() on the class rather than on an instance...
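
A minimal standalone reproduction of that class of error (not the actual Lightning code): registering the method on the class leaves self unbound, while registering it on an instance works as intended.

```python
import atexit

class Trainer:
    def run_training_teardown(self):
        print("teardown")

# Registered on the class: atexit later calls run_training_teardown() with no
# arguments, raising "missing 1 required positional argument: 'self'".
atexit.register(Trainer.run_training_teardown)

# Registered on an instance: the bound method carries `self`, so it runs cleanly.
trainer = Trainer()
atexit.register(trainer.run_training_teardown)
```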

justusschock added a commit that referenced this pull request Jun 29, 2020
* Graceful shutdown on python interpreter exit

* Update CHANGELOG.md

* Update training_loop.py

* Update training_loop.py

* Update CHANGELOG.md

Co-Authored-By: Jirka Borovec <[email protected]>

* pep8, move to constant

* Update training_loop.py

* Update training_loop.py

* Update training_loop.py

* pep8, move to constant

* pep8

* timeout

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Jirka <[email protected]>
Labels: feature (Is an improvement or enhancement), ready (PRs ready to be merged)
Projects: None yet

6 participants