-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bugfix] Add mechanism to prevent deadlock for DDP on Exception Trigger #8167
Conversation
Hello @tchaton! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-06-28 17:45:42 UTC |
for more information, see https://pre-commit.ci
Codecov Report
@@ Coverage Diff @@
## master #8167 +/- ##
=======================================
- Coverage 93% 93% -0%
=======================================
Files 211 211
Lines 13440 13607 +167
=======================================
+ Hits 12450 12602 +152
- Misses 990 1005 +15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@tchaton thanks for working on this. |
Everything has been moved to the plugin. Just wanted to confirm the logic :) Best, |
pytorch_lightning/plugins/training_type/training_type_plugin.py
Outdated
Show resolved
Hide resolved
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM just unsure about how the test works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting approach @tchaton !
How does this work if the trainer is running with torchelastic? and as @ethanwharris points out, what happens in the multinode case?
Hey @ananthsub. Thanks for checking this out ! In the case of Best, |
Co-authored-by: Jirka Borovec <[email protected]>
for pid in self._pids: | ||
if pid != os.getpid(): | ||
os.kill(pid, signal.SIGKILL) | ||
shutil.rmtree(sync_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this clean-up command be part of __del__
so we make sure it always gets removed? Or have it run in acceleratotr.on_train_end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we don't want to kill the processes.
the ddp processes must survive until the end of the script.
that is, unless there was an exception and we want to exit the program, which is what @tchaton tries to handle here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm referring specifically to the line
shutil.rmtree(sync_dir)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @carmocca,
Yes, this will make it cleaner.
…er (#8167) * add mechanism to prevent deadlock * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 + update changelog * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * remove space * resolve bugs * overwrite config * update on comments * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * update * update test with comments * Update pytorch_lightning/plugins/training_type/parallel.py Co-authored-by: Jirka Borovec <[email protected]> * update on comments Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Jirka Borovec <[email protected]>
…er (#8167) * add mechanism to prevent deadlock * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 + update changelog * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * remove space * resolve bugs * overwrite config * update on comments * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * update * update test with comments * Update pytorch_lightning/plugins/training_type/parallel.py Co-authored-by: Jirka Borovec <[email protected]> * update on comments Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Jirka Borovec <[email protected]>
…er (#8167) * add mechanism to prevent deadlock * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 + update changelog * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * remove space * resolve bugs * overwrite config * update on comments * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * update * update test with comments * Update pytorch_lightning/plugins/training_type/parallel.py Co-authored-by: Jirka Borovec <[email protected]> * update on comments Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Jirka Borovec <[email protected]>
…er (#8167) * add mechanism to prevent deadlock * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 + update changelog * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * remove space * resolve bugs * overwrite config * update on comments * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * update * update test with comments * Update pytorch_lightning/plugins/training_type/parallel.py Co-authored-by: Jirka Borovec <[email protected]> * update on comments Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Jirka Borovec <[email protected]>
…er (#8167) * add mechanism to prevent deadlock * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 + update changelog * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * remove space * resolve bugs * overwrite config * update on comments * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * update * update test with comments * Update pytorch_lightning/plugins/training_type/parallel.py Co-authored-by: Jirka Borovec <[email protected]> * update on comments Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Jirka Borovec <[email protected]>
What does this PR do?
This PR introduces a mechanism to prevent deadlock.
Mechanism:
On failure for a given process:
Other approaches tried:
Fixes #8130
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃