Torch Elastic DDP DeadLock bug fix #8655
for more information, see https://pre-commit.ci
Codecov Report
@@           Coverage Diff           @@
##           master   #8655    +/-   ##
=======================================
- Coverage      93%     89%     -4%
=======================================
  Files         168     167      -1
  Lines       13984   13970     -14
=======================================
- Hits        12948   12397    -551
- Misses       1036    1573    +537
Hey @ananthsub. I will soon add a multi-node test that runs on Grid with Torch Elastic, so that our Torch Elastic support is better covered.
I think it is more important to get this in, even if there is no test at the moment.
Co-authored-by: Carlos Mocholí <[email protected]>
What does this PR do?
This PR changes the deadlock-prevention mechanism so that it better supports multi-node runs in which the processes are created by the cluster.
This works.
TorchElastic detects rank 1 as terminated, as expected. As a result, rank 0 is recorded as succeeded and rank 1 as failed, so the run is declared failed. Depending on the use case, this may be the wrong behaviour.
TorchElastic should expose a kill function so that it can register that a process was killed manually.
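The general pattern behind this kind of deadlock prevention can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR and not a TorchElastic API: `run_ranks` and `run_succeeded` are hypothetical helpers, and the two child commands merely simulate one rank that fails and one rank that would otherwise block forever (for example in a collective call). It also shows why a supervisor that only looks at exit codes reports the whole run as failed once a peer has been killed.

```python
import subprocess
import sys
import time


def run_ranks(commands, poll_interval=0.05):
    """Hypothetical supervisor: start one process per rank and, as soon as
    any rank exits with a non-zero code, kill the ranks still running.

    Returns the list of exit codes, one per rank.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            if all(code is not None for code in codes):
                break  # every rank has exited on its own
            if any(code not in (None, 0) for code in codes):
                # A rank failed. Its peers may be waiting on it forever,
                # so kill every process that is still alive.
                for p in procs:
                    if p.poll() is None:
                        p.kill()
                break
            time.sleep(poll_interval)
    finally:
        # Reap all children so that returncode is set for each of them.
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


def run_succeeded(exit_codes):
    """A supervisor that only inspects exit codes treats any non-zero
    code, including that of a deliberately killed rank, as a failure."""
    return all(code == 0 for code in exit_codes)


if __name__ == "__main__":
    # Simulated ranks (assumptions for the demo): rank 0 fails at once,
    # rank 1 would hang for a long time if nobody killed it.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    hanging = [sys.executable, "-c", "import time; time.sleep(60)"]
    codes = run_ranks([failing, hanging])
    print("exit codes:", codes, "run succeeded:", run_succeeded(codes))
```

Because the killed rank exits with a non-zero code, an exit-code-only check cannot tell a deliberate kill apart from a crash, which is the motivation for asking the launcher to expose an explicit kill function.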
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃