-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] ref: decoupled ddp, ddp spawn (finish 3733) #3819
Conversation
output = self.training_step(args) | ||
return output | ||
|
||
def barrier(self, name: str = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def barrier(self, name: str = None): | |
def barrier(self, name: Optional[str] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion. i'll change in a separate PR since this affects every accelerator
Hello @williamFalcon! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-03 17:26:39 UTC |
Co-authored-by: ananthsub <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #3819 +/- ##
=======================================
+ Coverage 84% 87% +3%
=======================================
Files 111 111
Lines 8783 8947 +164
=======================================
+ Hits 7365 7784 +419
+ Misses 1418 1163 -255 |
Ok... turned out to fix many problems:
remove a bunch of rank_zero decorators.
We MUST make sure we don't create information asymmetry in children processes. State must be EXACTLY the same. Rank zero is there to not write to disk and that is only where it should be used.
Solved all the issues with DDP about setting VISIBLE_DEVICES and gpus indexing not working. Also solves the issues around calling fit or test or both.
Also solved the issue of how exactly to test ddp since we call other scripts. Overall we should have no regressions going forward because of these tests.
Stability of ddp has been improved dramatically!
Fixes #3422
Fixes #3117
Fixes #3730
Fixes #3605
Fixes #3403
Fixes #2826
Fixes #3071
Fixes #3023
Fixes #2913
Fixes #2590
Fixes #2529
cc @edenafek to add other tickets this fixes
Edit
PR is too big, splitting in parts.
Part 1: #3766
part 2: #3767
part 3: #3770
part 4: #3773
part 5: #3774
part 6: #3783
part 7: #3802
part 8: #3806
TODO:
ie: need
for x in range(5):
trainer.fit()