
[WIP] ref: decoupled ddp, ddp spawn (finish 3733) #3819

Merged (18 commits) on Oct 3, 2020
Conversation

@williamFalcon (Contributor) commented on Oct 3, 2020

Ok... this turned out to fix many problems:

  1. Removed a bunch of rank_zero decorators.
     We MUST make sure we don't create information asymmetry in child processes. State must be EXACTLY the same on every rank. The rank-zero guard exists only so that a single process writes to disk, and that is the only place it should be used.

  2. Solved all the ddp issues with setting VISIBLE_DEVICES and gpus indexing not working, as well as the issues around calling fit, test, or both.

  3. Also solved the question of how exactly to test ddp, since we call other scripts. With these tests in place we should have no ddp regressions going forward.

Stability of ddp has been improved dramatically!
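The rank-zero point in item 1 can be illustrated with a minimal, framework-free sketch. The decorator name mirrors Lightning's rank_zero_only, but this simulated version reads the rank from an environment variable and is an illustration only, not the library's implementation:

```python
import os


def rank_zero_only(fn):
    """Run fn only on (simulated) rank 0.

    Safe for side effects such as writing checkpoints or logs to disk.
    It must NOT wrap anything that mutates training state, or the
    child processes end up with different state (information asymmetry).
    """
    def wrapped(*args, **kwargs):
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None  # non-zero ranks skip the disk write entirely
    return wrapped


@rank_zero_only
def save_checkpoint(path):
    # stand-in for an actual torch.save(...) call
    return f"wrote {path}"
```

On rank 0, save_checkpoint("last.ckpt") performs the write; on any other rank it returns None and touches nothing, while all state-changing work stays identical across ranks.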

Fixes #3422
Fixes #3117
Fixes #3730
Fixes #3605
Fixes #3403
Fixes #2826
Fixes #3071
Fixes #3023
Fixes #2913
Fixes #2590
Fixes #2529

cc @edenafek to add other tickets this fixes

Edit

PR is too big, splitting into parts.

Part 1: #3766
Part 2: #3767
Part 3: #3770
Part 4: #3773
Part 5: #3774
Part 6: #3783
Part 7: #3802
Part 8: #3806

TODO:

  • decouple slurm + TE from each backend
  • Remove the limitation of only being able to run the script once...
    i.e. we need:

    for x in range(5):
        trainer.fit()

@mergify mergify bot requested a review from a team October 3, 2020 16:27
output = self.training_step(args)
return output

def barrier(self, name: str = None):
Contributor
Suggested change
def barrier(self, name: str = None):
def barrier(self, name: Optional[str] = None):

Contributor Author

Good suggestion. I'll change it in a separate PR since this affects every accelerator.
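For context on the suggestion above, a short sketch of why Optional[str] is the correct annotation when the default is None. The body here is illustrative only; the real accelerator method synchronizes processes and returns nothing:

```python
from typing import Optional


# `name: str = None` is wrong under strict type checking: the default
# value None is not a str. `Optional[str]` (i.e. Union[str, None])
# states the intent explicitly.
def barrier(name: Optional[str] = None) -> str:
    # illustrative stand-in body, not the accelerator's real barrier
    return f"barrier({name!r})"
```

Mypy in strict mode rejects the `str = None` form, which is why the suggested change applies to every accelerator's barrier signature.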

@pep8speaks

pep8speaks commented Oct 3, 2020

Hello @williamFalcon! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-03 17:26:39 UTC

@codecov

codecov bot commented Oct 3, 2020

Codecov Report

Merging #3819 into master will increase coverage by 3%.
The diff coverage is 31%.

@@           Coverage Diff           @@
##           master   #3819    +/-   ##
=======================================
+ Coverage      84%     87%    +3%     
=======================================
  Files         111     111            
  Lines        8783    8947   +164     
=======================================
+ Hits         7365    7784   +419     
+ Misses       1418    1163   -255     

@williamFalcon williamFalcon merged commit 35d1111 into master Oct 3, 2020
@williamFalcon williamFalcon deleted the ddp5 branch October 4, 2020 17:37
@Borda Borda added the bug Something isn't working label Oct 4, 2020