
Skips DDP parameter sync #4301

Merged: 9 commits merged into master from ddp_no_sync on Oct 29, 2020
Conversation

@justusschock (Member) commented Oct 22, 2020

What does this PR do?

Skips DDP parameter sync in forward and backward whenever possible.

Fixes #4092 and fixes #2595

This is a revamp of #4146 after adding optimizer closures.
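
For context, the saving comes from PyTorch's DistributedDataParallel.no_sync() context manager, which suppresses the gradient all-reduce during backward. The sketch below only illustrates that mechanism during gradient accumulation; the function and variable names are placeholders, not this PR's code, and it assumes the wrapped module's forward returns a loss-like scalar tensor:

from torch.nn.parallel import DistributedDataParallel

def accumulate_then_step(ddp_model: DistributedDataParallel, optimizer, batches, accumulate_grad_batches: int = 4):
    # Skip DDP's gradient all-reduce on intermediate accumulation steps;
    # only the backward pass right before optimizer.step() synchronizes.
    for i, batch in enumerate(batches):
        last_in_window = (i + 1) % accumulate_grad_batches == 0
        if not last_in_window:
            with ddp_model.no_sync():  # gradients accumulate locally, no communication
                ddp_model(batch).sum().backward()
        else:
            ddp_model(batch).sum().backward()  # this backward triggers the all-reduce
            optimizer.step()
            optimizer.zero_grad()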

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@justusschock justusschock added the feature Is an improvement or enhancement label Oct 22, 2020
@justusschock justusschock self-assigned this Oct 22, 2020
@pep8speaks commented Oct 22, 2020

Hello @justusschock! Thanks for updating this PR.

Line 688:121: E501 line too long (121 > 120 characters)

Comment last updated at 2020-10-29 17:04:11 UTC

@mergify mergify bot requested a review from a team October 22, 2020 09:21
@codecov bot commented Oct 22, 2020

Codecov Report

Merging #4301 into master will increase coverage by 2%.
The diff coverage is 90%.

@@           Coverage Diff           @@
##           master   #4301    +/-   ##
=======================================
+ Coverage      91%     93%    +2%     
=======================================
  Files         113     111     -2     
  Lines        8301    8134   -167     
=======================================
+ Hits         7527    7553    +26     
+ Misses        774     581   -193     

@williamFalcon (Contributor) commented:

cc @ananthsub for review

Comment on lines 684 to 694

# no ddp sync at the beginning of forward or backward due to parameter changes in this or last step required
no_sync = self._updated_model_last_step and isinstance(self.trainer.model, torch.nn.parallel.DistributedDataParallel)
if no_sync:
    self.trainer.model.no_sync.__enter__()

self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens)

if no_sync:
    self.trainer.model.__exit__()

Contributor:

is it possible for this logic to live in the DDP accelerators as opposed to the general training loop?

@@ -16,6 +16,7 @@
from copy import copy, deepcopy

import numpy as np
from numpy.lib.arraysetops import isin
Contributor:

where's this used?

@justusschock (Member, Author):

it's not. It was added by my IDE as an auto-import while typing isinstance :D
Removed it.

@awaelchli (Contributor) commented Oct 26, 2020

This is implementing the same as requested in #2595, right?

@ananthsub (Contributor) left a comment:

Besides my small comments, the PR looks good to me. The accelerator part is my only significant question. If it's not immediately obvious how we can move this check to the DDP accelerators, then let's land this to reap the perf savings and figure out the refactor later. I really like that the training loop now barely has any references to PyTorch specifics.

@justusschock (Member, Author):

@awaelchli yes, I hadn't seen that one.

pytorch_lightning/trainer/training_loop.py (outdated review thread; resolved)
@tchaton tchaton self-requested a review October 27, 2020 09:44
Comment on lines +688 to +691
# perform dpp sync only when performing optimizer_step
with self.block_ddp_sync_behaviour():
    self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens)

@ananthsub (Contributor) commented Oct 28, 2020:

do we need block_ddp_sync_behaviour?

Suggested change
- # perform dpp sync only when performing optimizer_step
- with self.block_ddp_sync_behaviour():
-     self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens)
+ # perform dpp sync only when performing optimizer_step
+ if isinstance(self.trainer.model, torch.nn.parallel.DistributedDataParallel):
+     with self.trainer.model.no_sync():
+         self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens)
+ else:
+     self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens)

Contributor:

Just to make it more explicit for our readers :)

@justusschock (Member, Author) commented Oct 29, 2020:

@ananthsub @tchaton requested this to make it more readable and to hide the conditions inside the context manager.
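
For readers following along, such a helper is essentially a thin conditional wrapper around no_sync(). A minimal standalone sketch, illustrative only and not Lightning's actual implementation (the free-function form and the trainer argument are assumptions):

from contextlib import contextmanager

import torch

@contextmanager
def block_ddp_sync_behaviour(trainer):
    # Enter DDP's no_sync() only when the model is actually DDP-wrapped;
    # otherwise yield as a no-op so the caller never needs the isinstance check.
    if isinstance(trainer.model, torch.nn.parallel.DistributedDataParallel):
        with trainer.model.no_sync():
            yield
    else:
        yield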

@tchaton (Contributor) commented Oct 29, 2020:

@ananthsub
The training and evaluation loops should look absolutely perfect and simple to understand.
As a new coder, I should recognise my training loop. As a new coder, I have no knowledge about DDP, and your suggested change would have confused me :)

One great example from this PR is: not (accumulation_done or is_final_batch) -> should_accumulate. It is pretty simple, but it makes a clear statement about what is happening. We should try to enforce those patterns as much as possible :)

I hope it makes sense :)
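
The should_accumulate pattern mentioned above amounts to binding the inline condition to an intention-revealing name. A minimal illustration; the standalone function form is ours, not the PR's code:

def should_accumulate(accumulation_done: bool, is_final_batch: bool) -> bool:
    # Same boolean as `not (accumulation_done or is_final_batch)`, but the call
    # site now reads as a statement of intent rather than a condition to decode.
    return not (accumulation_done or is_final_batch)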

@tchaton (Contributor) left a comment:

Great catch!

@@ -46,6 +46,7 @@ def __init__(self, trainer):
self.automatic_optimization = True
self._curr_step_result = None
self._cur_grad_norm_dict = None
self._updated_model_last_step = False
Contributor:

Can you please remove _updated_model_last_step? You don't use it anymore :)

@tchaton (Contributor) left a comment:

Great addition!

@s-rog (Contributor) left a comment:

Awesome! Should be a nice performance boost!

@rohitgr7 rohitgr7 merged commit bbd81df into master Oct 29, 2020
@rohitgr7 rohitgr7 deleted the ddp_no_sync branch October 29, 2020 17:36
@edenlightning edenlightning added this to the 1.0.x milestone Nov 4, 2020
Borda pushed a commit that referenced this pull request Nov 4, 2020
* ddp no-sync

* Update pytorch_lightning/trainer/training_loop.py

Co-authored-by: ananthsub <[email protected]>

* Update training_loop.py

* factor __enter__ and __exit__ out to separate context manager

* delete _updated_model_last_step

Co-authored-by: justusschock <[email protected]>
Co-authored-by: Teddy Koker <[email protected]>
Co-authored-by: ananthsub <[email protected]>
Co-authored-by: chaton <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
(cherry picked from commit bbd81df)

Successfully merging this pull request may close these issues:

  • Avoid unnecessary DDP synchronization when gradient_accumulation_steps > 1
  • ddp no_sync