Move block_backward_sync from ParallelPlugin to DDPPlugins #9101
Conversation
Force-pushed from 1c97633 to 0f6ddf7, then from 0f6ddf7 to cc6284a.
Codecov Report

@@           Coverage Diff            @@
##           master    #9101    +/-   ##
========================================
- Coverage      92%      88%     -4%
========================================
  Files         176      176
  Lines       14663    14670      +7
========================================
- Hits        13496    12892    -604
- Misses       1167     1778    +611
@contextmanager
def block_backward_sync(self):
    """
    Blocks ddp sync gradients behaviour on backwards pass.
    This is useful for skipping sync when accumulating gradients, reducing communication overhead
    Returns: context manager with sync behaviour off
    """
    if isinstance(self.model, DistributedDataParallel):
        with self.model.no_sync():
            yield None
    else:
        yield None
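For context, a minimal usage sketch of how a training loop might use this context manager to skip gradient synchronization on accumulation steps; the function and parameter names are illustrative, not the actual Lightning batch-loop API:

```python
# Hypothetical sketch: only all-reduce gradients on the micro-batch that runs the optimizer.
from contextlib import nullcontext

def backward_with_accumulation(plugin, batch_idx, accumulate_grad_batches, run_backward):
    # Sync is only needed on the final micro-batch of each accumulation window.
    is_accumulating = (batch_idx + 1) % accumulate_grad_batches != 0
    ctx = plugin.block_backward_sync() if is_accumulating else nullcontext()
    with ctx:
        run_backward()
```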
n00b question: do you think this ought to be its own mini interface to represent this trait?
class PluginWithBlockBackwardSync(ABC):
    @contextmanager
    @abstractmethod
    def block_backward_sync(self) -> Generator:
        ...
this way, we only need to check isinstance(self.trainer.training_type_plugin, PluginWithBlockBackwardSync) in the training batch loop.
otherwise i'm not sure how the isinstance check would work for custom plugins that require this
(pls ignore the verbose naming)
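For illustration, a self-contained version of that sketch plus how the batch-loop check could look; the class name follows the comment above and is not an existing Lightning API:

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Generator

class PluginWithBlockBackwardSync(ABC):
    """Trait for training-type plugins that can temporarily disable gradient sync."""

    @contextmanager
    @abstractmethod
    def block_backward_sync(self) -> Generator:
        """Yield with backward-pass gradient synchronization turned off."""
        ...

# In the training batch loop, the check becomes a capability check instead of a
# check against concrete plugin classes:
#
#     if isinstance(self.trainer.training_type_plugin, PluginWithBlockBackwardSync):
#         with self.trainer.training_type_plugin.block_backward_sync():
#             ...
```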
Yeah, that's a good point, if a custom plugin needs this.
@tchaton @awaelchli @justusschock what do you think?
Yes, I think we can explore making plugins more composable too.
@ananthsub I agree that we should explore this. My only concern in this direction (we had/have something similar for the trainer and module) is that sometimes it becomes hard to track what is implemented where (especially when debugging), which is why at some point we decided to avoid patterns like this.
I still think, though, that together with good, purely abstract interfaces this should be possible and is likely the best way to tackle this.
Yes, I think it is a balance of good code taste with reliable / general abstractions.
    This is useful for skipping sync when accumulating gradients, reducing communication overhead
    Returns: context manager with sync behaviour off
    """
    if isinstance(self.model, DistributedDataParallel):
fwiw, ShardedDataParallel supports this too, but we're not taking advantage of it now due to this check 😞
https://fairscale.readthedocs.io/en/latest/_modules/fairscale/nn/data_parallel/sharded_ddp.html#ShardedDataParallel.no_sync
this is also masked by the current inheritance structure, as sharded doesn't override this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/plugins/training_type/sharded.py
ideally, splitting up the inheritance like this:

              Parallel
    /       /        \        \
  DDP   Sharded     FSDP   DeepSpeed

will make these opportunities more apparent
fyi @SeanNaren
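To make the point concrete, here is a hedged sketch of a duck-typed variant of the plugin method body that would also pick up ShardedDataParallel's no_sync(); this is one possible shape, not what this PR implements:

```python
from contextlib import contextmanager

@contextmanager
def block_backward_sync(self):
    """Skip gradient sync for any wrapped model that exposes a no_sync() context manager.

    torch's DistributedDataParallel provides no_sync(), and so does fairscale's
    ShardedDataParallel, so this avoids hard-coding a single wrapper class.
    """
    no_sync = getattr(self.model, "no_sync", None)
    if callable(no_sync):
        with no_sync():
            yield None
    else:
        yield None
```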
Hey @ananthsub,
I am a bit concerned that making all plugins subclass directly from Parallel would result in a lot of duplicated code and higher maintenance cost, especially for sharded.

              Parallel
    /       /        \        \
  DDP   Sharded     FSDP   DeepSpeed
Alternatively, we could remove the if isinstance(self.model, DistributedDataParallel): check there.
Removing the check for DDP would affect DeepSpeed and fully sharded.
Regarding code duplication: if we better abstract the subprocess launch / start_processes in the DDP and DDP spawn plugins so that code can be shared, would that address your concern? Are there other parts of the code where you're worried about duplication?
My concern with the inheritance we have now is that things can be silently not called.
@tchaton especially with FSDP and DeepSpeed, checkpoint loading and saving are so different from DDP and sharded
IMO, I would prefer to avoid it, but it can have some pros too as you shared there.
I am not against FB trying to PoC a refactor.
@justusschock do you agree with this, as you designed it based on inheritance?
@tchaton I designed it based on inheritance to avoid code duplication. However, as we get more and more different kinds of plugins, I think it could make sense to split them out into minimal mixins (like the one @ananthsub shared above) and then make the actual plugin inherit them.
I know that we decided against mixins, but I think those mixins together with a purely abstract interface class are the best way to tackle this.
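A rough sketch of that mixin idea (all names here are invented for illustration; the real plugins carry far more responsibilities):

```python
from contextlib import contextmanager

class NoSyncBlockBackwardSyncMixin:
    """Small mixin: block backward sync for any plugin whose self.model has no_sync()."""

    @contextmanager
    def block_backward_sync(self):
        with self.model.no_sync():
            yield None

# Concrete plugins would then opt in explicitly, e.g.:
#
#     class DDPPlugin(ParallelPlugin, NoSyncBlockBackwardSyncMixin):
#         ...
#
# while a purely abstract interface (like PluginWithBlockBackwardSync above) remains
# the thing the training loop checks against.
```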
Sounds good to me! @ananthsub Mind creating an [RFC] for refactoring the Accelerator around base components and tagging the person assigned on your side?
self.trainer.lightning_module.automatic_optimization or should_block_sync
):
if (
    isinstance(self.trainer.training_type_plugin, DDPPlugin)
DDPPlugin or DDPSpawnPlugin, right?
isinstance(self.trainer.training_type_plugin, DDPPlugin)
or isinstance(self.trainer.training_type_plugin, DDPPlugin)
Suggested change:
-    isinstance(self.trainer.training_type_plugin, DDPPlugin)
-    or isinstance(self.trainer.training_type_plugin, DDPPlugin)
+    isinstance(self.trainer.training_type_plugin, (DDPPlugin, DDPSpawnPlugin))
> Alternatively, we could remove the if isinstance(self.model, DistributedDataParallel): check there.
@carmocca @awaelchli, how about this one? 🦦
This has been implemented for Lite in #14966. It'll eventually trickle into the PL interfaces as we merge implementations. cc @awaelchli
What does this PR do?
The parallel plugin should be generic and self-contained. This PR moves the reference to DistributedDataParallel out of the parallel plugin's block_backward_sync and into the DDP plugins.
This is subtask 2 of "remove reference to DistributedDataParallel from parallel plugin"; subtask 1 is #8943.
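In rough outline, the restructuring amounts to the sketch below (simplified stand-in classes; the actual method body is the one shown in the diff above, and the real plugins contain much more logic):

```python
from contextlib import contextmanager

class ParallelPluginSketch:
    """Generic parallel plugin: after this change it no longer references DistributedDataParallel."""

    def __init__(self, model=None):
        self.model = model

class DDPPluginSketch(ParallelPluginSketch):
    """DDP-specific plugin: now owns block_backward_sync."""

    @contextmanager
    def block_backward_sync(self):
        # Here self.model is expected to be wrapped in torch's DistributedDataParallel,
        # whose no_sync() context manager skips the gradient all-reduce on backward.
        with self.model.no_sync():
            yield None
```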
Tests:
python -m pytest -v tests/plugins passed
python -m pytest -v tests/accelerators passed
python -m pytest -v tests/trainer passed
Fixes #<issue_number>
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃