Move block_backward_sync from ParallelPlugin to DDPPlugins #9101

Closed
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -205,6 +205,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed `InterBatchProcessor` in favor of `DataLoaderIterDataFetcher` ([#9052](https://github.com/PyTorchLightning/pytorch-lightning/pull/9052))


- Removed `block_backward_sync` from `ParallelPlugin` and added to `DDPPlugin` and `DDPSpawnPlugin` ([#9101](https://github.com/PyTorchLightning/pytorch-lightning/pull/9101))


### Fixed

- Fixed save/load/resume from checkpoint for DeepSpeed Plugin (
9 changes: 5 additions & 4 deletions pytorch_lightning/loops/batch/training_batch_loop.py
@@ -31,7 +31,7 @@
    _process_training_step_output,
    check_finite_loss,
)
from pytorch_lightning.plugins import ParallelPlugin
from pytorch_lightning.plugins import DDPPlugin, DDPSpawnPlugin
from pytorch_lightning.trainer.progress import OptimizationProgress
from pytorch_lightning.trainer.supporters import TensorRunningAccum
from pytorch_lightning.utilities import AMPType, AttributeDict, DeviceType, grad_norm
@@ -430,9 +430,10 @@ def block_ddp_sync_behaviour(self, should_block_sync: bool = False) -> Generator
        Returns:
            context manager with sync behaviour off
        """
        if isinstance(self.trainer.training_type_plugin, ParallelPlugin) and (
            self.trainer.lightning_module.automatic_optimization or should_block_sync
        ):
        if (
            isinstance(self.trainer.training_type_plugin, DDPPlugin)
Contributor:

`DDPPlugin` or `DDPSpawnPlugin`, right?

            or isinstance(self.trainer.training_type_plugin, DDPPlugin)
Comment on lines +434 to +435
Contributor:

Suggested change: replace

            isinstance(self.trainer.training_type_plugin, DDPPlugin)
            or isinstance(self.trainer.training_type_plugin, DDPPlugin)

with

            isinstance(self.trainer.training_type_plugin, (DDPPlugin, DDPSpawnPlugin))

        ) and (self.trainer.lightning_module.automatic_optimization or should_block_sync):
            with self.trainer.training_type_plugin.block_backward_sync():
                yield None
        else:
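For context, the block above exists so that intermediate accumulation steps can skip DDP's gradient all-reduce. Below is a minimal, illustrative sketch of the underlying PyTorch pattern that `block_backward_sync()` wraps; the function and variable names are hypothetical (not Lightning code), and a process group is assumed to be initialized already so that `DistributedDataParallel` can be constructed.

from contextlib import nullcontext

import torch
from torch.nn.parallel import DistributedDataParallel


def accumulation_steps(ddp_model: DistributedDataParallel, batches, optimizer, accumulate_every: int = 4):
    # Run backward on every batch, but all-reduce gradients only every `accumulate_every` steps.
    for i, (x, y) in enumerate(batches):
        is_sync_step = (i + 1) % accumulate_every == 0
        # no_sync() suppresses the gradient all-reduce; on the final accumulation
        # step we leave it off so gradients are synchronized across ranks.
        ctx = nullcontext() if is_sync_step else ddp_model.no_sync()
        with ctx:
            loss = torch.nn.functional.mse_loss(ddp_model(x), y)
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
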
14 changes: 14 additions & 0 deletions pytorch_lightning/plugins/training_type/ddp.py
@@ -19,6 +19,7 @@
import sys
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path
from time import sleep
from typing import Any, Dict, List, Optional, Union
@@ -442,3 +443,16 @@ def reconciliate_processes(self, trace: str):
                os.kill(pid, signal.SIGKILL)
        shutil.rmtree(sync_dir)
        raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")

    @contextmanager
    def block_backward_sync(self):
        """
        Blocks ddp sync gradients behaviour on backwards pass.
        This is useful for skipping sync when accumulating gradients, reducing communication overhead
        Returns: context manager with sync behaviour off
        """
        if isinstance(self.model, DistributedDataParallel):
@ananthsub (Contributor), Aug 25, 2021:

fwiw, ShardedDataParallel supports this too, but we're not taking advantage of it now due to this check 😞
https://fairscale.readthedocs.io/en/latest/_modules/fairscale/nn/data_parallel/sharded_ddp.html#ShardedDataParallel.no_sync

this is also masked with the current inheritance structure, as sharded doesn't override this
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/plugins/training_type/sharded.py

ideally, splitting up this inheritance like this

                       Parallel
           /      /             \       \
         DDP      Sharded      FSDP      Deepspeed

will make these opportunities more apparent

fyi @SeanNaren

Contributor:

Hey @ananthsub,

I am a bit concerned that making all plugins subclass directly from `Parallel` would result in a lot of duplicated code and higher maintenance cost, especially for sharded.

                       Parallel
           /      /             \       \
         DDP      Sharded      FSDP      Deepspeed

Contributor:

Alternatively, we could remove the `if isinstance(self.model, DistributedDataParallel):` check there.

@ananthsub (Contributor), Aug 25, 2021:

Removing the check for DDP would affect DeepSpeed and fully sharded.

Regarding code duplication: if we better abstract the subprocess launch or `start_processes` in the DDP and DDP spawn plugins so that code can be shared, would that address your concern? Are there other parts of the code where you are worried about duplication?

My concern with the inheritance we have now is that things can be silently left uncalled.

Contributor:

@tchaton especially with FSDP and Deepspeed, checkpoint loading and saving is so different from ddp and sharded

@tchaton (Contributor), Aug 26, 2021:

IMO, I would prefer to avoid it, but it can have some pros too as you shared there.
I am not against FB trying to PoC a refactor.
@justusschock do you agree with this, as you designed it based on inheritance?

Member:

@tchaton I designed it based on inheritance to avoid code duplication. However, as we get more and more different kinds of plugins, I think it could make sense to split them out into minimal mixins (like the one @ananthsub shared above) and then make the actual plugins inherit from them.

I know that we decided against mixins, but I think those mixins together with a purely abstract interface class are the best way to tackle this.

Contributor:

Sounds good to me! @ananthsub Mind creating an [RFC] for refactoring the Accelerator around base components and tagging the person assigned on your side?

            with self.model.no_sync():
                yield None
        else:
            yield None
Comment on lines +447 to +458
Contributor:

n00b question: do you think this ought to be its own mini interface to represent this trait?

from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Generator


class PluginWithBlockBackwardSync(ABC):

    @contextmanager
    @abstractmethod
    def block_backward_sync(self) -> Generator:
        ...

this way, we only need to check `isinstance(self.trainer.training_type_plugin, PluginWithBlockBackwardSync)` in the training batch loop.

otherwise I'm not sure how the isinstance check would work for custom plugins that require this

(pls ignore the verbose naming)

Contributor (Author):

Yeah, that's a good point, if a custom plugin needs this.

Contributor:

@tchaton @awaelchli @justusschock what do you think?

Contributor:

Yes, I think we can explore making plugins more composable too.

Member:

@ananthsub I agree that we should explore this. My only concern in this direction (we had/have something similar for the trainer and module) is that sometimes it becomes hard to track what is implemented where (especially when debugging), which is why at some point we decided to avoid patterns like this.

I still think though, that together with good purely abstract interfaces this should be possible and is likely the best way to tackle this.

Contributor:

Yes, I think it is a balance of good code taste with reliable / general abstractions.
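
To make the proposal concrete, here is a rough sketch of how the loop-side check could look if such a trait interface existed; `PluginWithBlockBackwardSync` is the hypothetical name from the thread above and is not part of this PR.

# Hypothetical drop-in for the body of block_ddp_sync_behaviour: the concrete
# DDPPlugin/DDPSpawnPlugin checks collapse into a single check against the trait.
if isinstance(self.trainer.training_type_plugin, PluginWithBlockBackwardSync) and (
    self.trainer.lightning_module.automatic_optimization or should_block_sync
):
    with self.trainer.training_type_plugin.block_backward_sync():
        yield None
else:
    yield None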

14 changes: 14 additions & 0 deletions pytorch_lightning/plugins/training_type/ddp_spawn.py
@@ -14,6 +14,7 @@
import logging
import os
import re
from contextlib import contextmanager
from multiprocessing.queues import SimpleQueue
from typing import Any, Dict, List, Optional, Union

@@ -364,3 +365,16 @@ def register_plugins(cls, plugin_registry: Dict) -> None:
            description="DDPSpawn Plugin with `find_unused_parameters` as False",
            find_unused_parameters=False,
        )

    @contextmanager
    def block_backward_sync(self):
        """
        Blocks ddp sync gradients behaviour on backwards pass.
        This is useful for skipping sync when accumulating gradients, reducing communication overhead
        Returns: context manager with sync behaviour off
        """
        if isinstance(self.model, DistributedDataParallel):
            with self.model.no_sync():
                yield None
        else:
            yield None
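
For orientation, the code path added in both plugins is exercised when gradient accumulation is enabled with a DDP-based training type. A minimal usage sketch follows; the flags reflect the Lightning 1.4-era API and `MyModel` / `MyDataModule` are placeholders for user code, so treat this as an assumption rather than merged documentation.

import pytorch_lightning as pl

# MyModel / MyDataModule stand in for a user's LightningModule / LightningDataModule.
trainer = pl.Trainer(
    gpus=2,
    accelerator="ddp",          # selects DDPPlugin as the training type plugin
    accumulate_grad_batches=4,  # intermediate backward passes run under block_backward_sync()
)
trainer.fit(MyModel(), datamodule=MyDataModule())
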
14 changes: 0 additions & 14 deletions pytorch_lightning/plugins/training_type/parallel.py
@@ -13,7 +13,6 @@
# limitations under the License.
import os
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Any, List, Optional

import torch
@@ -121,19 +120,6 @@ def configure_sync_batchnorm(model: "pl.LightningModule") -> "pl.LightningModule":
"""
return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    @contextmanager
    def block_backward_sync(self):
        """
        Blocks ddp sync gradients behaviour on backwards pass.
        This is useful for skipping sync when accumulating gradients, reducing communication overhead
        Returns: context manager with sync behaviour off
        """
        if isinstance(self.model, DistributedDataParallel):
            with self.model.no_sync():
                yield None
        else:
            yield None

    def teardown(self) -> None:
        # Un-reference the wrapper if any was used.
        # todo (tchaton): Add support for all plugins.