Move block_backward_sync from ParallelPlugin to DDPPlugins #9101
```diff
@@ -31,7 +31,7 @@
     _process_training_step_output,
     check_finite_loss,
 )
-from pytorch_lightning.plugins import ParallelPlugin
+from pytorch_lightning.plugins import DDPPlugin, DDPSpawnPlugin
 from pytorch_lightning.trainer.progress import OptimizationProgress
 from pytorch_lightning.trainer.supporters import TensorRunningAccum
 from pytorch_lightning.utilities import AMPType, AttributeDict, DeviceType, grad_norm
```
```diff
@@ -430,9 +430,10 @@ def block_ddp_sync_behaviour(self, should_block_sync: bool = False) -> Generator
         Returns:
             context manager with sync behaviour off
         """
-        if isinstance(self.trainer.training_type_plugin, ParallelPlugin) and (
-            self.trainer.lightning_module.automatic_optimization or should_block_sync
-        ):
+        if (
+            isinstance(self.trainer.training_type_plugin, DDPPlugin)
+            or isinstance(self.trainer.training_type_plugin, DDPPlugin)
+        ) and (self.trainer.lightning_module.automatic_optimization or should_block_sync):
             with self.trainer.training_type_plugin.block_backward_sync():
                 yield None
         else:
```

Review comment on lines +434 to +435: `DDPPlugin` or `DDPSpawnPlugin`, right? The second `isinstance` check repeats `DDPPlugin`.

Suggested change:

```suggestion
            isinstance(self.trainer.training_type_plugin, DDPPlugin)
            or isinstance(self.trainer.training_type_plugin, DDPSpawnPlugin)
```
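For background, the `block_backward_sync` call that the loop now delegates to (added in the second file below) wraps `DistributedDataParallel.no_sync()`. Here is a minimal raw-PyTorch sketch of the pattern this enables: skipping the gradient all-reduce on all but the last micro-batch of an accumulation window. The names `ddp_model`, `optimizer`, and `batches` are illustrative placeholders, not identifiers from this PR:

```python
# Sketch only: `ddp_model` is assumed to be a module already wrapped in
# torch.nn.parallel.DistributedDataParallel, and `batches` an iterable of
# input tensors; both are illustrative placeholders.
accumulate_grad_batches = 4

for i, batch in enumerate(batches):
    last_in_window = (i + 1) % accumulate_grad_batches == 0
    if not last_in_window:
        # no_sync() suppresses the gradient all-reduce for this backward pass,
        # so gradients accumulate locally with no inter-process communication.
        with ddp_model.no_sync():
            ddp_model(batch).sum().backward()
    else:
        # The last micro-batch syncs as usual, all-reducing the accumulated grads.
        ddp_model(batch).sum().backward()
        optimizer.step()
        optimizer.zero_grad()
```

`block_ddp_sync_behaviour` hides this decision behind a context manager so the batch loop itself stays plugin-agnostic.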
```diff
@@ -19,6 +19,7 @@
 import sys
 import tempfile
 import time
+from contextlib import contextmanager
 from pathlib import Path
 from time import sleep
 from typing import Any, Dict, List, Optional, Union
```
```diff
@@ -442,3 +443,16 @@ def reconciliate_processes(self, trace: str):
                 os.kill(pid, signal.SIGKILL)
         shutil.rmtree(sync_dir)
         raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
+
+    @contextmanager
+    def block_backward_sync(self):
+        """
+        Blocks ddp sync gradients behaviour on backwards pass.
+        This is useful for skipping sync when accumulating gradients, reducing communication overhead
+        Returns: context manager with sync behaviour off
+        """
+        if isinstance(self.model, DistributedDataParallel):
+            with self.model.no_sync():
+                yield None
+        else:
+            yield None
```

Review thread on the `isinstance(self.model, DistributedDataParallel)` check:

- fwiw, `ShardedDataParallel` supports this too, but we're not taking advantage of it now due to this check 😞 This is also masked by the current inheritance structure; splitting up this inheritance would make these opportunities more apparent. fyi @SeanNaren
- Hey @ananthsub, I am a bit concerned that making all plugins subclass directly from `ParallelPlugin` would result in a lot of duplicated code and higher maintenance cost, especially for sharded.
- Alternatively, we could remove the check.
- Removing the check for DDP would affect DeepSpeed and fully sharded. Regarding code duplication: if we better abstract the subprocess launch / `start_processes` logic in the DDP and DDP spawn plugins so that code can be shared, would that address your concern? Are there other parts of the code where you're worried about duplication? My concern with the inheritance we have now is that things can silently go uncalled.
- @tchaton especially with FSDP and DeepSpeed, checkpoint loading and saving is so different from DDP and sharded.
- IMO, I would prefer to avoid it, but it can have some pros too, as you shared there.
- @tchaton I designed it based on inheritance to avoid code duplication. However, as we get more and more different kinds of plugins, I think it could make sense to split them out into minimal mixins (like the one @ananthsub shared above) and then make the actual plugins inherit from them. I know that we decided against mixins, but I think those mixins together with a purely abstract interface class are the best way to tackle this.
- Sounds good to me! @ananthsub mind creating an [RFC] for refactoring the Accelerator around base components and tagging the person assigned on your side?

Review thread on lines +447 to +458:

- n00b question: do you think this ought to be its own mini interface to represent this trait? That way, we only need to check for the trait; otherwise I'm not sure about the … (pls ignore the verbose naming)
- Yeah, that's a good point, if a custom plugin needs this.
- @tchaton @awaelchli @justusschock what do you think?
- Yes, I think we can explore making plugins more composable too.
- @ananthsub I agree that we should explore this. My only concern in this direction (we had/have something similar for the trainer and module) is that it sometimes becomes hard to track what is implemented where (especially when debugging), which is why at some point we decided to avoid patterns like this. I still think, though, that together with good purely abstract interfaces this should be possible and is likely the best way to tackle this.
- Yes, I think it is a balance of good code taste with reliable / general abstractions.
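To make the "mini interface" idea from the thread concrete, here is a minimal sketch of what such a trait could look like. The class name `BlockBackwardSyncMixin` and its exact shape are hypothetical, not part of this PR; the sketch only illustrates dispatching on a capability rather than on concrete plugin classes:

```python
from abc import ABC
from contextlib import contextmanager
from typing import Generator

from torch.nn.parallel import DistributedDataParallel


class BlockBackwardSyncMixin(ABC):  # hypothetical name, not from the PR
    """Trait for plugins whose wrapped model can skip gradient sync on backward."""

    model: DistributedDataParallel

    @contextmanager
    def block_backward_sync(self) -> Generator[None, None, None]:
        # Delegate to DDP's no_sync() so this backward pass accumulates
        # gradients locally instead of all-reducing them.
        with self.model.no_sync():
            yield
```

With such a trait, the training loop could check `isinstance(self.trainer.training_type_plugin, BlockBackwardSyncMixin)` instead of enumerating `DDPPlugin` and `DDPSpawnPlugin`, and plugins like the sharded one could opt in simply by inheriting the mixin.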