Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Model Parallel] Add configure sharded model hook #6679

Merged
merged 30 commits into from
Mar 29, 2021

Conversation

kaushikb11
Copy link
Contributor

@kaushikb11 kaushikb11 commented Mar 25, 2021

What does this PR do?

Adds a configure sharded model hook. This is required for both DeepSpeed and Fully Sharded. Both teams have defined a context in which you can wrap model layers to be sharded instantly.

Why does it need to be a hook?

It needs to be a hook because model layers can only be sharded instantly after torch distributed is setup. This ensures that the distributed environment is setup, then calls the hook.

We also wrap the hook with the model parallel context for DeepSpeed/Fully Sharded so the user doesn't need to do anything extra.

What about fit/test/predict?

It will be treated like the setup hook for now, where we assume we're either using fit/test/predict and can call the hook for the user. We can iterate on this in further PRs.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks
Copy link

pep8speaks commented Mar 25, 2021

Hello @kaushikb11! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-29 19:15:47 UTC

@mergify mergify bot removed the has conflicts label Mar 25, 2021
@kaushikb11 kaushikb11 added the feature Is an improvement or enhancement label Mar 25, 2021
@kaushikb11 kaushikb11 added this to the 1.3 milestone Mar 25, 2021
@codecov
Copy link

codecov bot commented Mar 25, 2021

Codecov Report

Merging #6679 (aa35583) into master (646cf2f) will decrease coverage by 4%.
The diff coverage is 98%.

@@           Coverage Diff           @@
##           master   #6679    +/-   ##
=======================================
- Coverage      91%     87%    -4%     
=======================================
  Files         192     192            
  Lines       12189   12231    +42     
=======================================
- Hits        11143   10659   -484     
- Misses       1046    1572   +526     

pytorch_lightning/accelerators/accelerator.py Outdated Show resolved Hide resolved
pytorch_lightning/core/hooks.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/trainer.py Outdated Show resolved Hide resolved

class TestModel(BoringModel):

def on_model_parallel_setup(self):
Copy link
Contributor

@tchaton tchaton Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a check that the context manager is actually being yield.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it made me see an Issue in implementation. Fixinggg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to check the hook is not called within the callbacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can assume Users would be aware of the implications.

pytorch_lightning/accelerators/accelerator.py Outdated Show resolved Hide resolved
pytorch_lightning/accelerators/accelerator.py Outdated Show resolved Hide resolved
Comment on lines 32 to 34
def on_model_parallel_setup(self, trainer, pl_module: LightningModule) -> None:
"""Called before model parallel accelerator setup"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But those two snippets you shared don't use on_model_parallel_setup, they use model_parallel_context.

I think the question is more along the lines of "when would you override on_model_parallel_setup?"

pytorch_lightning/trainer/trainer.py Outdated Show resolved Hide resolved
@SeanNaren SeanNaren changed the title Add model parallel setup hook [Model Parallel] Add model parallel setup hook Mar 29, 2021
@SeanNaren SeanNaren changed the title [Model Parallel] Add model parallel setup hook [Model Parallel] Add configure sharded model hook Mar 29, 2021
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@SeanNaren SeanNaren self-assigned this Mar 29, 2021
SeanNaren added 2 commits March 29, 2021 20:05
# Conflicts:
#	pytorch_lightning/accelerators/accelerator.py
#	pytorch_lightning/plugins/training_type/training_type_plugin.py
@SeanNaren SeanNaren enabled auto-merge (squash) March 29, 2021 19:06
@mergify mergify bot removed the has conflicts label Mar 29, 2021
@SeanNaren SeanNaren merged commit f79a13e into master Mar 29, 2021
@SeanNaren SeanNaren deleted the feat/model_parallel_hook branch March 29, 2021 20:50
Comment on lines +1086 to +1094
def call_configure_sharded_model(self, model: LightningModule) -> None:
# Call configure sharded model hook if accelerator requests. In some cases
# we will not call the hook; the hook has initialized the sharded model for example.
if self.accelerator.call_configure_sharded_model_hook:
with self.accelerator.model_sharded_context():
model.configure_sharded_model()
self.configure_sharded_model(model)
self.accelerator.call_configure_sharded_model_hook = False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SeanNaren @kaushikb11 why's this needed in the trainer? could this be part of the accelerator setup? in L440 instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which part exactly? the trainer has control over the hook, we do not want the hook to be called within the accelerator i think

Copy link
Contributor

@ananthsub ananthsub Mar 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, on first read I thought that we could do this:

self.call_setup_hook(model)
self.accelerator.setup(self, model)  -> inside here apply the model sharded context and call model.configure_sharded_model()

# now that accelerator has finished doing the model sharding setup call the callbacks
self.configure_sharded_model(model)

the upside is that more of the training type + accelerator logic lives inside of the accelerator internals
but the downside is that this splits the callback hook from the sharded model setup which can be less convenient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree as long as we're ok with calling the model hook within the accelerator (cc @justusschock @awaelchli)

facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants