Feature/double precision #6595

ethanwharris · 2021-03-19T11:09:14Z

What does this PR do?

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

justusschock

Hi @ethanwharris,

Great to see you again :)

And great PR!
Just some smaller comments :)

pytorch_lightning/plugins/precision/double.py

tests/plugins/test_double_plugin.py

ethanwharris · 2021-03-19T14:00:48Z

Hi @justusschock,

Thanks for the great suggestions, all done :)

williamFalcon · 2021-03-19T19:10:59Z

omg.... i love this feature
😍😍😍😍😍

rohitgr7 · 2021-03-20T11:01:00Z

can we add support with deepspeed too?
https://github.com/PyTorchLightning/pytorch-lightning/blob/3a56a6024e0d8b239801cce558381807c24ba3d0/pytorch_lightning/plugins/training_type/deepspeed.py#L46-L48

not sure why it's handled this way in here, not an expert with deepspeed.

pytorch_lightning/trainer/connectors/accelerator_connector.py

pytorch_lightning/plugins/precision/double.py

SeanNaren · 2021-03-21T22:30:38Z

Thanks for the awesome contribution @ethanwharris!

Out of curiosity, if the user doesn't use the forward function of the LightningModule and instead calls independent modules, does it still work? For example, if I define my training_step like:

def training_step(self, batch, batch_idx):
    output = self.layer((batch, torch.ones_like(batch).long()))  # use layer instead of self(...)
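
To make the concern concrete, here is a minimal sketch in plain PyTorch (not the PR's code, and the layer shapes are made up for illustration). A submodule converted with .double() rejects a float32 batch unless something casts the input first, so casting only inside forward would miss a direct call to the layer:

```python
import torch
from torch import nn

# Illustrative layer only; the real model in the PR's tests may differ.
layer = nn.Linear(4, 2).double()  # parameters are now float64
batch = torch.rand(3, 4)          # dataloaders typically yield float32

try:
    layer(batch)                  # dtype mismatch between input and weights
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Casting the input first makes the direct call work.
out = layer(batch.double())
```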

tchaton

Nice !

ethanwharris · 2021-03-22T10:37:52Z

@SeanNaren That's an excellent point, totally forgot about that haha.

@SeanNaren @carmocca @justusschock I've made some changes to address your comments:

Moved the patch logic to a separate class that now handles teardown (much cleaner than before)
Added the patch to training_step etc. so that it all still works if the model is called directly without forward (and added a test of this behaviour)
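
The two changes above could be sketched roughly as follows. This is a hedged illustration, not the code in pytorch_lightning/plugins/precision/double.py: the names to_double, MethodPatch and Toy are hypothetical, and a plain nn.Module stands in for a LightningModule.

```python
import torch
from torch import nn


def to_double(value):
    """Cast floating-point tensors to float64; leave everything else untouched."""
    if isinstance(value, torch.Tensor) and value.is_floating_point():
        return value.double()
    if isinstance(value, (list, tuple)):
        return type(value)(to_double(v) for v in value)
    if isinstance(value, dict):
        return {k: to_double(v) for k, v in value.items()}
    return value


class MethodPatch:
    """Hypothetical helper: wraps one method and can undo the wrap later."""

    def __init__(self, obj, name):
        self.obj, self.name = obj, name
        self.original = getattr(obj, name)  # bound method from the class

        def wrapper(*args, **kwargs):
            return self.original(*to_double(args), **to_double(kwargs))

        setattr(obj, name, wrapper)  # instance attribute shadows the class method

    def teardown(self):
        delattr(self.obj, self.name)  # class method becomes visible again


class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch)  # calls the layer directly, not self(...)


model = Toy().double()
patch = MethodPatch(model, "training_step")
out = model.training_step(torch.rand(3, 4), 0)  # float32 batch is cast on the way in
patch.teardown()
```

Keeping the restore logic in one object means teardown only has to call a single method per patched function, which is presumably what makes the separate class cleaner.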

@rohitgr7 Not sure about deepspeed, don't feel comfortable trying to add it as I don't currently have a way of testing it out. It seems that at least DeepSpeed ZeRO requires precision=16:
https://github.com/PyTorchLightning/pytorch-lightning/blob/853523ee643fe0f0cc30d40d9e85a8869e7edfd8/tests/plugins/test_deepspeed_plugin.py#L193

Should all be ready for review now 😃

tests/helpers/boring_model.py

tests/plugins/test_double_plugin.py

pytorch_lightning/plugins/precision/double.py

carmocca

LGTM. Can you have a look at the failing tests?

justusschock

Two more minor comments, which allow subclassing (and don't hardcode the class name); otherwise LGTM.

pytorch_lightning/plugins/precision/double.py

ethanwharris · 2021-03-23T10:38:23Z

@carmocca I have finally fixed the tests 😩. trainer.predict calls torch.set_grad_enabled(False), which was then killing gradients for subsequent tests. I've added a call to torch.set_grad_enabled(True), which fixes this for now. I'll open an issue to change the predict behaviour to re-enable grads, maybe just switch to with torch.no_grad().
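
The difference between the two approaches can be shown with plain PyTorch. This is a minimal sketch of the grad-mode behaviour only and says nothing about how trainer.predict is implemented:

```python
import torch

# Called as a function, set_grad_enabled flips a global switch that stays
# off until something turns it back on.
torch.set_grad_enabled(False)
leaked = torch.is_grad_enabled()   # still disabled for any code that runs next

torch.set_grad_enabled(True)       # the manual restore described above

# The context manager restores the previous mode on exit.
with torch.no_grad():
    inside = torch.is_grad_enabled()
after = torch.is_grad_enabled()
```

With the context manager, a later test cannot inherit the disabled state, so no manual restore is needed.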

Co-authored-by: Carlos Mocholí <[email protected]>

Co-authored-by: Rohit Gupta <[email protected]>

Co-authored-by: Justus Schock <[email protected]>

…ter) to github/third-party/PyTorchLightning/pytorch-lightning Summary: ### New commit log messages ## [UnReleased] - 2021-MM-DD ### Added - Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667)) - Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492)) - Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417)) - Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)). - Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470)) - Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072)) - Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948)) - Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915)) - Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673)) - Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277)) - Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274)) - Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps 
outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370)) - Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633)) - Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139)) - Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543)) - Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621)) - Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618)) - Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120)) - Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679)) - Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595)) - Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736)) - Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677)) - Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764)) ### Changed - Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259)) - Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Changed 
profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621)) - Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349)) ### Deprecated - `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621)) - Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505), [#6530](Lightning-AI/pytorch-lightning#6530), [#6540](Lightning-AI/pytorch-lightning#6540), [#6547](Lightning-AI/pytorch-lightning#6547), [#6515](Lightning-AI/pytorch-lightning#6515), [#6572](Lightning-AI/pytorch-lightning#6572), [#6573](Lightning-AI/pytorch-lightning#6573), [#6584](Lightning-AI/pytorch-lightning#6584), [#6636](Lightning-AI/pytorch-lightning#6636), [#6637](Lightning-AI/pytorch-lightning#6637), [#6649](Lightning-AI/pytorch-lightning#6649), [#6659](Lightning-AI/pytorch-lightning#6659), ) ### Removed - Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164)) - Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139)) - Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166)) - Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163)) - Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161)) * from 
`pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve` * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce` - Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162)) - Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167)) - Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016)) - Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207)) - Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734)) - Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093)) ### Fixed - Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802)) - Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011)) - Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070)) - Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109)) - Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436)) - Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed 
`.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416)) - Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506)) - Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705)) ## [1.2.7] - 2021-04-06 ### Fixed - Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741)) - Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828)) - Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836)) - Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588)) - Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816)) - Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730)) ## [1.2.6] - 2021-03-30 ### Changed - Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498)) ### Removed - Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. 
([#6682](Lightning-AI/pytorch-lightning#6682)) ### Fixed - Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398)) - Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657)) - Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719)) ## [1.2.5] - 2021-03-23 ### Changed - Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576)) - Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590)) ### Fixed - Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587)) - Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434)) - Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275)) - Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565)) Reviewed By: shuyingsunshine21 Differential Revision: D27528929 fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f

ethanwharris requested review from awaelchli, Borda, carmocca, edenlightning, justusschock, SeanNaren, tchaton and williamFalcon as code owners March 19, 2021 11:09

ethanwharris added the feature label (Is an improvement or enhancement) Mar 19, 2021

justusschock reviewed Mar 19, 2021

rohitgr7 reviewed Mar 20, 2021

carmocca reviewed Mar 21, 2021

ethanwharris marked this pull request as draft March 22, 2021 08:57

tchaton approved these changes Mar 22, 2021

ethanwharris marked this pull request as ready for review March 22, 2021 10:37

carmocca reviewed Mar 22, 2021

carmocca added this to the 1.3 milestone Mar 22, 2021

rohitgr7 reviewed Mar 22, 2021

carmocca approved these changes Mar 22, 2021

ananthsub approved these changes Mar 23, 2021

mergify bot added the has conflicts label Mar 23, 2021

justusschock approved these changes Mar 23, 2021

ethanwharris force-pushed the feature/double_precision branch from 4816bda to 11b5bb8 Compare March 23, 2021 08:56

mergify bot removed the has conflicts label Mar 23, 2021

ethanwharris and others added 19 commits March 23, 2021 16:10

Switch to .double()

df6d847

Add check for original float32 data

72c9be4

Enhance tests for double precision

423302f

Update tests/plugins/test_double_plugin.py

b654be2

Co-authored-by: Carlos Mocholí <[email protected]>

Update tests/plugins/test_double_plugin.py

b9c662b

Co-authored-by: Carlos Mocholí <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

dd608b3

Co-authored-by: Carlos Mocholí <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

982767a

Co-authored-by: Carlos Mocholí <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

f92dd2c

Co-authored-by: Carlos Mocholí <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

68dce05

Co-authored-by: Carlos Mocholí <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

e8af281

Co-authored-by: Carlos Mocholí <[email protected]>

Move RandomFloatIntDataset

9a8c021

Fix type hint

6489776

Update pytorch_lightning/plugins/precision/double.py

f527a41

Co-authored-by: Rohit Gupta <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

3da2d05

Co-authored-by: Rohit Gupta <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

2e74cff

Co-authored-by: Rohit Gupta <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

a7507ad

Co-authored-by: Justus Schock <[email protected]>

Update pytorch_lightning/plugins/precision/double.py

fa323a2

Co-authored-by: Justus Schock <[email protected]>

Add type hints to args and kwargs

23b21c5

Fix failing tests

210fd87

ethanwharris requested review from ananyahjha93 and teddykoker as code owners March 23, 2021 16:15

ethanwharris force-pushed the feature/double_precision branch from ffcd941 to 210fd87 Compare March 23, 2021 16:17

ethanwharris removed request for ananyahjha93 and teddykoker March 23, 2021 16:17

ethanwharris added 3 commits March 23, 2021 16:20

Switch predict to predict_step

e7b6c7f

Merge branch 'master' into feature/double_precision

925d109

Remove line from test no longer needed

59c093b

rohitgr7 approved these changes Mar 24, 2021

kaushikb11 merged commit d02fe34 into Lightning-AI:master Mar 24, 2021
