Expose DeepSpeed FP16 parameters due to loss instability #6115

SeanNaren · 2021-02-21T13:31:59Z

What does this PR do?

We received a few reports that DeepSpeed quickly produces NaNs when using ZeRO. This seems to be related to the default loss scaling values, which are not currently exposed. As a result, the user has to override the entire configuration just to set these values.
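One plausible mechanism, shown here as a minimal sketch of dynamic loss scaling in general and not DeepSpeed's actual implementation: if the initial scale is a large power of two, the scaled values exceed the fp16 range and steps are skipped until the scale has been halved enough times. The function `simulate_loss_scale` and its arguments are hypothetical and exist only for this illustration.

```python
FP16_MAX = 65504.0  # largest finite value representable in IEEE half precision


def simulate_loss_scale(loss: float, initial_scale_power: int, min_loss_scale: float = 1.0):
    """Return (final_scale, skipped_steps) for a constant unscaled loss.

    Hypothetical illustration: halve the scale whenever loss * scale would
    overflow fp16, and count how many steps are skipped along the way.
    """
    scale = 2.0 ** initial_scale_power
    skipped = 0
    while loss * scale > FP16_MAX and scale > min_loss_scale:
        scale = max(scale / 2.0, min_loss_scale)
        skipped += 1
    return scale, skipped


# Assumed example values: a loss of 1.0 and a starting scale of 2**32.
final_scale, skipped_steps = simulate_loss_scale(loss=1.0, initial_scale_power=32)
print(final_scale, skipped_steps)
```

In this toy model a smaller starting power reaches a usable scale with fewer skipped steps, which is why being able to set the value matters.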

This PR exposes these values so that the user can set them directly. I considered putting them into the Precision plugin, but then the training_type_plugin would need to be aware of the precision plugin, and I'm not a fan of that given the longer-term plan of having the Training Type plugin handle precision.
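As a rough sketch of what exposing these values could look like, the hypothetical helper below assembles the `fp16` section of a DeepSpeed config from keyword arguments. The key names (`loss_scale`, `initial_scale_power`, `loss_scale_window`, `hysteresis`, `min_loss_scale`) follow DeepSpeed's documented FP16 options; the helper name and the default values shown are assumptions for illustration, not necessarily the plugin's actual signature.

```python
from typing import Any, Dict


def build_fp16_config(
    loss_scale: float = 0,           # 0 selects dynamic loss scaling in DeepSpeed
    initial_scale_power: int = 32,   # assumed default; initial scale is 2**power
    loss_scale_window: int = 1000,   # assumed default
    hysteresis: int = 2,             # assumed default
    min_loss_scale: int = 1,         # assumed default
) -> Dict[str, Any]:
    """Build the fp16 block of a DeepSpeed-style config (illustrative helper)."""
    return {
        "fp16": {
            "enabled": True,
            "loss_scale": loss_scale,
            "initial_scale_power": initial_scale_power,
            "loss_scale_window": loss_scale_window,
            "hysteresis": hysteresis,
            "min_loss_scale": min_loss_scale,
        }
    }


# Override a single value without rewriting the rest of the config.
config = build_fp16_config(initial_scale_power=16)
print(config)
```

Keeping the values as plain keyword arguments means a user only touches the one setting that causes trouble and leaves the rest at their defaults.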

I've also included a few tests for edge cases that needed checking, and I'm seeing whether the single-GPU tests still need to run as special tests (I might need to revert this; we'll see).

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

…y in parameters

tests/plugins/test_deepspeed_plugin.py

codecov · 2021-02-21T13:34:00Z

Codecov Report

Merging #6115 (50224ee) into master (3b0e4e0) will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #6115   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         160     160           
  Lines       11405   11405           
======================================
- Hits        10661   10629   -32     
- Misses        744     776   +32
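
The rounded figures above hide the size of the change. A small sketch recomputing coverage from the Hits and Lines rows of the report (the function name `coverage_percent` is just for illustration):

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Values taken from the Hits and Lines rows of the report.
before = coverage_percent(10661, 11405)
after = coverage_percent(10629, 11405)
print(f"{before:.2f}% -> {after:.2f}% ({after - before:+.2f} points)")
```

The 32 fewer hits amount to a drop of well under one percentage point, which is why the summary line rounds it to 0%.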

tests/plugins/test_deepspeed_plugin.py

pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <[email protected]>

* Expose deepspeed config parameters to init function due to instability in parameters
* See if tests can run on normal CI, without special tests
* Add changelog
* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>

(cherry picked from commit 432e563)

* Expose deepspeed config parameters to init function due to instability in parameters
* See if tests can run on normal CI, without special tests
* Add changelog
* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>

(cherry picked from commit 432e563)

Add missing config

* Add hint in docs for how to use shared memory (#6036) * Prevent flickering progress bar (#6009) * add padding * fix * fix * Update pytorch_lightning/callbacks/progress.py Co-authored-by: Carlos Mocholí <[email protected]> * updated based on suggestion * changelog * add test * fix pep8 * resolve test * fix code format Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: tchaton <[email protected]> * Fix Wrapping optimizers upon assignment (#6006) * Update properties.py * pep8 * [Bugfix] Apply untoggle_optimizer when result is None (#5983) * update changelog * apply untoggle_optimizer when result is None * update tests * still return loss sometimes * Update CHANGELOG.md Co-authored-by: deng-cy <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> * remove outdated info (#6032) * DeepSpeed Integration (#5954) * Add initial deepspeed changes * Address code review * Move static method outside of function * Fixes * Add missing annotation * Remove seed setting * Doc changes * Doc changes, add address reviews * Fix docs * Try fixing issue by moving to torch adam * Clean up check * Changes, better APIs! * Add wrapper, swap to git install revision * Add special test * Add warning * Address review * Add better disclaimer * Turn off ZeRO for testing due to compilation * Add description on modifying parameters via the plugin * Doc strings clear * Small doc fixes * Fix hash, reduce test * Added CI change * Move to azure pipeline * Fix test name * Add missing flag * Remove sudo... 
* Try conda instead * Swap to conda base * Try suggested install * Apply suggestions from code review * Apply suggestions from code review * Revert "Apply suggestions from code review" This reverts commit 41cca05a * Revert "Apply suggestions from code review" This reverts commit e06ec29e * Remove setter * Address most review * Move out function, remove DeepSpeed from requirements * Install deepspeed/mpi4py within container * Use special tests, move to master commit for deepspeed * Export path * Force compile to happen first * Remove! * Debugging ninja * Fix error in optimizer step logic * Attempt to fix symbolic link * Reverse to aid debugging * Export path again * Clean up mess * var * Revert "var" This reverts commit 3450eaca * Address review, add todo * Add note about unsupported functionality Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: tchaton <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> * Trainer only references accelerator (#6039) * Trainer only references accelerator where it can * Move teardown to the trainer, as it is reponsible for the accelerator * Address code review for deepspeed (#6042) * [feat] Add Trainer(stochastic_weight_avg=True/False) (#6038) Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Kaushik B <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> * [CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043) * Move to CUDA image * Remove deepspeed install as deepspeed now in the cuda image * Remove path setting, as ninja should be in the container now * drop deprecated result object 1/n (#5005) * ro1 * ro2 * Add option for weight tying on TPU's (#5441) * added on_post_move_to_device * added tests * docs and refactors * Update tests/backends/test_tpu_backend.py Co-authored-by: Jirka Borovec <[email protected]> * Update docs/source/tpu.rst Co-authored-by: Jirka Borovec <[email protected]> * Update docs/source/tpu.rst Co-authored-by: Jirka 
Borovec <[email protected]> * Update pytorch_lightning/core/decorators.py Co-authored-by: Jirka Borovec <[email protected]> * Update pytorch_lightning/core/decorators.py Co-authored-by: Jirka Borovec <[email protected]> * Update docs/source/tpu.rst Co-authored-by: Rohit Gupta <[email protected]> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <[email protected]> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <[email protected]> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <[email protected]> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <[email protected]> * Update pytorch_lightning/core/hooks.py Co-authored-by: Rohit Gupta <[email protected]> * moved weight sharing module back to test updated tpu available * add count to warning * fix doctest * import trainer in doctest * import trainer in doctest * do not test code as no TPU device * param count to layer count * formatting * update docs * update import * update * resolve tests * remove legacy accelerator Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: tchaton <[email protected]> Co-authored-by: Your Name <[email protected]> * Delete tests.helpers.TrialMNISTDataModule (#5999) * Remove TrialMNISTDataModule * Allow using TrialMNIST in the MNISTDataModule * Update tests/helpers/datasets.py * Fix: Allow hashing of metrics with lists in their state (#5939) * Fix: Allow hashing of metrics with lists in their state * Add test case and modify semantics of Metric __hash__ in order to be compatible with structural equality checks * Fix pep8 style issue Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * et al. (#6050) * et al. 
* Apply suggestions from code review * Apply suggestions from code review Co-authored-by: chaton <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: chaton <[email protected]> * [ModelPruning] Add missing attribute with use_global_unstructured=False and verbose (#6045) * fix/test quant (#6040) * fix/test quant * ... * --- * Add descriptions to accelerator broadcast function/clean up all_gather (#6044) * Add descriptions to accelerator broadcast function/clean up all_gather * Remove todo * Add before_batch_transfer and after_batch_transfer hooks (#3671) * add hooks * comment * docs * add tests * make it private * fix tests * docs * chlog * testcode * codefactor * fix doctest * fix doctest * suggestions * is always overriden * pep and BoringModel * BoringModel * docs * docs * docs * fix * rebase * rebase * suggestions * docs * suggestions * try fix docs * docs * update name * yapf * docs * rebase * yapf * Make parallel devices optional across all plugins (#6051) * Make parallel devices optional across all plugins so that they can be instantiated * Add any to types to capture vars passed in * clarify gpu / process (#6049) * Fix docs typo (#6055) Put .test() in code blocks * Docs for Pruning, Quantization, and SWA (#6041) Co-authored-by: chaton <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Sean Naren <[email protected]> * Replace .get_model() with explicit .lightning_module (#6035) * rename get_model -> lightning_module * update references to get_model * pep8 * add proper deprecation * remove outdated _get_reference_model * fix cyclic import * rename accelerator_backend -> accelerator (#6034) * rename accelerator backend * rename new additions from master * add proper deprecation * pep8 * warning match * add missing warning type * fix flake8 for new plugins (#5951) * flake8 * fix cyclic import * isort * fix docs links (#6057) * Add warnings to 
on_before/after_batch_transfer hooks (#6059) * Add warnings to hooks * Add default idx to prevent signature change in the future * Nothing to see here * Add default val to transfer_batch_to_device hook * Apply suggestions from code review Co-authored-by: Jirka Borovec <[email protected]> * Revert "Add default val to transfer_batch_to_device hook" This reverts commit 5c6a68f2 Co-authored-by: Jirka Borovec <[email protected]> * v1.2.0rc2 (#6063) * v1.2.0rc2 * chlogs * chlogs * format * Apply suggestions from code review Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * Update auto-opt docs (#6037) * fix docs * update on comments * Apply suggestions from code review Co-authored-by: Nicki Skafte <[email protected]> * Apply suggestions from code review Co-authored-by: Nicki Skafte <[email protected]> * Apply suggestions from code review Co-authored-by: Carlos Mocholí <[email protected]> * rm comment * Update docs/source/common/lightning_module.rst Co-authored-by: chaton <[email protected]> Co-authored-by: Nicki Skafte <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: chaton <[email protected]> * Raise AttributeError in lightning_getattr and lightning_setattr when attribute not found (#6024) * Empty commit * Raise AttributeError instead of ValueError * Make functions private * Update tests * Add match string * Apply suggestions from code review Co-authored-by: Adrian Wälchli <[email protected]> * lightning to Lightning Co-authored-by: Adrian Wälchli <[email protected]> * default sched (#6062) * v1.2.0 (#6065) * v1.2.0 * docs * add Azure tags trigger (#6066) * add Azure tags trigger * fix * mnodes * pypi azure badges - tags (#6068) * pypi azure badges - tags * pep8 * id * continue towards 1.3 (#6069) * Fix amp autocast (#6080) * precision fixes * add amp test model * fix test * revert * move assert to training step * fix test * fix test * 
remove unrelated changes * add changelog * remove unused import * add sanity check on nb available GPUs (#6092) * consistent behavior for reduce method across all Plugins (#6011) * reduction docs * docs for abstract base method * make mean the default * add preliminary chlog Co-authored-by: Jirka Borovec <[email protected]> * [Hot Fix] Give priority to plugins to set distributed mode, and then accelerator (#6089) * Give priority to plugins to set distributed mode, and then accelerator * Add CHANGELOG.md * Update CHANGELOG.md * Remove very scary line * Ensure we set cluster environment after slurm configured if necessary * Simplify the fix with a reset Co-authored-by: Carlos Mocholí <[email protected]> * Enable ZeRO tests for CI, fix to/half function calls (#6070) * Enable ZeRO optimization, and make sure that the lightning module hook is called when we move to half precision * Added test, update to function * Expose DeepSpeed FP16 parameters due to loss instability (#6115) * Expose deepspeed config parameters to init function due to instability in parameters * See if tests can run on normal CI, without special tests * Add changelog * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> * Collapse 2 DeepSpeed tests (#6108) * fix amp/apex misconfiguration error for cpu (#6107) * fix weird test * fix apex plugin test * fix raise * cpu test * fix type * add changelog * Update Contributing Guide (#6118) * Update Contributing Guide * update docs * Minor fixes/improvements in Metric docs (#6114) * Fix wrong render * Improve classification metrics docs * Improve other domain metrics docs * Change the structure level in the docs * Avoid printing ModelCheckpoint log with monitor=None and verbose=True (#6109) * Feature/5275 clean progress bar print (#5470) * Trainer.test should return only test metrics (#5214) * resolve bug * merge tests * Fix metric state reset 
(#5273) * Fix metric state reset * Fix test * Improve formatting Co-authored-by: Ananya Harsh Jha <[email protected]> * print() method added to ProgressBar * printing alongside progress bar added to LightningModule.print() * LightningModule.print() method documentation updated * ProgressBarBase.print() stub added * stub * add progress bar tests * fix isort * Progress Callback fixes * test_metric.py duplicate DummyList removed * PEP and isort fixes * CHANGELOG updated * test_progress_bar_print win linesep fix * test_progress_bar.py remove whitespaces * Update CHANGELOG.md Co-authored-by: chaton <[email protected]> Co-authored-by: Tadej Svetina <[email protected]> Co-authored-by: Ananya Harsh Jha <[email protected]> Co-authored-by: Alexander Snorkin <[email protected]> Co-authored-by: rohitgr7 <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> * mini refactor for _running_stage access (#5724) * running stage * circular import * running stage cleanup * fix unused import * fix running stage access * add return type * Revert "add return type" This reverts commit 65b0fe269c6547213e34b6a88b97bee31cdfe8c7. * try fix typing * Add specifics around DeepSpeed docs (#6142) * Be more specific with DeepSpeed compatibility * Better wording * Ensure accelerator is valid if running interactively (#5970) Co-authored-by: chaton <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Carlos Mocholi <[email protected]> * fixing miss-leading tested acc values (#5876) * fixing tested values * . * tests * yapf * softmax * hvd * rename * lr * duplicate * drop * classif * rm EvalModel * Revert "rm EvalModel" This reverts commit 6c3fb39ebe0c4bfb52357bccfd050438f2c0f31c. 
* update tests * fix * azure * azure * self * cpu * Apply suggestions from code review Co-authored-by: rohitgr7 <[email protected]> * Update CHANGELOG (#6156) * prune deprecated profiler as bool (#6164) * prune profiler * chlog * prune deprecated Trainer arg `enable_pl_optimizer` (#6163) * prune enable_pl_optimizer * prune automatic_optimization * Prune deprecated metrics for 1.3 (#6161) * prune deprecated metrics for 1.3 * isort / yapf * [Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0 (#6075) * fix bug * fix tests * changelog * fix pep8 * fix tests * fix and add some tests * add test for rlop * chlog * Update CHANGELOG.md Co-authored-by: rohitgr7 <[email protected]> * Prune deprecated checkpoint arguments (#6162) * prune prefix * prune mode=auto * chlog * Prune deprecated EarlyStopping(mode='auto') (#6167) Co-authored-by: Roger Shieh <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * Fix typo (#6178) * Update issue template to use discussions for questions (#6155) * add issue config * remove question template * update URL * Update README.md * Update README.md Co-authored-by: Rohit Gupta <[email protected]> * Update .github/ISSUE_TEMPLATE/config.yml Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * Update with GitHub Discussions (#6186) * Update gpu warning (#6181) Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Kaushik Bokka <[email protected]> * type accelerators (#6148) * Fix for multiple callbacks (#6197) * Fix for multiple callbacks * Add CHANGELOG.md * Remove old params * Skip tests on windows using ddp * Change name of the variable to not clash with should stop, which is separate * Apply suggestions from code review * Fix params Co-authored-by: Jirka Borovec <[email protected]> * Add checkpoint parameter to on_save_checkpoint (#6072) Co-authored-by: Kaushik B <[email protected]> * Document exceptions in loggers (#6171) * Document 
exceptions in loggers * minor formatting * docstring changed in comet.py * Apply suggestions from code review Co-authored-by: Rohit Gupta <[email protected]> * Prune deprecated Trainer(checkpoint_callback=ModelCheckpoint()) (#6166) * fix parallel devices return type & add copyright (#6215) * Add mypy typing to precision plugins. (#6149) Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Akihiro Nitta <[email protected]> * apply_func.py: from torchtext.legacy.data import Batch (#6211) * Update apply_func.py The name Batch is no longer located under torchtext.data --Error message-- File "/home/daniel/py38/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 25, in <module> from torchtext.data import Batch ImportError: cannot import name 'Batch' from 'torchtext.data' (/home/daniel/py38/lib/p ython3.8/site-packages/torchtext/data/__init__.py) You can fix this by changing line line 28 to: from torchtext.legacy.data import Batch * Update apply_func.py * Update apply_func.py * Update apply_func.py * Update apply_func.py * Update apply_func.py * fix(wandb): prevent WandbLogger from dropping values (#5931) Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: chaton <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * Prune deprecated hparams setter (#6207) * document exceptions for metrics/regression (#6202) Co-authored-by: Akihiro Nitta <[email protected]> Co-authored-by: Prajakta Phadke <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * simplify skip-if tests >> 0/n (#5920) * skipif + yapf + isort * tests * docs * pp * update (#6237) * Document Exceptions in profilers (#6229) * docstring changes in profilers * minor changes in profilers.py * Call `optimizer.zero_grad()` before backward inside closure in AutoOpt (#6147) Co-authored-by: Carlos Mocholi <[email protected]> * Fix for incorrect usage of detach(), cpu(), to() (#6216) * Fix 
for incorrect detach/cpu calls (#6214) * Fix incorrect use of detach(), to(), and cpu(), #6214 * Fix incorrect use of detach() and cpu(), #6214 * update pr * add typing * chlog * more... * revert on module * update on comments * revert changes on model Co-authored-by: tchaton <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> * add skipif warpper (#6258) * cleaning SWA (#6259) * rename * if * test * chlog * Remove opt from manual_backward in docs (#6267) * switch agents pool (#6270) * docstring changes in tuner (#6264) * docstring changes in tuner * added full stop * Disable CPU Offload as default for DeepSpeed (#6262) * Change default for CPU offload to false for best throughput/memory efficiency * Add changelog * default Co-authored-by: Jirka Borovec <[email protected]> * split profilers (#6261) * Refactor: skipif for multi - gpus 1/n (#6266) * ngpus * gpu * isort * pt * flake8 * Improved EarlyStopping.patience documentation (#6278) * Improved early stopping documentation * Changed to 120 column format * doc * doc * doc Co-authored-by: Jirka Borovec <[email protected]> * Refactor: skipif for Windows 2/n (#6268) * win * isort * flake8 * fix duplicate console logging bug v2 (#6275) Co-authored-by: chaton <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> * Refactor: skipif for AMPs 3/n (#6293) * args * native * apex * isort * [fix] Ensure we check deepspeed/sharded in multinode DDP (#6297) * Ensure we check deepspeed/sharded in multinode * Add CHANGELOG.md * Add CHANGELOG.md * Drop mock, use actual multi-gpu node * unfreeze torchtext version (#6302) * Add possibility for custom naming when using multiple dataloaders (#6274) * try to fix imports for parsing (#6256) * try to fix imports * legacy 1.2.1 * Refactor: Runif for TPU and Horovod 5/n (#6301) * TPU * horovod * extra * fix * Apply suggestions from code review Co-authored-by: Nicki Skafte <[email protected]> * doc Co-authored-by: Nicki Skafte <[email protected]> * Refactor: 
runif for spec 6/6 (#6307) * special * rpc * Add fairscale & deepspeed to skipif 4/n (#6281) * add fairscale & windows to skipif * add deepspeed to runif * fairscale * deepspeed * flake8 Co-authored-by: Jirka Borovec <[email protected]> * [bugfix] TPU test hangs to barrier on 1 process (#6272) * update * resolve flake8 * update * update * update changelog * update * resolve flake8 Co-authored-by: Your Name <[email protected]> * prune duplicite test in optim (#6312) * Simplify test for AMP plugins (#6311) * AMP * fuse * yapf * Fix ModelPruning(make_pruning_permanent=True) buffers getting removed when saved during training (#6073) Co-authored-by: chaton <[email protected]> * [bugfix] TPU + all_gather + SingleTPU shouldn't call xm.all_gather (#6296) * resolve an issue with TPU * update * add changelog * drop unused variable in API (#6308) * drop unused pl model in ckpt * irelevant * on_evaluation_batch_start * evaluation_epoch_end * attach_datamodule * hotfix for PT1.6 and torchtext (#6323) * ci: azure reinstall torchtext * move * todos * 0.6.0 * skip examples * formatter * skip * todo * Apply suggestions from code review * [fix] Use training type plugin hook when saving (FSDP 1/n) (#6321) * Rely on training type plugin when saving * Add better typing to training type plugin * leaving lezwon (#6347) * Add `tests/utilities/test_parsing.py` (#4460) * Create branch tests/4400_parsing * Rename test file for parsing.py * Fix lightning_hasattr * Fix lightning_hasattr * Fix lightning_setattr * Add empty lines and remove rubbish spaces * Raise AttributeError not ValueError * Use getattr in hasattr * Remove rubbish spaces * Fix getattr * Fix by flake8 * Add tests for str_to_bool_or_str * Fix by flake8 * Add tests for str_to_bool * Add tests for is_picklable * Add tests for clean_namespace * Fix typo * Fix lightning_getattr * Add tests for AttributeDict * Add tests for flatten_dict * Fix by flake8 * Apply suggestions from code review Co-authored-by: Jirka Borovec <[email 
protected]> * Apply isort * Revert "Apply suggestions from code review" * Define unpicklable_function outside * Add comment to test_clean_namespace * Add tests for parse_class_init_keys * Add tests for get_init_args and collect_init_args * Share objects across the tests Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Ethan Harris <[email protected]> * Add ignore param to save_hyperparameters (#6056) * add ignore param to save_hyperparameters * add docstring for ignore * add type for frame object * Update pytorch_lightning/core/lightning.py Co-authored-by: Nicki Skafte <[email protected]> * Update pytorch_lightning/core/lightning.py Co-authored-by: Nicki Skafte <[email protected]> * fix whitespace * Update pytorch_lightning/core/lightning.py Co-authored-by: Nicki Skafte <[email protected]> * Parametrize tests * Update pytorch_lightning/core/lightning.py Co-authored-by: Rohit Gupta <[email protected]> * Update pytorch_lightning/core/lightning.py Co-authored-by: Rohit Gupta <[email protected]> * seq * fix docs * Update lightning.py * Update lightning.py * fix docs errors * add example keyword * update docstring Co-authored-by: Nicki Skafte <[email protected]> Co-authored-by: Carlos Mocholi <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * Fix when _stable_1d_sort to work when n >= N (#6177) * Fix when _stable_1d_sort to work when n >= N * Apply suggestions Co-authored-by: Carlos Mocholi <[email protected]> * Update docs on arg train_dataloader in fit (#6076) * add to docs * update docs * Apply suggestions from code review * Update pytorch_lightning/core/hooks.py Co-authored-by: Rohit Gupta <[email protected]> * nested loaders * Apply suggestions from code review Co-authored-by: Adrian Wälchli <[email protected]> * shorten text length * Update pytorch_lightning/core/hooks.py Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * missing 
tests default_root_dir=tmpdir (#6314) * default_root_dir=tmpdir * miss * Document exception for metrics/classification (#6190) * document exception for metrics/classification * minor formatting fixes * fix trailing whitespaces * document exception for metrics * Apply suggestions from code review Co-authored-by: Nicki Skafte <[email protected]> * Apply suggestions from code review Co-authored-by: Nicki Skafte <[email protected]> * Apply suggestions from code review Co-authored-by: Akihiro Nitta <[email protected]> Co-authored-by: Nicki Skafte <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Akihiro Nitta <[email protected]> * [Fix] Call clip gradients if clip val greater than 0 (#6330) * Call clip gradients if clip val greater than 0 * format * Format * Move to top of file * [bugfix] Check LightningOptimizer doesn't delete optimizer hooks (#6305) * update * resolve bug * docstring changes in accelerators (#6327) * docstring changes in accelerators * docstrings moved * whitespaces removed * PEP8 correction[1] * [bugfix] Perform reduction for dict in training_step and DP (#6324) * fix * update * update * add changelog * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <[email protected]> * Update tests/accelerators/test_dp.py Co-authored-by: Carlos Mocholí <[email protected]> * update changelog Co-authored-by: Carlos Mocholí <[email protected]> * introduce default cluster environment for lightning-specific ddp (#5915) * handle distributed_sampler_kwargs * move emptying cache to accelertor * fix a few tests * restoring the result from subprocess * fix queue.get() order for results * add missing "block_backward_sync" context manager * add missing "block_backward_sync" context manager * fix sync_batchnorm * fix supported gpu-ids for tuple * fix clip gradients and inf recursion * accelerator selection: added cluster_environment plugin * fix torchelastic test * fix reduce early stopping 
decision for DDP * fix tests: callbacks, conversion to lightning optimizer * fix lightning optimizer does not pickle * fix setting benchmark and deterministic option * fix slurm amp test * fix prepare_data test and determine node_rank * fix retrieving last path when testing * remove obsolete plugin argument * fix test: test_trainer_config * fix torchscript tests * fix trainer.model access * move properties * fix test_transfer_batch_hook * fix auto_select_gpus * fix omegaconf test * fix test that needs to simulate slurm ddp * add horovod plugin * fix test with named arguments * clean up whitespace * fix datamodules test * remove old accelerators * fix naming * move old plugins * move to plugins * create precision subpackage * create training_type subpackage * fix all new import errors * fix wrong arguments order passed to test * fix LR finder * Added sharded training type and amp plugin * Move clip grad to precision plugin * Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically * Fix import issue, attempting to fix tests * Fix initial test * Reflect hook logic from master, should wrap model after move to device * Optional state consolidation, since master has optimizers not wrapped * change attribute for instance test * reset optimizers optimizers are not used in main process, so state would be wrong. 
* legacy * imports in accel * legacy2 * trainer imports * fix import errors after rebase * move hook to new setup location * provide unwrapping logic * fix trainer callback system * added ddp2 implementation * fix imports .legacy * move plugins * restore legacy * drop test.py from root * add tpu accelerator and plugins * fixes * fix lightning optimizer merge * reset bugreportmodel * unwrapping * step routing forward * model access * unwrap * opt * integrate distrib_type * sync changes * sync * fixes * add forgotten generators * add missing logic * update * import * missed imports * import fixes * isort * mv f * changelog * format * move helper to parallel plugin * d * add world size * clean up * duplicate * activate ddp_sharded and tpu * set nvidia flags * remove unused colab var * use_tpu <-> on_tpu attrs * make some ddp_cpu and clusterplugin tests pass * Ref/accelerator connector (#5742) * final cleanup Co-authored-by: Adrian Wälchli <[email protected]> * connector cleanup Co-authored-by: Adrian Wälchli <[email protected]> * trainer cleanup Co-authored-by: Adrian Wälchli <[email protected]> * accelerator cleanup + missing logic in accelerator connector Co-authored-by: Adrian Wälchli <[email protected]> * add missing changes to callbacks Co-authored-by: Adrian Wälchli <[email protected]> * reflect accelerator changes to lightning module Co-authored-by: Adrian Wälchli <[email protected]> * clean cluster envs Co-authored-by: Adrian Wälchli <[email protected]> * cleanup plugins Co-authored-by: Adrian Wälchli <[email protected]> * add broadcasting Co-authored-by: Adrian Wälchli <[email protected]> * yapf * remove plugin connector Co-authored-by: Adrian Wälchli <[email protected]> * plugins * manual optimization * update optimizer routing * add rank to torchelastic * fix memory mixed precision * setstate on trainer for pickling in ddp spawn * add predict method * add back commented accelerator code * adapt test for sync_batch_norm to new plugin * fix deprecated tests * 
fix ddp cpu choice when no num_processes are given * yapf format * skip a memory test that cannot pass anymore * fix pickle error in spawn plugin * x * avoid * x * fix cyclic import in docs build * add support for sharded * update typing * add sharded and sharded_spawn to distributed types * make unwrap model default * refactor LightningShardedDataParallel similar to LightningDistributedDataParallel * update sharded spawn to reflect changes * update sharded to reflect changes * Merge 1.1.5 changes * fix merge * fix merge * yapf isort * fix merge * yapf isort * fix indentation in test * copy over reinit scheduler implementation from dev1.2 * fix apex tracking calls with dev_debugger * reduce diff to dev1.2, clean up * fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu * sort plugin tests legacy/new * fix error handling for amp on cpu * fix merge fix merge fix merge * [Feat] Resolve manual_backward (#5837) * resolve manual_backward * resolve flake8 * update * resolve for ddp_spawn * resolve flake8 * resolve flake8 * resolve flake8 Co-authored-by: Ubuntu <[email protected]> * fix tests/accelerator tests on cpu * [BugFix] Resolve manual optimization (#5852) * resolve manual_optimization * update * update Co-authored-by: Ubuntu <[email protected]> * Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856) * resovle a bug * Accelerator refactor sharded rpc (#5854) * rpc branch * merge * update handling of rpc * make devices etc. Optional in RPC * set devices etc. 
later if necessary * remove devices from sequential * make devices optional in rpc * fix import * uncomment everything * fix cluster selection Co-authored-by: Ubuntu <[email protected]> * resolve bug * fix assert in rpc test * resolve a test * fix docs compilation * accelerator refactor - fix for sharded parity test (#5866) * fix memory issue with ddp_spawn * x x x x x x x x x * x * Remove DDP2 as this does not apply * Add missing pre optimizer hook to ensure lambda closure is called * fix apex docstring * [accelerator][BugFix] Resolve some test for 1 gpu (#5863) * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * update * resolve flake8 * update * update * update * update * update * all_gather * update * make plugins work, add misconfig for RPC * update * update * remove breaking test * resolve some tests * resolve flake8 * revert to ddp_spawn Co-authored-by: root <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Justus Schock <[email protected]> * yapf isort * resolve flake8 * fix apex doctests * fix apex doctests 2 * resolve docs * update drone * clean env * update * update * update * update * merge * Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881) * Fix RPC related tests, clean out old API, update for new accelerator API * Move tests out of legacy folder, update paths and names * Update test_remove_1-4.py * Expose properties for tpu cores/gpus/num_gpus * Add root GPU property * Move properties to properties.py * move tests that were previously in drone * Fix root GPU property (#5908) * Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator * Add missing tests back * fix best 
model path transfer when no checkpoint callback available * Fix setup hook order [wip] (#5858) * Call trainer setup hook before accelerator setup * Add test case * add new test * typo * fix callback order in test Co-authored-by: tchaton <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * rename ddp sequential -> rpc sequential for special test * revert * fix stupid merge problem * abstract the cluster plugins * default plugin * integrate default environment * fix property * adapt tests * adjust test * fix world size access * base cluster env * revert rebase errors * revert rebase errors * missing import * revert unrelated change * remove unused cluster local rank * remove unrelated changes * fix unrelated changes * fix pep8 * remove unused var * reset permissions * ypaf * test default environment * test torchelastic environment * world size as int * tests for slurm environment * changelog * test comments * remove unintended change * keep master port fixed after it is generated * test random master port * yapf * add missing default environment * move helper function * rename default environment * rename * rename * yapf * Update pytorch_lightning/plugins/environments/lightning_environment.py Co-authored-by: Carlos Mocholí <[email protected]> * Update CHANGELOG.md Co-authored-by: Justus Schock <[email protected]> * spawn -> create Co-authored-by: justusschock <[email protected]> Co-authored-by: SeanNaren <[email protected]> Co-authored-by: Justus Schock <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Justus Schock <[email protected]> Co-authored-by: chaton <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Sean Naren <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> * [bugfix] Resolve memory leak for evaluation (#6326) * resolve bug * resolve flake8 * revert name * Update changelog for v1.2.2 (#6325) * update changelog 
for v1.2.2 * ckpr 1.2.2 Co-authored-by: Jirka Borovec <[email protected]> * CI: fix examples - patch download MNIST (#6357) * patch download * CI * isort * extra * [bug] Fix Pytorch profiler with emit_nvtx (#6260) * resolve bug * update changelog * Update tests/trainer/test_trainer.py * Update pytorch_lightning/profiler/profilers.py Co-authored-by: Jirka Borovec <[email protected]> * resolve comments * resolve flake8 Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> * fix importing torchtext batch (#6365) * copy torchtext batch * update * rev * rev * give a more complete GAN example (#6294) * Refactor RunningStage usage in advance of implementing Trainer.validate() (#4945) * Update code Co-authored-by: EliaCereda * More property updates * Move properties. Introduce trainer._fitting * Use trainer.fitting * Fix reset dataloaders * Unused code * RunningStage.SANITY_CHECKING * Use setters * Fix bugs * Fix bugs * TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING} * Fix bugs * Fix bugs * Fix tests * Update CHANGELOG. Add deprecation warning. Fix tests * Unused imports * Optional trainer * More deprecation. 
More refactoring * Correct version * Use properties * Address comments * flake8 * Missed renamings * Typo * is -> == It is recommended to use for Enums since they are singletons, however, since the LightningEnum subclasses str, it's not a good idea in case a user sets the state/stage with a str * Also for tests * Typo * Address @tchaton's comments * PEP8 * Correct property * Update CHANGELOG * Apply suggestions from code review Co-authored-by: Adrian Wälchli <[email protected]> * Update pytorch_lightning/trainer/trainer.py Co-authored-by: Adrian Wälchli <[email protected]> * Remove called sanity check Co-authored-by: Carlos Mocholi <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * require: adjust versions (#6363) * adjust versions * release * manifest * pep8 * CI * fix * build * Use f-"""-string in a Trainer comment (#6377) * Use f-"""-string * Add r * Use Trainer. * r -> noqa: W605 * Remove no return warning from val/test step (#6139) * remove warning * auto_opt * chlog * auto_opt * no_warning_call * rm old code * add warning for predict * Apply suggestions from code review Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * Fix manual optimization in pl_example (#6373) * Fix automatic_optimization * Fix automatic_optimization * Uncomment fairscale * Update Sharded test with RunIf (#6384) * Remove optimizer_idx arg in manual optimization (#6093) Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: chaton <[email protected]> * [doc] Improve Multiple Val/Test Dataloaders with simultaneous batches option (#6320) * improve doc to describe how to combine batches of multiple test and val dataloaders simultaneously * fix typo Co-authored-by: Adrian Wälchli <[email protected]> * use paramref Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * [doc] Fix closure in manual optimization (#6374) * Fix manual optimization docs * Fix typo. 
Thanks @import-antigravity * Fix ModelCheckpoint(monitor=None, save_last=True) not saving checkpoints (#6136) Co-authored-by: ananthsub <[email protected]> * Update TBLogger docs (#6315) * Update tensorboard.py * Update logging.rst * pep8 * Update logging.rst * Update logging.rst * Apply suggestions from code review * add code sample * Update logging.rst Co-authored-by: Jirka Borovec <[email protected]> * Fix trainer not resetting lightning_optimizers (#6372) Co-authored-by: Carlos Mocholí <[email protected]> * update python version (#6399) * Fix AttributeError: 'NoneType' object has no attribute 'finalize' on TPU (#6221) * Fix bug Fix AttributeError: 'NoneType' object has no attribute 'finalize' * Update CHANGELOG.md * deleted a period * Update CHANGELOG.md Co-authored-by: Akihiro Nitta <[email protected]> * Update CHANGELOG.md * Update pytorch_lightning/plugins/training_type/tpu_spawn.py Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Akihiro Nitta <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * Run CI (#6402) * Pass {fit,validate,test,predict} to setup() and teardown() (#6386) * fix dp reduction test (#6404) * fix * update * fix * move the class outside * Add check for verbose attribute of ModelCheckpoint (#6419) Co-authored-by: Adrian Wälchli <[email protected]> * fixed bug where tuner would not tune lr if also tuning batch_size (#4688) * fixed bug where tuner would not tune lr if also tuning batch_size * added a '+1' to computing the smoothed loss. 
This maintains the behavior for the smoothed loss as before the bug fix * pep8 fix * add changelog Co-authored-by: chaton <[email protected]> Co-authored-by: Carlos Mocholi <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * update (#6403) * fix logger creating directory structure too early in DDP (#6380) * fix * add simple test * fix imports * add changelog * tighter test with on_fit_start hook closer to the dispatch call * move class inside test f unction * add a comment * Typing for tests 1/n (#6313) * typing * yapf * typing * [changelog] Update Changelog on release v1.2.3 (#6444) * update changelog * legacy 1.2.3 Co-authored-by: Jirka Borovec <[email protected]> * Improve DummyLogger (#6398) * fix dummy logger * docs * update docs * add changelog * add none return annotation * return empty string for name, version * Raise an exception if check_val_every_n_epoch is not an integer (#6411) * raise an exception if check_val_every_n_epoch is not an integer * remove unused object * add type hints * add return type * update exception message * update exception message * Set find unused parameters to True by default to fix breaking compatibility (#6438) * Set find unused parameters to True by default to fix breaking models, add suggestion to re-enable * Add changelog * [bug] All_gather support tensor on cpu (#6416) * add test * update changelog * update * rename function * [Fix] Ensure we set the default device before initializing deepspeed (#6460) * Ensure we set the default device before initializing deepspeed * Add CHANGELOG.md * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Kaushik B <[email protected]> Co-authored-by: Kaushik B <[email protected]> * Remove redundant test (#6466) * Add Trainer.validate(…) method to run one validation epoch (#4948) Co-authored-by: Carlos Mocholi <[email protected]> Co-authored-by: chaton <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * Allow user to disable 
the automatic formatting of checkpoint file names. (#6277) * cleaning SWA (#6259) * rename * if * test * chlog * Remove opt from manual_backward in docs (#6267) * switch agents pool (#6270) * Allow user to disable the automatic formatting of checkpoint file names. * Added changelog entry. * Made flake8 happy. * Applied review suggestion: quotes for special characters in docstring Co-authored-by: Carlos Mocholí <[email protected]> * Fixed example in docstring. * Fixed syntax error in docstring. Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Akihiro Nitta <[email protected]> Co-authored-by: thomas chaton <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> * Hotfix for torchvision (#6476) * cover subproc coverage (#6477) * argparse: Add use_argument_group=True (#6088) * argparse: Add inplace option Replicate in GAN model * datamodule: Deduplicate logic w/ argparser utilities * Update pl_examples/domain_templates/generative_adversarial_net.py Co-authored-by: Jirka Borovec <[email protected]> * Apply suggestions from code review Co-authored-by: Akihiro Nitta <[email protected]> * Keep docstrings * Correct name * Whitespace * Consistency * fix weird type stuff * try alt - use_argument_group * fix syntax + lint * fix ci errs * fix ci * change examples... 
still failing w/ "unrecognized arguments: --batch_size" * address review * mnist_datamodule: add some docstrings * argparse: check cls or cls.__init__ for param didn't capture issue, but meh * fix lint * fix no-doc edge case * address review Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Akihiro Nitta <[email protected]> Co-authored-by: Carlos Mocholi <[email protected]> * Disable batch transfer in DP mode (#6098) * add exceptions and test * hook * fix * clean up * clean up * regex * regex * docs * rev * comment and docs * chlog * Apply suggestions from code review Co-authored-by: Carlos Mocholí <[email protected]> * Apply suggestions from code review Co-authored-by: chaton <[email protected]> * Monkey-patch device count * docs * pep * api_change Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: chaton <[email protected]> * remove obsolete todo in pl_examples (#6475) * [feat] Support iteration-based checkpointing in model checkpoint callback (#6146) * Update model_checkpoint.py * add tests * Update model_checkpoint.py * Update test_model_checkpoint.py * fix tests * every_n_batches * Update test_model_checkpoint.py * defaults * rm tests * Update model_checkpoint.py * Update test_model_checkpoint.py * Prune deprecated metrics for 1.3 (#6161) * prune deprecated metrics for 1.3 * isort / yapf * Update model_checkpoint.py * add tests * defaults * Update CHANGELOG.md * pre-commit * Update model_checkpoint.py * update defaults * Update test_remove_1-5.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * fix tests * Update test_model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update test_model_checkpoint.py * ckpt-callback * Update test_model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * validation-end * Update 
model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * clarify-names - Make names explicit as to which hooks they apply to - Use step instead of batch for consistency with global step * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * mutual-exclusive Make every_n_train_steps and every_n_val_epochs mutually exclusive * fix-default-0 * Update CHANGELOG.md * formatting * make-private make attributes private to the class * rebase Co-authored-by: Jirka Borovec <[email protected]> * update xla version (#6464) * Remove unused mixin attributes (#6487) * Remove unused mixing attributes * Missing import * [doc] Update the order of zero_grad and backward (#6478) * Fix zero_grad in docs * Fix zero_grad in docs * Fix tuner.scale_batch_size not finding batch size attribute when using datamodule (#5968) * Update docs for limit_predict_batches (#6507) * add docs and minor updates * docs * fraction * [bug] Update broadcast + reduce decision ModelCheckpoint] (#6410) * resolve bug * update * update changelog * update PR * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Carlos Mocholí <[email protected]> * add todo * resolve issues * resolve flake8 * update * add coverage for reduce * wip * restore back to brodbact * remove test.py * resolve flake8 * update * check world size * resolve test * update * use pytorch version when defined * update on comments * update on comments * flake8 * resolve bugs * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <[email protected]> * update * update * update * update * remove test * update * resolve flake8 * update * update * update * proxy * update * update * resolve typo * prune * update parallel * update Co-authored-by: Carlos Mocholí <[email protected]> * Handle torch.jit scripted modules in 
layer summary (#6511) * CI: resume testing with py3.8 (#6516) * testing on python 3.8 * req * document exceptions for metrics/functional (#6273) * document exceptions for metrics/functional * Apply suggestions from code review Co-authored-by: Rohit Gupta <[email protected]> * Apply suggestions from code review Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Akihiro Nitta <[email protected]> * Mean Average Precision metric for Information Retrieval (1/5) (#5032) * init information retrieval metrics * changed retrieval metrics names, expanded arguments and fixed typo * added 'Retrieval' prefix to metrics and fixed conflict with already-present 'average_precision' file * improved code formatting * pep8 code compatibility * features/implemented new Mean Average Precision metrics for Information Retrieval + doc * fixed pep8 compatibility * removed threshold parameter and fixed typo on types in RetrievalMAP and improved doc * improved doc, put first class-specific args in RetrievalMetric and transformed RetrievalMetric in abstract class * implemented tests for functional and class metric. 
fixed typo when input tensors are empty or when all targets are False * fixed typos in doc and changed torch.true_divide to torch.div * fixed typos pep8 compatibility * fixed types in long division in ir_average_precision and example in mean_average_precision * RetrievalMetric states are not lists and _metric method accepts predictions and targets for easier extension * updated CHANGELOG file * added '# noqa: F401' flag to not used imports * added double space before '# noqa: F401' flag * Update CHANGELOG.md Co-authored-by: Jirka Borovec <[email protected]> * change get_mini_groups in get_group_indexes * added checks on target inputs * minor refactoring for code cleanness * split tests over exception raising in separate function && refactored test code into multiple functions * fixed pep8 compatibility * implemented suggestions of @SkafteNicki * fixed imports for isort and added types annontations to functions in test_map.py * isort on test_map and fixed typing * isort on retrieval and on __init__.py and utils.py in metrics package * fixed typo in pytorch_lightning/metrics/__init__.py regarding code style * fixed yapf compatibility * fixed yapf compatibility * fixed typo in doc Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Nicki Skafte <[email protected]> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * CI: Azure publish results (#6514) * deprecate metrics pkg (#6505) * deprecate metrics * examples * req * docs * Apply suggestions from code review Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Nicki Skafte <[email protected]> * pep8 Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Nicki Skafte <[email protected]> * [test] lr_find with bs_scale (#6422) * init test: test_lr_find_with_bs_scale * Update test_lr_finder.py * remove gpu req * try boring model * custom boring model * pep8 * fix typo * Update test_lr_finder.py * typo * typo * Update DeepSpeed docs (#6528) * Clean up 
docs and add some explicitness around stages * Apply suggestions from code review Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * fix attribute access in LightningModule.toggle_optimizer (#6513) * Update hook lifecycle (#6538) * Update hook lifecycle * Update docs/source/common/lightning_module.rst * Prune metrics base classes 2/n (#6530) * base class * extensions * chlog * _stable_1d_sort * _check_same_shape * _input_format_classification_one_hot * utils * to_onehot * select_topk * to_categorical * get_num_classes * reduce * class_reduce * tests * Custom Plugin is_distributed (#6537) * return from plugin * dont return for tpu * refactor reading env defaults (#6510) * change tests * fix * test * _defaults_from_env_vars Co-authored-by: Carlos Mocholí <[email protected]> * Prune metric: helpers and inputs 3/n (#6547) * _basic_input_validation * _check_shape_and_type_consistency * _check_num_classes_binary * _check_num_classes_mc * _check_num_classes_ml * _check_top_k * _check_classification_inputs * _input_format_classification * _reduce_stat_scores * DataType * rest * flake8 * chlog * prune warning & deprecation wrapper (#6540) * docs * wrapper * test * count * flake8 * Add outputs param for `on_val/test_epoch_end` hooks (#6120) * add outputs param for on_val/test_epoch_end hooks * update changelog * fix warning message * add custom call hook * cache logged metrics * add args to docstrings * use warning cache * add utility method for param in sig check * Update CHANGELOG.md Co-authored-by: Jirka Borovec <[email protected]> * update docstring * add test for eval epoch end hook * add types and replace model ref * add deprecation test * fix test fx name * add model hooks warning * add old signature model to tests * add clear warning cache * sopport args param * update tests * add tests for model hooks * code suggestions * add signature utils * fix pep8 issues * fix pep8 issues * fix outputs issue * fix tests * code 
fixes * fix validate test * test Co-authored-by: Jirka Borovec <[email protected]> * [doc] Add Zero Grad `set_to_none=True` trick (#6548) * add trick to doc * update * update path * Update docs/source/benchmarking/performance.rst Co-authored-by: Rohit Gupta <[email protected]> Co-authored-by: Rohit Gupta <[email protected]> * fix deprecation wrapper & tests (#6553) * fix deprecation wrapper & tests * flake8 * prune metric: accuracy 4/n (#6515) * prune accuracy * chlog * flake8 * Apply suggestions from code review Co-authored-by: Nicki Skafte <[email protected]> * wrap * test * test * fix Co-authored-by: Nicki Skafte <[email protected]> * Prune metrics: AUC & AUROC (#6572) * class: AUC AUROC * func: auc auroc * format * tests * [doc] Update Dict Train Loader doc. (#6579) * update doc * update example * Prune metrics: precision & recall 6/n (#6573) * avg precision * precision * recall * curve * tests * chlog * isort * fix * Update Changelog for v1.2.4 (#6581) * Update changelog for v1.2.4 * lagacy v1.2.4 * prune duplicates from changelog Co-authored-by: Jirka Borovec <[email protected]> * [Fix] Move init dist connection into the setup function (#6506) * Move connection setup into the setup function. 
Call setup hook after we set up the accelerator * Added CHANGELOG.md * fix setup order in callback test * fix input arguments in test * Mock distributed function, remove protection to turn into training type hook * Remove import * Add missing mock, ensure custom plugin does not create children process * Skip test on windows * Update deepspeed to init connection in setup * Do not initialize distributed module * Move DeepSpeed tests to special tests since dist communication is being set up * Special the test to see if this fixes CI * Delete accelerator connector test to see if its causing build to fail * Delete deepspeed test * Revert "Delete accelerator connector test to see if its causing build to fail" This reverts commit edde60b8 * Revert "Delete deepspeed test" This reverts commit 9d317429 * Reverse hook * Reverse setup hooks to debug again * Add todo so i know where i left off * For single device move in pre_dispatch after setup function * Add additional model to device hook if any additional parameters have been set * See if we can enable deepspeed tests * Revert "See if we can enable deepspeed tests" This reverts commit b5450def * See if this hook approach works * Introduce new granular hooks * Remove import, fix tpu spawn by moving the function to setup * Added missing special test Co-authored-by: Adrian Wälchli <[email protected]> * Fix all_gather for tpu_cores=8 (#6587) * Update Gradient Clipping for TPU Accelerator (#6576) * NGC container PoC (#6187) * add NVIDIA flows * push * pull * ... * extras * ci prune * fix * tag * . 
* list * Automatically set sync_batchnorm for training_type_plugin (#6536) Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: Roger Shieh <[email protected]> Co-authored-by: Kaushik Bokka <[email protected]> * Prune metrics: other classification 7/n (#6584) * confusion_matrix * iou * f_beta * hamming_distance * stat_scores * tests * flake8 * chlog * fixing examples (#6600) * try Azure * -e * path * Add AMP for validation, prediction and testing (#6565) * Add Tests for val and test-steps * Add native AMP * pep8 tests * pep8 plugin * changelog * Add trainer.predict config validation (#6543) Co-authored-by: Carlos Mocholí <[email protected]> * Add DDP Spawn being default for Multi GPUs (#6292) * Move profiler tests (#6619) * drop mypy from .pre-commit-config.yaml (#6542) * Clean utilities/argparse and add missing tests (#6607) * Allow training type plugin to delay optimizer creation (FSDP 2/n) (#6331) * Allow training_type_plugin to delay optimizer configure * Add missing references to trainer, add a CPU accelerator based test * Add teardown method to BaseProfiler. (#6370) Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: ananthsub <[email protected]> * refactoring setup (#6590) * refactoring setup * . * docs * flake8 * hotfix: mock examples (#6632) * mock examples * drop from GA * [refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633) * add setup * update * updates on comment * Minor changes * Extra import * Docs Co-authored-by: Carlos Mocholi <[email protected]> * fix comparing versions (#6434) * fix comparing versions * chlog * . * ... 
* datasets * Prune metrics: regression 8/n (#6636) * explained_variance * tests * mean_absolute_error * mean_squared_error * mean_relative_error * mean_squared_log_error * chlog * Prune metyrics: regression 9/n (#6637) * psnr * r2score * ssim * chlog * Refactor base profilers 3/5 (#6621) Co-authored-by: tchaton <[email protected]> * prune metrics: info retrieval (#6649) * Flash predict step (#6577) * add predict_step * Update predict_loop.py * Update trainer.py * Update trainer.py * resolve bugs * update * update * update * resolve bug * resolve some failing tests * udpate tests * update * resolve tests * add a test * remove typo * add a test for attachement * update * changed to on_train_dataloader * remove __flash_special_attr__ * resolve tests * update * update * update * update on comments * Update pytorch_lightning/trainer/data_loading.py Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Justus Schock <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> * fix back-compatibility for Accel (#6655) * Refactor PyTorch profiler 4/5 (#6349) Co-authored-by: thomas chaton <[email protected]> * Add PyTorch 1.8 Profiler 5/5 (#6618) * Refactor profilers * Update PassThrough * WIP - This is broken and will change * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: thomas chaton <[email protected]> * resolve tests * resolve tests * find output * try something * update * add support for test and predict * update * update * use getattr * test * test * update * tests * update * update * update * update * update * remove file * update * update * update * update * update * test * update# * update * update tests * update * add suport for 1.8 * rename records * add support for 1.8 * update * resolve flake8 * resolve test * Refactor basic profilers * Fixes * Unused import * Introduce setup * Profile on all ranks. Print to stdout on 0 * Introduce dirpath + filename * CHANGELOG * Add tests. 
Address comments * add `on_run_stage_setup` * add on_run_stage_setup function * update * add test for RegisterRecordFunction * update lightnng flow direction * move variable to private * remove trace * Undo code that should be in 3/4 * Multi-stage multi-rank * 2/5 changes * Pass stage in __del__ * Remove TODOs * Describe on_evaluation_end. Add tests * Typo * Address comments * deepcopy tests * Advanced teardown * Fix teardown test * Fix tests * Minor change * Update CHANGELOG.md * Fix test * Quick fixes * Fix 6522 * resolve ddp tests * resolve tests * resolve some tests …

SeanNaren added 2 commits February 21, 2021 13:26

Expose deepspeed config parameters to init function due to instability in parameters

bf9cd51

See if tests can run on normal CI, without special tests

bffb11c
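
For context on the first commit above, here is a minimal sketch of the kind of FP16 loss-scaling settings the PR description refers to, written as a plain DeepSpeed-style config dict. The helper name `build_fp16_config` is hypothetical, and the default values are illustrative assumptions, not necessarily the defaults merged in this PR. The key names are assumed to follow the `fp16` section of DeepSpeed's JSON config.

```python
from typing import Any, Dict


def build_fp16_config(
    loss_scale: float = 0,
    initial_scale_power: int = 16,
    loss_scale_window: int = 1000,
    hysteresis: int = 2,
    min_loss_scale: float = 1,
) -> Dict[str, Any]:
    """Hypothetical helper: build the ``fp16`` section of a DeepSpeed-style config.

    Defaults here are illustrative assumptions, not the values merged in the PR.
    """
    return {
        "fp16": {
            "enabled": True,
            # 0 is assumed to select dynamic loss scaling; any other value is a fixed scale.
            "loss_scale": loss_scale,
            # With dynamic scaling, the initial scale is 2 ** initial_scale_power.
            "initial_scale_power": initial_scale_power,
            # Number of steps over which the dynamic scale is raised or lowered.
            "loss_scale_window": loss_scale_window,
            # Delay (in overflow events) before the dynamic scale is reduced.
            "hysteresis": hysteresis,
            # Lower bound for the dynamic loss scale.
            "min_loss_scale": min_loss_scale,
        }
    }


# A smaller initial scale is one knob a user might try when the loss quickly turns to NaN.
config = build_fp16_config(initial_scale_power=8)
print(config["fp16"])
```

Exposing each value as a keyword argument means a user only has to override the one setting that causes instability, instead of supplying a complete config.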

SeanNaren added the bug (Something isn't working) label Feb 21, 2021

SeanNaren added this to the 1.2.x milestone Feb 21, 2021

SeanNaren self-assigned this Feb 21, 2021

SeanNaren requested review from awaelchli, Borda, carmocca, justusschock, tchaton and williamFalcon as code owners February 21, 2021 13:32

SeanNaren added the 3rd party (Related to a 3rd-party) and distributed (Generic distributed-related topic) labels Feb 21, 2021

Add changelog

973b43c

SeanNaren commented Feb 21, 2021

tests/plugins/test_deepspeed_plugin.py

carmocca approved these changes Feb 21, 2021

tests/plugins/test_deepspeed_plugin.py

pytorch_lightning/plugins/training_type/deepspeed.py (outdated)

Update pytorch_lightning/plugins/training_type/deepspeed.py

5a536cf

Co-authored-by: Carlos Mocholí <[email protected]>

kaushikb11 approved these changes Feb 21, 2021

awaelchli approved these changes Feb 21, 2021

Borda approved these changes Feb 21, 2021

Borda merged commit 432e563 into master Feb 21, 2021

Borda deleted the fix/fp16_enable branch February 21, 2021 20:43
