Remove call_configure_sharded_model lifecycle property #9612

Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #9612     +/-  ##
=========================================
- Coverage      93%      89%      -4%
=========================================
  Files         179      179
  Lines       15307    15306       -1
=========================================
- Hits        14199    13583     -616
- Misses       1108     1723     +615
The branch was force-pushed from e1071bb to 22fd170.
Multiple calls to configure_sharded_model should be fine for DeepSpeed as well, since there is a guard in place that avoids re-partitioning parameters which have already been partitioned: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L499

@ananthsub is it important to update the docs now, or should we wait until further changes are made to support a persistent model state across Trainer stages?
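To illustrate the guard pattern being referenced, here is a simplified sketch (not DeepSpeed's actual partitioning code): parameters are tagged once they have been partitioned, and subsequent calls skip anything that already carries the tag, which is what makes repeated configure_sharded_model calls safe.

```python
# Simplified sketch of the guard idea only -- not DeepSpeed's implementation.
# A marker attribute is set on each parameter after it is partitioned, so a second
# pass over the same module skips work that was already done.
import torch.nn as nn


def partition_parameters_once(module: nn.Module) -> None:
    for param in module.parameters():
        if getattr(param, "_already_partitioned", False):
            continue  # guard: this parameter was partitioned by an earlier call
        # ... actual sharding/partitioning of `param` would happen here ...
        param._already_partitioned = True  # hypothetical marker, for illustration only
```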
LGTM!
Hey @ananthsub, any progress on the failing Azure tests? Best,
The branch was force-pushed from 22fd170 to 47def52.
@tchaton I'm unable to reproduce the failures locally. Is there any interleaving of tests that could cause this to fail in CI but not locally?
@ananthsub I can reproduce this locally; the first test that fails seems to be test_deepspeed_skip_backward_raises (full output below). When calling the tests individually, they pass. This is not unfamiliar, as we have had this a few times before. Since distributed logic is all globally shared, e.g. in the torch package, things can leak from one test to the other (that is my interpretation).

tests/plugins/test_deepspeed_plugin.py::test_deepspeed_skip_backward_raises FAILED [ 54%]
tests/plugins/test_deepspeed_plugin.py::test_deepspeed_warn_train_dataloader_called SKIPPED (Requires: [Special execution]) [ 54%]
tests/plugins/test_deepspeed_plugin.py::test_deepspeed_setup_train_dataloader SKIPPED (Requires: [Special execution]) [ 55%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModel] FAILED [ 56%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModelNoForward] FAILED [ 56%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModelComplexBuffer] PASSED [ 57%]
tests/plugins/test_double_plugin.py::test_double_precision_ddp FAILED
tests/plugins/test_sharded_plugin.py::test_configure_ddp FAILED [ 77%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded[DDPShardedPlugin] FAILED [ 77%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded[DDPSpawnShardedPlugin] FAILED [ 78%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[1-params0-0] FAILED [ 78%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[1-params1-128] FAILED [ 79%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[2-params0-0] FAILED [ 80%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[2-params1-128] FAILED [ 80%]
tests/plugins/test_sharded_plugin.py::test_block_backward_sync PASSED [ 81%]
tests/plugins/test_single_device_plugin.py::test_single_cpu PASSED [ 81%]
tests/plugins/test_single_device_plugin.py::test_single_gpu FAILED

Looking at the error messages, I see something disturbing: the error message is originating from deepspeed 🤣 No clue how that can be. I need to investigate the code path.
Do you remember how this was solved before?
There is no solution without something like #8080. The workaround is to run the tests per process (what the special tests do).
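As a rough illustration of that per-process workaround (the exact command Lightning's special-test setup uses may differ), each selected test can be launched in its own fresh interpreter so that globally shared distributed state cannot leak between tests:

```python
# Hedged sketch of the "one process per test" workaround; the test IDs below are
# taken from the failing run above and the invocation itself is illustrative.
import subprocess
import sys

ISOLATED_TESTS = [
    "tests/plugins/test_deepspeed_plugin.py::test_deepspeed_skip_backward_raises",
    "tests/plugins/test_sharded_plugin.py::test_configure_ddp",
]

for test in ISOLATED_TESTS:
    # A new Python process per test means torch.distributed / DeepSpeed globals
    # initialized by one test cannot affect the next one.
    subprocess.run([sys.executable, "-m", "pytest", test, "-v"], check=True)
```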
What does this PR do?
Part of #8722
Changes:
- Call configure_sharded_model in each of fit, validate, test, etc. This avoids subtle side-effects from prior runs and makes the call order consistent.
- configure_sharded_model should be implemented idempotently (see the sketch below). The update to TestFSDPModel demonstrates how one can check whether the layers are already wrapped with FSDP and return early if so. This is in the same spirit as "Avoid rewrapping LightningModules in plugins" (#8593), but in user-land. @SeanNaren
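A minimal sketch of what an idempotent configure_sharded_model can look like, in the spirit of the TestFSDPModel update (the module and layer names here are illustrative, and FairScale's FullyShardedDataParallel is assumed to be available):

```python
# Minimal sketch, not the exact TestFSDPModel code: the hook checks whether the
# layer is already wrapped and returns early, so being called again for a later
# Trainer stage (validate/test after fit) does not re-wrap anything.
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP  # assumes fairscale is installed
from pytorch_lightning import LightningModule


class ShardedExampleModel(LightningModule):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def configure_sharded_model(self) -> None:
        if isinstance(self.layer, FSDP):
            return  # already wrapped by a previous Trainer stage
        self.layer = FSDP(self.layer)
```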
Does your PR introduce any breaking changes? If yes, please list them.
- configure_sharded_model is now called unconditionally.
- Removes call_configure_sharded_model_hook on the LightningModule (which is not officially part of the LightningModule API).

Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃