Remove call_configure_sharded_model lifecycle property #9612

Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #9612     +/-  ##
=========================================
- Coverage      93%      89%      -4%
=========================================
  Files         179      179
  Lines       15307    15306       -1
=========================================
- Hits        14199    13583     -616
- Misses       1108     1723     +615
The branch was force-pushed from e1071bb to 22fd170.
Multiple calls to configure_sharded_model should be fine for DeepSpeed as well, since there is a guard in place that avoids re-partitioning parameters which have already been partitioned: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L499

@ananthsub is it important to update the docs now, or should we wait until further changes are made to support a persistent model state across Trainer stages?
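To illustrate the guard pattern being referenced, here is a simplified sketch (not DeepSpeed's actual partitioning code): parameters are tagged once they have been partitioned, and subsequent calls skip anything that already carries the tag, which is what makes repeated configure_sharded_model calls safe.

```python
# Simplified sketch of the guard idea only -- not DeepSpeed's implementation.
# A marker attribute is set on each parameter after it is partitioned, so a second
# pass over the same module skips work that was already done.
import torch.nn as nn


def partition_parameters_once(module: nn.Module) -> None:
    for param in module.parameters():
        if getattr(param, "_already_partitioned", False):
            continue  # guard: this parameter was partitioned by an earlier call
        # ... actual sharding/partitioning of `param` would happen here ...
        param._already_partitioned = True  # hypothetical marker, for illustration only
```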
LGTM!
Hey @ananthsub, any progress on the failing Azure tests? Best,
The branch was force-pushed from 22fd170 to 47def52.
@tchaton I'm unable to reproduce the failures locally. Is there any interleaving of tests that could cause this to fail in CI but not locally?
@ananthsub I can reproduce this locally; the first test that fails seems to be test_deepspeed_skip_backward_raises (full output below). When calling the tests individually, they pass. This is not unfamiliar, as we have had this a few times before. Since distributed logic is all globally shared, e.g. in the torch package, things can leak from one test to the other (that is my interpretation).

tests/plugins/test_deepspeed_plugin.py::test_deepspeed_skip_backward_raises FAILED [ 54%]
tests/plugins/test_deepspeed_plugin.py::test_deepspeed_warn_train_dataloader_called SKIPPED (Requires: [Special execution]) [ 54%]
tests/plugins/test_deepspeed_plugin.py::test_deepspeed_setup_train_dataloader SKIPPED (Requires: [Special execution]) [ 55%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModel] FAILED [ 56%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModelNoForward] FAILED [ 56%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModelComplexBuffer] PASSED [ 57%]
tests/plugins/test_double_plugin.py::test_double_precision_ddp FAILED
tests/plugins/test_sharded_plugin.py::test_configure_ddp FAILED [ 77%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded[DDPShardedPlugin] FAILED [ 77%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded[DDPSpawnShardedPlugin] FAILED [ 78%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[1-params0-0] FAILED [ 78%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[1-params1-128] FAILED [ 79%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[2-params0-0] FAILED [ 80%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[2-params1-128] FAILED [ 80%]
tests/plugins/test_sharded_plugin.py::test_block_backward_sync PASSED [ 81%]
tests/plugins/test_single_device_plugin.py::test_single_cpu PASSED [ 81%]
tests/plugins/test_single_device_plugin.py::test_single_gpu FAILED

Looking at the error messages, I see something disturbing: the error message is originating from deepspeed 🤣 No clue how that can be. I need to investigate the code path.
Do you remember how this was solved before?
There is no solution without something like #8080. The workaround is to run the tests per process (what the special tests do).
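As a rough illustration of that per-process workaround (the exact command Lightning's special-test setup uses may differ), each selected test can be launched in its own fresh interpreter so that globally shared distributed state cannot leak between tests:

```python
# Hedged sketch of the "one process per test" workaround; the test IDs below are
# taken from the failing run above and the invocation itself is illustrative.
import subprocess
import sys

ISOLATED_TESTS = [
    "tests/plugins/test_deepspeed_plugin.py::test_deepspeed_skip_backward_raises",
    "tests/plugins/test_sharded_plugin.py::test_configure_ddp",
]

for test in ISOLATED_TESTS:
    # A new Python process per test means torch.distributed / DeepSpeed globals
    # initialized by one test cannot affect the next one.
    subprocess.run([sys.executable, "-m", "pytest", test, "-v"], check=True)
```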
What does this PR do?
Part of #8722
Changes:
- Call configure_sharded_model in each of fit, validate, test, etc. This avoids subtle side-effects from prior runs and makes the call order consistent.
- configure_sharded_model should be implemented idempotently (see the sketch below). The update to TestFSDPModel demonstrates how one can check whether the layers are already wrapped with FSDP and return early if so. This is in the same spirit as "Avoid rewrapping LightningModules in plugins" (#8593), but in user-land. @SeanNaren
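A minimal sketch of what an idempotent configure_sharded_model can look like, in the spirit of the TestFSDPModel update (the module and layer names here are illustrative, and FairScale's FullyShardedDataParallel is assumed to be available):

```python
# Minimal sketch, not the exact TestFSDPModel code: the hook checks whether the
# layer is already wrapped and returns early, so being called again for a later
# Trainer stage (validate/test after fit) does not re-wrap anything.
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP  # assumes fairscale is installed
from pytorch_lightning import LightningModule


class ShardedExampleModel(LightningModule):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def configure_sharded_model(self) -> None:
        if isinstance(self.layer, FSDP):
            return  # already wrapped by a previous Trainer stage
        self.layer = FSDP(self.layer)
```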
Does your PR introduce any breaking changes? If yes, please list them.
- configure_sharded_model is now called unconditionally.
- Removes call_configure_sharded_model_hook on the LightningModule (which is not officially part of the LightningModule API).

Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃