Fix gradient accumulation for ShardedDataParallel #9122
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #9122      +/-  ##
=========================================
- Coverage      93%     89%       -4%
=========================================
  Files         179     179
  Lines       15303   15317       +14
=========================================
- Hits        14200   13590      -610
- Misses       1103    1727      +624
LGTM!
Thanks for the PR @ananthsub! What we can ensure in a test is that the context manager is called correctly during grad accumulation, maybe by mocking/patching the function?
@tchaton I'll add a test first before merging. Still on my todo list.
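A hedged sketch of what such a mock-based check could look like. The `DDPShardedPlugin()` constructor call, the `_model` attribute patched here, and the `block_backward_sync` hook are assumptions about the plugin internals at the time, not details confirmed in this thread:

```python
from unittest import mock

from fairscale.nn.data_parallel import ShardedDataParallel
from pytorch_lightning.plugins import DDPShardedPlugin


def test_no_sync_entered_during_grad_accumulation():
    plugin = DDPShardedPlugin()
    # Stand-in for the wrapped model; spec= makes the isinstance check pass.
    model = mock.MagicMock(spec=ShardedDataParallel)
    with mock.patch.object(plugin, "_model", model):
        with plugin.block_backward_sync():
            pass
    # The plugin should have entered fairscale's no_sync context exactly once.
    model.no_sync.assert_called_once()
```

Mocking keeps the test single-process, so it can run without launching a distributed job while still guarding against a future regression in which the context manager stops being used.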
Force-pushed from eda6d7c to 024cc87
Force-pushed from 64e3e7c to bc9b7ee
Force-pushed from e163f72 to a727b24
for more information, see https://pre-commit.ci

* Fix gradient accumulation for `ShardedDataParallel`
* Update changelog
* Update pytorch_lightning/plugins/training_type/sharded.py
* add test
* Update test_sharded_plugin.py
* Update test_sharded_plugin.py
* Update test_sharded_plugin.py
What does this PR do?
Follow-up to this comment: #9101 (comment)

`ShardedDataParallel` supports the `no_sync` context manager too: https://fairscale.readthedocs.io/en/latest/_modules/fairscale/nn/data_parallel/sharded_ddp.html#ShardedDataParallel.no_sync

But we're not taking advantage of it right now because of this check: https://github.com/PyTorchLightning/pytorch-lightning/blob/9d62f248476c6358d8707188f7b20fafa79f8a4f/pytorch_lightning/plugins/training_type/parallel.py#L131-L133

since the model here is wrapped with `ShardedDataParallel`, not `DistributedDataParallel`. Breaking the inheritance chain in these plugins will make these opportunities clearer.
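A minimal sketch of the shape such a fix could take: overriding the accumulation hook in the sharded plugin so it enters fairscale's `no_sync` context. The `block_backward_sync` name and the `self.model` attribute mirror the plugin code linked above, but treat the subclass structure here as an illustration rather than the final implementation:

```python
from contextlib import contextmanager

from fairscale.nn.data_parallel import ShardedDataParallel
from pytorch_lightning.plugins import DDPShardedPlugin


class ShardedPluginWithNoSync(DDPShardedPlugin):  # illustrative subclass name
    @contextmanager
    def block_backward_sync(self):
        """Skip gradient all-reduce on backward during accumulation steps.

        Checks for the fairscale wrapper instead of DistributedDataParallel,
        which is what the base ParallelPlugin check looks for.
        """
        if isinstance(self.model, ShardedDataParallel):
            with self.model.no_sync():
                yield None
        else:
            yield None
```

With an override like this, Lightning's gradient accumulation path skips the per-step gradient reduction and only syncs on the final accumulation step, which is exactly the communication saving `no_sync` exists to provide.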
@SeanNaren n00b question: do you have suggestions for how to verify this with a unit/integration test, especially to prevent future regressions?
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃