
Fix gradient accumulation for ShardedDataParallel #9122

Merged

Conversation

@ananthsub (Contributor) commented on Aug 26, 2021

What does this PR do?

Follow-up to this comment: #9101 (comment)

ShardedDataParallel supports the no_sync context manager too: https://fairscale.readthedocs.io/en/latest/_modules/fairscale/nn/data_parallel/sharded_ddp.html#ShardedDataParallel.no_sync

ShardedDataParallel is currently not taking advantage of it because of this check: https://github.com/PyTorchLightning/pytorch-lightning/blob/9d62f248476c6358d8707188f7b20fafa79f8a4f/pytorch_lightning/plugins/training_type/parallel.py#L131-L133

The model here is wrapped in ShardedDataParallel, not DistributedDataParallel, so the isinstance check never matches.

Breaking the inheritance chain in these plugins will make these opportunities clearer.
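
For context, here is a minimal sketch of the kind of override involved. This is illustrative only, not the exact diff in this PR; the class name is made up, and the hook name follows the linked parallel.py snippet.

```python
from contextlib import contextmanager

from fairscale.nn.data_parallel import ShardedDataParallel


# Illustrative sketch, not the exact code from this PR: a sharded plugin can enter
# ShardedDataParallel.no_sync() when gradient sync should be blocked (e.g. during
# gradient accumulation), instead of relying on the DistributedDataParallel
# isinstance check inherited from the base parallel plugin.
class DDPShardedPluginSketch:
    def __init__(self, wrapped_model):
        # `wrapped_model` is expected to be the model after ShardedDataParallel wrapping
        self.model = wrapped_model

    @contextmanager
    def block_backward_sync(self):
        """Skip gradient reduction for the wrapped model while the context is active."""
        if isinstance(self.model, ShardedDataParallel):
            with self.model.no_sync():
                yield
        else:
            yield
```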

@SeanNaren n00b question: do you have suggestions for how to verify this with a unit/integration test, especially to prevent future regressions?

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub added the "distributed" and "feature" labels on Aug 26, 2021
@ananthsub added this to the v1.4.x milestone on Aug 26, 2021
@codecov bot commented on Aug 26, 2021

Codecov Report

Merging #9122 (64e3e7c) into master (d022f6f) will decrease coverage by 4%.
The diff coverage is 50%.

❗ Current head 64e3e7c differs from the pull request's most recent head bc9b7ee. Consider uploading reports for commit bc9b7ee to get more accurate results.

@@           Coverage Diff           @@
##           master   #9122    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         179     179            
  Lines       15303   15317    +14     
=======================================
- Hits        14200   13590   -610     
- Misses       1103    1727   +624     

@tchaton (Contributor) left a comment

LGTM!

@SeanNaren (Contributor) commented:

Thanks for the PR @ananthsub!

What we could verify in a test is that the context manager is entered during gradient accumulation, perhaps by mocking/patching the function.
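
An illustrative sketch of that idea (this is not the test that eventually landed; it exercises the hypothetical DDPShardedPluginSketch from the description above rather than the real plugin):

```python
from unittest import mock

from fairscale.nn.data_parallel import ShardedDataParallel


# Illustrative sketch only: a MagicMock with spec=ShardedDataParallel passes the
# isinstance check, so we can assert that block_backward_sync enters no_sync without
# spawning any processes. DDPShardedPluginSketch is the hypothetical class from the
# sketch in the PR description above, not the real DDPShardedPlugin.
def test_block_backward_sync_enters_no_sync():
    wrapped_model = mock.MagicMock(spec=ShardedDataParallel)
    plugin = DDPShardedPluginSketch(wrapped_model)
    with plugin.block_backward_sync():
        pass
    wrapped_model.no_sync.assert_called_once()
```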

@mergify bot added the "ready" label on Aug 27, 2021
@tchaton tchaton enabled auto-merge (squash) August 27, 2021 18:14
@mergify mergify bot removed the has conflicts label Aug 27, 2021
@ananthsub ananthsub disabled auto-merge August 27, 2021 18:24
@ananthsub (Contributor, Author) commented:

@tchaton I'll add a test before merging. Still on my todo list.

@tchaton added the "bug" label on Aug 30, 2021
@ananthsub force-pushed the fix/sharded-block-backward-sync branch from eda6d7c to 024cc87 on September 22, 2021 04:54
@ananthsub ananthsub enabled auto-merge (squash) September 22, 2021 05:43
@ananthsub ananthsub disabled auto-merge September 22, 2021 06:05
@ananthsub force-pushed the fix/sharded-block-backward-sync branch from 64e3e7c to bc9b7ee on September 22, 2021 06:40
@ananthsub ananthsub enabled auto-merge (squash) September 22, 2021 06:41
@ananthsub ananthsub disabled auto-merge September 22, 2021 06:42
@ananthsub force-pushed the fix/sharded-block-backward-sync branch from e163f72 to a727b24 on September 22, 2021 08:18
@ananthsub ananthsub enabled auto-merge (squash) September 22, 2021 08:19
@ananthsub ananthsub merged commit a71be50 into Lightning-AI:master Sep 22, 2021
SeanNaren pushed a commit that referenced this pull request Sep 22, 2021
* Fix gradient accumulation for `ShardedDataParallel`

* Update changelog

* Update pytorch_lightning/plugins/training_type/sharded.py

* add test

* Update test_sharded_plugin.py

* Update test_sharded_plugin.py

* Update test_sharded_plugin.py
@ananthsub ananthsub deleted the fix/sharded-block-backward-sync branch September 22, 2021 17:00
Labels: bug, distributed, feature, ready

7 participants