
Support sharded optimizer state dumping outside of sharded strategies #14208

Merged: 17 commits from feature/oss-state-outside-ddp into master on Aug 26, 2022

Conversation

@awaelchli (Contributor) commented on Aug 15, 2022

What does this PR do?

Fixes #6387
Redo of #11867

Motivation: a user wants to checkpoint a sharded optimizer outside of the DDP sharded strategy, i.e., without using that strategy.

This PR implements it exactly as proposed in #6387 and #11867. However, this leaks Fairscale-specific logic into the base strategy. We have a few options to mitigate that:

Option 0:
Do nothing about the leak and move forward with this PR as is.

Option A:
Only apply the current approach to the native sharded optimizer from torch (ZeroRedundancyOptimizer). For Fairscale, the user is still forced to use the dedicated strategy, and we keep the consolidation logic in the Fairscale sharded strategy.

Option B:
Instead of moving the logic all the way up into the base strategy, move it into the parallel strategy. This means users would have to use at least a strategy where parallel execution is assumed. Otherwise, the logic stays the same.
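For context, a minimal sketch of the kind of consolidation logic being discussed (an illustration only, not the actual diff; the free-standing `optimizer_state` function and the import guards are assumptions here):

```python
# Sketch only: what "dumping sharded optimizer state" boils down to.
# The hook name and its placement are illustrative, not the exact PR diff.
from typing import Any, Dict

from torch.optim import Optimizer


def optimizer_state(optimizer: Optimizer) -> Dict[str, Any]:
    """Return a full optimizer state dict, consolidating sharded optimizers first."""
    try:
        from fairscale.optim import OSS
    except ImportError:
        OSS = None

    try:
        from torch.distributed.optim import ZeroRedundancyOptimizer
    except ImportError:
        ZeroRedundancyOptimizer = None

    if OSS is not None and isinstance(optimizer, OSS):
        # gather the per-rank shards onto rank 0 before reading the state dict
        optimizer.consolidate_state_dict(recipient_rank=0)
    elif ZeroRedundancyOptimizer is not None and isinstance(optimizer, ZeroRedundancyOptimizer):
        optimizer.consolidate_state_dict(to=0)

    # after consolidation only the recipient rank holds the full state,
    # which is fine because only rank 0 writes the checkpoint
    return optimizer.state_dict()
```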

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @justusschock @awaelchli @rohitgr7 @akihironitta

@github-actions bot added the "pl" (Generic label for PyTorch Lightning package) label on Aug 15, 2022
@awaelchli changed the title from "implement" to "Support for OSS optimizers when dumping checkpoints outside of sharded strategies" on Aug 15, 2022
@awaelchli added the "strategy: fairscale sharded (removed)" (Sharded Data Parallel), "feature" (Is an improvement or enhancement), and "refactor" labels on Aug 15, 2022
@awaelchli changed the title from "Support for OSS optimizers when dumping checkpoints outside of sharded strategies" to "Support for sharded optimizer state dumping outside of sharded strategies" on Aug 15, 2022
@awaelchli changed the title from "Support for sharded optimizer state dumping outside of sharded strategies" to "Support sharded optimizer state dumping outside of sharded strategies" on Aug 15, 2022
@awaelchli force-pushed the feature/oss-state-outside-ddp branch from fffb49c to 3ab0c6a on August 15, 2022 12:48
@awaelchli marked this pull request as ready for review on August 16, 2022 15:42
@awaelchli self-assigned this on Aug 16, 2022
@justusschock (Member) left a comment:

Implementation- and test-wise this is fine; I just really don't like the solution we opted for (I know I'm late to the party, but I missed the issue).

This feels like negating all the efforts we made during the refactor to separate concerns.

src/pytorch_lightning/strategies/strategy.py (outdated, resolved)
@mergify bot removed the "has conflicts" label on Aug 18, 2022
@awaelchli added this to the pl:1.8 milestone on Aug 18, 2022
@mergify bot added the "ready" (PRs ready to be merged) label and removed the "has conflicts" and "ready" labels on Aug 18, 2022
@rohitgr7 (Contributor) left a comment:

I think this might be breaking the strategy encapsulation, and now that PyTorch has copied ZeRO from DeepSpeed, we need to provide special support for it.

Why not let users override the Strategy if they are mixing configurations (see the sketch below)? Or use DeepSpeed stage 1 if they want to use ZeRO or some special wrapper?

Also, I remember @otaj is working on a checkpoint-related feature that will let users choose how the checkpoint is created. So maybe that could be a better way to handle such a configuration?

tests/tests_pytorch/strategies/test_sharded_strategy.py (outdated, resolved)
tests/tests_pytorch/strategies/test_ddp_strategy.py (outdated, resolved)
tests/tests_pytorch/strategies/test_ddp_strategy.py (outdated, resolved)
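To make the suggested alternative concrete, a rough sketch of such a user-side override, assuming the existing DDPStrategy base class and an `optimizer_state` hook as the extension point (class and hook names here are assumptions for illustration):

```python
# Rough sketch of the user-side workaround: subclass the strategy and
# consolidate the sharded optimizer state yourself before it is saved.
from typing import Any, Dict

from pytorch_lightning.strategies import DDPStrategy
from torch.optim import Optimizer


class OSSCheckpointDDPStrategy(DDPStrategy):
    """Hypothetical custom strategy that consolidates fairscale OSS state on save."""

    def optimizer_state(self, optimizer: Optimizer) -> Dict[str, Any]:
        from fairscale.optim import OSS

        if isinstance(optimizer, OSS):
            # gather all shards onto rank 0 so the checkpoint contains the full state
            optimizer.consolidate_state_dict(recipient_rank=0)
        return optimizer.state_dict()


# usage (hypothetical): Trainer(strategy=OSSCheckpointDDPStrategy(), ...)
```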
@awaelchli requested a review from otaj as a code owner on August 22, 2022 12:29
@awaelchli (Contributor, Author) commented:

@rohitgr7

I think this might be breaking the strategy encapsulation, and now since PyTorch copied Zero from deepspeed, we need to provide special support for it.

I agree, this does not fit well into our strategy design, and personally I don't think we should do it this way either. But in sprint planning it was determined that this issue is important to work on, and I was assigned to it, so I am going to complete the task regardless.

Also, I remember @otaj is working on checkpoint-related part which will let users choose how the checkpoint is created. So maybe that could be a better way to handle such configuration?

It is not the responsibility of the checkpoint callback to know WHAT to save, so it doesn't belong there.

@rohitgr7 (Contributor) commented:

But in sprint planning it was determined that it is important to work on this issue and I was assigned to it.

Can we discuss this again today or in the retro?

It is not the responsibility of the checkpoint callback to know WHAT to save. Therefore it won't belong there.

I didn't say checkpoint callback.

@Borda assigned awaelchli and unassigned awaelchli on Aug 22, 2022
@awaelchli (Contributor, Author) commented:

Should this PR and the related issue be closed?

  • 👀 Yes, close it
  • 👍 No, move forward with this solution

@carmocca (Contributor) commented:

This PR resolves an existing issue in our codebase. Whether the patch is perfect design-wise should be secondary to the improvement in stability. The latter can be improved with time or on a follow-up as designs mature, especially if there's no clear alternative at the moment. My vote is 👍

@awaelchli enabled auto-merge (squash) on August 23, 2022 21:01
@codecov bot commented on Aug 26, 2022

Codecov Report

Merging #14208 (fd4cab8) into master (a01e016) will increase coverage by 15%.
The diff coverage is 71%.

@@            Coverage Diff            @@
##           master   #14208     +/-   ##
=========================================
+ Coverage      61%      76%    +15%     
=========================================
  Files         332      332             
  Lines       26852    26883     +31     
=========================================
+ Hits        16421    20428   +4007     
+ Misses      10431     6455   -3976     

@awaelchli merged commit e67842d into master on Aug 26, 2022
@awaelchli deleted the feature/oss-state-outside-ddp branch on August 26, 2022 07:58
Labels
feature (Is an improvement or enhancement), pl (Generic label for PyTorch Lightning package), ready (PRs ready to be merged), refactor, strategy: fairscale sharded (removed) (Sharded Data Parallel)
Projects
No open projects (Status: Done)
Development

Successfully merging this pull request may close these issues.

Support for sharded optimizers when dumping checkpoints outside of the DDP sharded training type plugin
5 participants