Support sharded optimizer state dumping outside of sharded strategies #14208
Conversation
Force-pushed from fffb49c to 3ab0c6a
Implementation- and test-wise this is fine; I just really don't like the solution we opted for (I know I'm late to the party, but I missed the issue).
This feels like it negates all the effort we made during the refactor to separate concerns.
I think this might be breaking the strategy encapsulation, and now that PyTorch has copied ZeRO from DeepSpeed, we need to provide special support for it.
Why not let users override the Strategy if they are mixing configurations? Or use DeepSpeed stage 1 if they want ZeRO, or some special wrapper?
Also, I remember @otaj is working on the checkpoint-related part that will let users choose how the checkpoint is created. Maybe that would be a better way to handle such a configuration?
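For context, a user-side workaround along the lines suggested here (overriding the strategy rather than touching the base class) could look roughly like the sketch below. This is not part of the PR; it assumes torch's ZeroRedundancyOptimizer and the optimizer_state hook that strategies expose, and the class name is made up.

```python
from typing import Any, Dict, Optional

import torch
from pytorch_lightning.strategies import DDPStrategy
from torch.distributed.optim import ZeroRedundancyOptimizer


class ZeroAwareDDPStrategy(DDPStrategy):
    """Hypothetical user-defined strategy that knows how to dump ZeRO state."""

    def optimizer_state(self, optimizer: torch.optim.Optimizer) -> Optional[Dict[str, Any]]:
        if isinstance(optimizer, ZeroRedundancyOptimizer):
            # The optimizer state is sharded across ranks; gather it on rank 0
            # before producing the state dict for the checkpoint.
            optimizer.consolidate_state_dict(to=0)
            # The consolidated state only exists on the recipient rank.
            return optimizer.state_dict() if self.is_global_zero else None
        return optimizer.state_dict()


# Usage: pass the custom strategy instance to the Trainer, e.g.
# trainer = pl.Trainer(strategy=ZeroAwareDDPStrategy(), accelerator="gpu", devices=2)
```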
I agree, this does not fit well into our strategy design, and it is not my personal preference to do it this way. But in sprint planning it was determined that this issue is important to work on, and I was assigned to it, so I am going to complete the task regardless.
It is not the responsibility of the checkpoint callback to know WHAT to save, so it does not belong there.
can we discuss this again today or in the retro?
I didn't say checkpoint callback.
Should this PR and the related issue be closed?
This PR resolves an existing issue in our codebase. Whether the patch is perfect design-wise should be secondary to the improvement in stability. The design can be refined over time or in a follow-up as it matures, especially if there's no clear alternative at the moment. My vote is 👍
Codecov Report
@@           Coverage Diff            @@
##           master   #14208    +/-  ##
========================================
+ Coverage      61%      76%    +15%
========================================
  Files         332      332
  Lines       26852    26883     +31
========================================
+ Hits        16421    20428   +4007
+ Misses      10431     6455   -3976
What does this PR do?
Fixes #6387
Redo of #11867
Motivation: the user wants to checkpoint the state of a sharded optimizer outside of the DDP sharded strategy, i.e., when not using that strategy.
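To make the scenario concrete, a minimal sketch of the motivating setup might look like the following (illustrative only; the model and hyperparameters are made up): a ZeRO-sharded optimizer is created in configure_optimizers while the Trainer runs a plain, non-sharded DDP strategy, so nothing consolidates the sharded optimizer state before checkpointing.

```python
import pytorch_lightning as pl
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        # The optimizer state is sharded across ranks, but the plain DDP
        # strategy has no logic to consolidate it when a checkpoint is saved.
        return ZeroRedundancyOptimizer(
            self.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3
        )


trainer = pl.Trainer(strategy="ddp", accelerator="gpu", devices=2)
# trainer.fit(LitModel(), train_dataloaders=...)
```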
This PR implements it exactly as proposed in #6387 and #11867. However, this leaks fairscale-specific logic into the base strategy. We have a few options to mitigate the issue (a sketch of the hook in question follows the options below):
Option 0:
Do not care about this. Move forward with this PR as is.
Option A:
Only apply the current approach to the native sharded optimizer from torch (ZeroRedundancyOptimizer). For fairscale, the user is still required to use the dedicated strategy, and we keep the consolidation logic in the fairscale sharded strategy.
Option B:
Instead of moving it all the way up into the base strategy, move it into the parallel strategy. This means a strategy that assumes parallel execution would be required at a minimum. Otherwise, the logic stays the same.
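For illustration, the kind of hook these options are weighing could look roughly like the sketch below, wherever it ends up living (the base Strategy for Option 0, ParallelStrategy for Option B). The fairscale import guard and the mixin name are assumptions for the sketch, not the code in this PR.

```python
from typing import Any, Dict, Optional

from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.optim import Optimizer

try:
    from fairscale.optim import OSS
except ImportError:  # fairscale is an optional dependency
    OSS = None


class ShardedOptimizerStateMixin:
    """Hypothetical home for the consolidation logic, mixed into a strategy class."""

    is_global_zero: bool  # provided by the strategy this is mixed into

    def optimizer_state(self, optimizer: Optimizer) -> Optional[Dict[str, Any]]:
        if OSS is not None and isinstance(optimizer, OSS):
            # fairscale's OSS shards state across ranks; this branch is the
            # fairscale-specific logic the description above worries about leaking.
            optimizer.consolidate_state_dict(recipient_rank=0)
            return optimizer.state_dict() if self.is_global_zero else None
        if isinstance(optimizer, ZeroRedundancyOptimizer):
            # torch-native ZeRO, the case Option A would restrict this to.
            optimizer.consolidate_state_dict(to=0)
            return optimizer.state_dict() if self.is_global_zero else None
        # Regular optimizers already hold their full state on every rank.
        return optimizer.state_dict()
```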
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:
Did you have fun?
I made sure I had fun coding 🙃
cc @Borda @justusschock @awaelchli @rohitgr7 @akihironitta