Sharded state dicts save correctly when `save_weights_only=True` #19524

dimitri-voytan · 2024-02-23T21:38:53Z

What does this PR do?

Fixes #19492 for FSDP sharded state_dicts. Optimizer states now default to an empty list when they are missing from the state_dict, which can happen when the model checkpoint callback uses `save_weights_only=True`.
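
A minimal sketch of the idea, not the actual patch: the checkpoint is modeled as a plain dict, and `split_checkpoint` is a hypothetical helper name used only for illustration. It assumes a weights-only checkpoint simply lacks the `optimizer_states` key, so popping with a default of `[]` avoids a `KeyError`.

```python
from typing import Any, Dict


def split_checkpoint(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: separate model and optimizer state before saving."""
    checkpoint = dict(checkpoint)  # shallow copy so the caller's dict is untouched
    converted: Dict[str, Any] = {"model": checkpoint.pop("state_dict")}
    # With save_weights_only=True the key may be absent, so fall back to an
    # empty list instead of raising KeyError.
    optimizer_states = checkpoint.pop("optimizer_states", [])
    for idx, optim_state in enumerate(optimizer_states):
        converted[f"optimizer_{idx}"] = optim_state
    converted["metadata"] = checkpoint  # whatever remains, e.g. the epoch
    return converted


weights_only = {"state_dict": {"layer.weight": [0.1, 0.2]}, "epoch": 3}
full = {**weights_only, "optimizer_states": [{"lr": 0.01}]}

print(sorted(split_checkpoint(weights_only)))  # ['metadata', 'model']
print(sorted(split_checkpoint(full)))  # ['metadata', 'model', 'optimizer_0']
```

With the default in place, both kinds of checkpoint go through the same code path; the weights-only one just produces no `optimizer_*` entries.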

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
[ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19524.org.readthedocs.build/en/19524/


dimitri-voytan · 2024-02-27T22:18:05Z

@awaelchli

This works now: the test fails on master and passes with the corrections. It could be improved by verifying that the checkpoints are usable downstream in the Lightning ecosystem. If this is of interest, could I ask the Lightning team for some help? I don't have a lot of free time to work on this.

awaelchli

@dimitri-voytan Great fix, thanks a lot!

I integrated your test in an existing one we already had for weights_only=True, we just needed to add the parameterization for sharded checkpoints.
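
For illustration only, the parameterization could look roughly like the loop below. `fake_save` is a hypothetical stand-in for saving a checkpoint, not a Lightning API; in a pytest suite the loop would more typically be a `pytest.mark.parametrize` decorator over the same cases.

```python
import itertools
from typing import Any, Dict, List


def fake_save(state_dict_type: str, weights_only: bool) -> Dict[str, Any]:
    """Hypothetical stand-in for a checkpoint save; returns what would be written."""
    saved: Dict[str, Any] = {"layout": state_dict_type, "model": {"w": [1.0]}}
    if not weights_only:
        saved["optimizer_0"] = {"step": 10}
    return saved


# Cover both state-dict layouts, with and without optimizer state.
cases = list(itertools.product(["full", "sharded"], [True, False]))
results: List[Dict[str, Any]] = []
for state_dict_type, weights_only in cases:
    saved = fake_save(state_dict_type, weights_only)
    # A weights-only checkpoint must contain no optimizer entries.
    has_optimizer = any(key.startswith("optimizer_") for key in saved)
    assert has_optimizer == (not weights_only)
    results.append(saved)
```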

dvoytan-spark and others added 3 commits February 23, 2024 15:25

test saving sharded weights only

2e54ca5

add fix for weights only

3b35e2c

Merge branch 'Lightning-AI:master' into fix/fsdp-checkpoint

9f5db5a

github-actions bot added the pl label Feb 23, 2024

pre-commit-ci bot and others added 5 commits February 23, 2024 21:40

[pre-commit.ci] auto fixes from pre-commit.com hooks

99414fc

for more information, see https://pre-commit.ci

focus test on just full vs sharded

fbc9985

Merge branch 'fix/fsdp-checkpoint' of https://github.com/dimitri-voytan/pytorch-lightning into fix/fsdp-checkpoint

80efc9c

[pre-commit.ci] auto fixes from pre-commit.com hooks

1fe2038

for more information, see https://pre-commit.ci

Merge branch 'master' into fix/fsdp-checkpoint

108ac6b

dimitri-voytan marked this pull request as ready for review February 27, 2024 23:01

dimitri-voytan requested review from awaelchli, carmocca, justusschock, Borda and williamFalcon as code owners February 27, 2024 23:01

awaelchli added the strategy: fsdp, bug, and community labels Feb 29, 2024

awaelchli added this to the 2.2.x milestone Feb 29, 2024

awaelchli self-assigned this Feb 29, 2024

awaelchli added 2 commits March 8, 2024 03:08

reuse existing test

89a80c1

update chlog

67efd9c

mergify bot added the has conflicts label Mar 8, 2024

awaelchli approved these changes Mar 8, 2024

Merge branch 'master' into fix/fsdp-checkpoint

b27e620

mergify bot removed the has conflicts label Mar 8, 2024

fix precommit

c361075

Borda approved these changes Mar 8, 2024

mergify bot added the ready label Mar 8, 2024

carmocca approved these changes Mar 8, 2024

awaelchli and others added 3 commits March 8, 2024 12:04

oops

53dd322

Merge branch 'master' into fix/fsdp-checkpoint

d132d9d

Merge branch 'master' into fix/fsdp-checkpoint

81e693c

awaelchli merged commit b3275e0 into Lightning-AI:master Mar 13, 2024
85 of 86 checks passed

awaelchli added a commit that referenced this pull request Apr 10, 2024

Sharded state dicts save correctly when save_weights_only=True (#19524)

3c7181a

Co-authored-by: Dimitri <[email protected]>
Co-authored-by: awaelchli <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>

lantiga pushed a commit that referenced this pull request Apr 11, 2024

Sharded state dicts save correctly when save_weights_only=True (#19524)

998314a

Co-authored-by: Dimitri <[email protected]>
Co-authored-by: awaelchli <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
