Save and load sharded checkpoints with FSDP in Fabric #17323
Conversation
⚡ Required checks status: All passing 🟢

Groups summary:
- 🟢 pytorch_lightning: Tests workflow
- 🟢 pytorch_lightning: Azure GPU
- 🟢 fabric: Docs
- 🟢 lightning_fabric: CPU workflow
- 🟢 lightning_fabric: Azure GPU
- 🟢 mypy
- 🟢 install
- 🟢 link-check

These checks are required after the changes. Thank you for your contribution! 💜
Awesome!
Co-authored-by: Luca Antiga <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
I don't think that porting this to support torch versions below 2.0 is important. If you are worried about silent errors, we can raise an error at the start of FSDP setup when the torch version is lower than 2.0, suggesting that the user upgrade.
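A minimal sketch of such a guard, assuming the check runs eagerly during strategy setup; the helper name and error wording are illustrative, not the actual implementation:

```python
import torch
from packaging.version import Version


def _validate_torch_version_for_fsdp() -> None:
    """Hypothetical guard: fail fast instead of failing silently on older PyTorch."""
    # Strip any local version suffix such as "+cu118" before comparing
    if Version(torch.__version__.split("+")[0]) < Version("2.0.0"):
        raise RuntimeError(
            "Saving and loading sharded checkpoints with FSDP requires PyTorch >= 2.0."
            " Please upgrade your `torch` installation."
        )
```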
Codecov Report
Additional details and impacted files:

@@            Coverage Diff            @@
##           master   #17323     +/-   ##
=========================================
- Coverage      83%      59%      -24%
=========================================
  Files         415      410        -5
  Lines       31437    31427       -10
=========================================
- Hits        26048    18596     -7452
- Misses       5389    12831     +7442
What does this PR do?
Fixes #14816
This PR enables saving and loading sharded (distributed) checkpoints with the FSDP strategy in Fabric, covering the model and optimizer state as well as additional user data.
The checkpoint file structure looks like this (if devices=2):
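A sketch of the layout described below, assuming the default shard file names produced by PyTorch's distributed checkpoint file writer (the exact names may differ):

```
path/to/checkpoint/
├── .metadata        # index written by the distributed checkpoint file writer
├── __0_0.distcp     # tensor shards written by rank 0
├── __1_0.distcp     # tensor shards written by rank 1
└── meta.pt          # extra user data saved by Fabric's FSDPStrategy
```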
The ".metadata" file is from the FSDP file writer, the "*.distcp" are the distributed checkpoint files holding the tensors, and the "meta.pt" is a file that Fabric's FSDPStrategy saves with all user dict data next to model and optimizer (from the example above:
{"other": "anything}
Future Work
This is a minimal implementation of sharded checkpointing and loading. It is the best choice for large models and the most memory-efficient option FSDP can offer right now: offload to CPU, sharded state dicts, and a chunk-wise file writer (a rough sketch of this mechanism is included at the end of this description). In the future, we need to:
While testing, I stumbled upon this bug in PyTorch: pytorch/pytorch#99079
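For context, a rough sketch of the PyTorch >= 2.0 mechanism referred to above (sharded state dicts with CPU offload, streamed to disk by a file writer). This illustrates the underlying `torch.distributed.fsdp` and `torch.distributed.checkpoint` APIs under those assumptions, not necessarily how Fabric wires them up internally:

```python
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from torch.distributed.fsdp.api import ShardedOptimStateDictConfig, ShardedStateDictConfig


def save_sharded(model: FSDP, optimizer, path: str) -> None:
    # Each rank materializes only its own shards, offloaded to CPU to cap GPU memory
    with FSDP.state_dict_type(
        model,
        StateDictType.SHARDED_STATE_DICT,
        state_dict_config=ShardedStateDictConfig(offload_to_cpu=True),
        optim_state_dict_config=ShardedOptimStateDictConfig(offload_to_cpu=True),
    ):
        state = {
            "model": model.state_dict(),
            "optimizer": FSDP.optim_state_dict(model, optimizer),
        }
    # The file writer streams the shards to disk, one set of .distcp files per rank
    dist_cp.save_state_dict(state_dict=state, storage_writer=dist_cp.FileSystemWriter(path))
```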
cc @Borda @awaelchli @carmocca @justusschock