Add lazy checkpoint loading for FSDP full-state checkpoints (Trainer) #18379

awaelchli · 2023-08-24T02:03:46Z

What does this PR do?

Fixes #8043
Follows the same logic as added in #18150

Currently, loading a full-state checkpoint replicates it in CPU memory once per process, which can cause CPU OOM on machines with limited RAM. This PR instead lazy-loads the checkpoint (no upfront memory allocation) and loads tensors on the fly as they are needed. As a result, each process holds only one weight tensor in memory at a time while model.load_state_dict() reads from the checkpoint.
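
To illustrate the idea, here is a minimal sketch that uses only the Python standard library. It is not the code in this PR: the real change works on PyTorch tensors inside the FSDP strategy, whereas here each "tensor" is a plain list of floats stored as its own entry in a zip archive. The names `save_full_checkpoint`, `LazyCheckpoint`, and `load_into` are hypothetical and exist only for this example.

```python
import json
import os
import tempfile
import zipfile
from collections.abc import Mapping


def save_full_checkpoint(path, state_dict):
    """Write each entry of state_dict as its own member of a zip archive."""
    with zipfile.ZipFile(path, "w") as archive:
        for name, values in state_dict.items():
            archive.writestr(name, json.dumps(values))


class LazyCheckpoint(Mapping):
    """Read-only mapping that reads an entry from disk only when it is accessed."""

    def __init__(self, path):
        self._path = path
        with zipfile.ZipFile(path) as archive:
            # Only the entry names are read here, not the payloads.
            self._names = archive.namelist()
        self.reads = []  # records which entries were actually loaded

    def __getitem__(self, name):
        if name not in self._names:
            raise KeyError(name)
        with zipfile.ZipFile(self._path) as archive:
            self.reads.append(name)
            return json.loads(archive.read(name))

    def __iter__(self):
        return iter(self._names)

    def __len__(self):
        return len(self._names)


def load_into(model_params, checkpoint):
    """Copy entries one by one so at most one loaded entry is alive at a time."""
    for name in model_params:
        values = checkpoint[name]        # the entry is materialized here
        model_params[name][:] = values   # copy into the existing buffer
        del values                       # drop it before reading the next entry


workdir = tempfile.mkdtemp()
ckpt_path = os.path.join(workdir, "full.ckpt")
save_full_checkpoint(ckpt_path, {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]})

lazy = LazyCheckpoint(ckpt_path)
params = {"layer.weight": [0.0, 0.0], "layer.bias": [0.0]}
load_into(params, lazy)
```

Constructing `LazyCheckpoint` reads only the entry names. A payload is read when its key is accessed, and `load_into` releases it before moving on, which mirrors the "one weight tensor per process at a time" behavior described above.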

cc @Borda @awaelchli @carmocca

github-actions · 2023-08-24T10:04:03Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow

Check ID	Status
pl-cpu (macOS-11, lightning, 3.8, 1.11)	success	✅
pl-cpu (macOS-11, lightning, 3.9, 1.12)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 1.13)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 2.0)	success	✅
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 1.11)	success	✅
pl-cpu (windows-2022, lightning, 3.9, 1.12)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 1.13)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 2.0)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (macOS-11, pytorch, 3.8, 1.13)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13)	success	✅
pl-cpu (windows-2022, pytorch, 3.8, 1.13)	success	✅

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py, tests/tests_pytorch/strategies/test_fsdp.py.

🟢 pytorch_lightning: Azure GPU

Check ID	Status
[pytorch-lightning (GPUs) (testing Lightning | latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=171269&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e)	success
[pytorch-lightning (GPUs) (testing PyTorch | latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=171269&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f)	success

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py, tests/tests_pytorch/strategies/test_fsdp.py.

🟢 pytorch_lightning: Benchmarks

Check ID	Status
lightning.Benchmarks	success	✅

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py.

🟢 pytorch_lightning: Docs

Check ID	Status
docs-checks (pytorch, doctest)	success	✅
make-html (pytorch)	success	✅

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py.

🟢 mypy

Check ID	Status
mypy	success	✅

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py.

🟢 install

Check ID	Status
install-pkg (ubuntu-22.04, app, 3.8)	success	✅
install-pkg (ubuntu-22.04, app, 3.10)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.8)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.10)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.8)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.10)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.8)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.10)	success	✅
install-pkg (ubuntu-22.04, notset, 3.8)	success	✅
install-pkg (ubuntu-22.04, notset, 3.10)	success	✅
install-pkg (macOS-12, app, 3.8)	success	✅
install-pkg (macOS-12, app, 3.10)	success	✅
install-pkg (macOS-12, fabric, 3.8)	success	✅
install-pkg (macOS-12, fabric, 3.10)	success	✅
install-pkg (macOS-12, pytorch, 3.8)	success	✅
install-pkg (macOS-12, pytorch, 3.10)	success	✅
install-pkg (macOS-12, lightning, 3.8)	success	✅
install-pkg (macOS-12, lightning, 3.10)	success	✅
install-pkg (macOS-12, notset, 3.8)	success	✅
install-pkg (macOS-12, notset, 3.10)	success	✅
install-pkg (windows-2022, app, 3.8)	success	✅
install-pkg (windows-2022, app, 3.10)	success	✅
install-pkg (windows-2022, fabric, 3.8)	success	✅
install-pkg (windows-2022, fabric, 3.10)	success	✅
install-pkg (windows-2022, pytorch, 3.8)	success	✅
install-pkg (windows-2022, pytorch, 3.10)	success	✅
install-pkg (windows-2022, lightning, 3.8)	success	✅
install-pkg (windows-2022, lightning, 3.10)	success	✅
install-pkg (windows-2022, notset, 3.8)	success	✅
install-pkg (windows-2022, notset, 3.10)	success	✅

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py.

Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

carmocca

This feature will likely be short-lived, but it's fine to add until the replacement is in.

update

0929ed7

github-actions bot added the pl Generic label for PyTorch Lightning package label Aug 24, 2023

awaelchli added the labels feature (Is an improvement or enhancement), checkpointing (Related to checkpointing), and strategy: fsdp (Fully Sharded Data Parallel), and removed the label pl (Generic label for PyTorch Lightning package) Aug 24, 2023

awaelchli modified the milestones: 2.0.x, 2.1 Aug 24, 2023

materialize

d319270

github-actions bot added the pl Generic label for PyTorch Lightning package label Aug 24, 2023

test

a2c8780

awaelchli force-pushed the feature/fsdp-lazy-load branch from 64218b8 to a2c8780 Compare August 24, 2023 09:54

[pre-commit.ci] auto fixes from pre-commit.com hooks

8bcf6b3

for more information, see https://pre-commit.ci

awaelchli changed the title ~~WIP: FSDP lazy load full state checkpoint~~ FSDP lazy load full state checkpoint Aug 24, 2023

awaelchli changed the title ~~FSDP lazy load full state checkpoint~~ Add lazy checkpoint loading for FSDP full-state checkpoints (Trainer) Aug 24, 2023

awaelchli marked this pull request as ready for review August 24, 2023 10:03

awaelchli requested review from carmocca, justusschock, Borda and williamFalcon as code owners August 24, 2023 10:03

carmocca approved these changes Aug 24, 2023

Borda approved these changes Aug 24, 2023

mergify bot added the ready PRs ready to be merged label Aug 24, 2023

Merge branch 'master' into feature/fsdp-lazy-load

904d614

awaelchli merged commit e8f3863 into master Aug 24, 2023

awaelchli deleted the feature/fsdp-lazy-load branch August 24, 2023 13:35
