Add lazy checkpoint loading for FSDP full-state checkpoints #18150
Conversation
for more information, see https://pre-commit.ci
Awesome! Have you thought about other places in the codebase where we could lazy load?
We can now import our utility from lit-gpt
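For illustration, a hedged sketch of what importing the ported utility could look like; the module path and the symbol name here are assumptions, not confirmed by this thread:

# Hypothetical import; module path and name are assumptions.
from lightning.fabric.utilities.load import _lazy_load

checkpoint = _lazy_load("path/to/checkpoint.pt")
# Entries look like tensors but are only materialized on first use.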
It looks like it is difficult to integrate it into the rest of the codebase. Several things stand in the way and require us to brainstorm design decisions. Here are some raw notes from my notebook:
Happy to open an issue about this for further discussion!
There might be slight differences between the version here and the one in lit-gpt. I need to double-check; there was something about quantization.
# TODO: needed for us?
# materializing with contiguous is needed for quantization
if name in {"contiguous"}:
    return getattr(self._load_tensor(), name)
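For context, a hedged sketch of how this special case could sit inside the placeholder tensor's __getattr__; the class name and the body of _load_tensor are assumptions, and only the snippet above comes from the diff:

import torch


class _NotYetLoadedTensor:  # class name is an assumption
    def _load_tensor(self) -> torch.Tensor:
        # In the real utility this would read the tensor's storage from
        # the checkpoint file; left abstract in this sketch.
        raise NotImplementedError

    def __getattr__(self, name):
        # ``contiguous`` must return a real, materialized tensor (which
        # quantization code relies on), so load the data and forward.
        if name in {"contiguous"}:
            return getattr(self._load_tensor(), name)
        raise AttributeError(name)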
I should probably keep this here if we want to import and use the util in lit-gpt?
It might also be a generally applicable limitation. On the other hand, we could do this manually in the quantization code directly.
What do you suggest @t-vi?
What does this PR do?
Addresses (2) in #18008 (comment)
Closes #18138
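To make the idea concrete, here is a minimal, hedged sketch of the lazy-loading pattern; all names are illustrative and this is not the PR's actual implementation:

import torch


class LazyTensor:
    """Stand-in for a checkpoint tensor; loads the real data on demand."""

    def __init__(self, load_fn):
        self._load_fn = load_fn  # callable that reads the tensor from disk
        self._materialized = None

    def materialize(self) -> torch.Tensor:
        if self._materialized is None:
            self._materialized = self._load_fn()
        return self._materialized

    def __getattr__(self, name):
        # Any tensor attribute or method access forces materialization.
        return getattr(self.materialize(), name)

With this pattern, a process loading a full FSDP state dict can materialize one tensor at a time and release it after use, instead of holding the entire checkpoint in CPU memory at once.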
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
cc @Borda @justusschock @awaelchli @carmocca