Add lazy checkpoint loading for FSDP full-state checkpoints #18150

Merged on Jul 26, 2023 (28 commits)

Conversation

@awaelchli awaelchli commented Jul 24, 2023

What does this PR do?

Addresses (2) in #18008 (comment)
Closes #18138

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

cc @Borda @justusschock @awaelchli @carmocca

@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Jul 24, 2023
@awaelchli awaelchli changed the title WIP: Add lazy checkpoint loading (1/n) WIP: Add lazy checkpoint loading for FSDP Jul 25, 2023
@awaelchli awaelchli changed the title WIP: Add lazy checkpoint loading for FSDP WIP: Add lazy checkpoint loading for FSDP full-state checkpoints Jul 25, 2023
@awaelchli awaelchli marked this pull request as ready for review July 26, 2023 15:37
@awaelchli awaelchli changed the title WIP: Add lazy checkpoint loading for FSDP full-state checkpoints Add lazy checkpoint loading for FSDP full-state checkpoints Jul 26, 2023
github-actions bot commented Jul 26, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py, tests/parity_fabric/test_parity_ddp.py.

🟢 fabric: Docs
Check ID Status
make-doctest (fabric) success
make-html (fabric) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11) success
fabric-cpu (windows-2022, lightning, 3.9, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py, tests/tests_fabric/utilities/test_load.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py, tests/tests_fabric/utilities/test_load.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/fabric/utilities/load.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli awaelchli added the feature Is an improvement or enhancement label Jul 26, 2023
@awaelchli awaelchli added the data handling Generic data-related topic label Jul 26, 2023
@awaelchli awaelchli added this to the 2.1 milestone Jul 26, 2023
@awaelchli awaelchli self-assigned this Jul 26, 2023
@carmocca carmocca left a comment

Awesome! Have you thought about other places in the codebase where we could lazy load?

We can now import our utility from lit-gpt

src/lightning/fabric/strategies/fsdp.py (resolved)
src/lightning/fabric/utilities/load.py (outdated, resolved)
awaelchli commented Jul 26, 2023

> Have you thought about other places in the codebase where we could lazy load?

It looks like it is difficult to integrate it into the rest of the code base. Several things are standing in the way that require us to brainstorm on design decisions. Here are some raw notes from my notebook:

  • Lazy tensors can't be loaded by an optimizer, because the optimizer creates a deepcopy of the state, which causes pickling issues.
  • Lazy load bypasses all the torch.load logic, which means, for example, that it doesn't support map_location.
  • Loading from filesystems via fsspec is not supported.
  • It is hard to integrate lazy load with the existing IO plugin structure, since it needs to be injected at the lowest level, where the file gets read, yet has implications for usage at the highest level.
  • Tensors remain lazy after loading (e.g. in the on_load_checkpoint hook); users can't access the weights unless they materialize them manually.

Happy to open an issue about this for further discussion!
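
The trade-offs in the notes above are easier to see with a toy model of lazy loading. The following is a minimal sketch, not the code from this PR: `LazyEntry`, `save_checkpoint`, and `lazy_load` are hypothetical names, and JSON records in a zip archive stand in for tensor storages. It shows that loading returns placeholders only, and that nothing is read until the caller materializes an entry explicitly.

```python
import json
import tempfile
import zipfile
from pathlib import Path


class LazyEntry:
    """Hypothetical placeholder: remembers where a value lives, reads it on demand."""

    def __init__(self, archive_path, record_name):
        self.archive_path = archive_path
        self.record_name = record_name

    def materialize(self):
        # Only this one record is read and decoded, not the whole archive.
        with zipfile.ZipFile(self.archive_path) as archive:
            return json.loads(archive.read(self.record_name))


def save_checkpoint(path, state):
    # One archive record per entry, so entries can be read independently.
    with zipfile.ZipFile(path, "w") as archive:
        for name, values in state.items():
            archive.writestr(name, json.dumps(values))


def lazy_load(path):
    # Reads only the archive's table of contents; no payload is decoded here.
    with zipfile.ZipFile(path) as archive:
        return {name: LazyEntry(path, name) for name in archive.namelist()}


checkpoint_path = Path(tempfile.mkdtemp()) / "toy_checkpoint.zip"
save_checkpoint(checkpoint_path, {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]})

lazy_state = lazy_load(checkpoint_path)
# Entries stay lazy until the caller asks for the data explicitly.
weight = lazy_state["layer.weight"].materialize()
```

Because the toy loader opens the file itself, any option a general-purpose loader would normally apply (the notes mention map_location) never comes into play unless the lazy path reimplements it. The same holds for code that receives `lazy_state` later: it sees placeholders, not values.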

> We can now import our utility from lit-gpt

There might be slight differences between the version here and the one in lit-gpt. I need to double-check; there was something about quantization.

Comment on lines 154 to 157
# TODO: needed for us?
# materializing with contiguous is needed for quantization
if name in {"contiguous"}:
    return getattr(self._load_tensor(), name)

Contributor Author

I should probably keep this here if we want to import and use the util in lit-gpt?

Contributor

It might also be a generally applicable limitation. On the other hand, we could do this manually in the quantization code directly.

What do you suggest @t-vi?
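
The two options discussed in this thread can be contrasted with a small sketch. It is an illustration under assumptions, not the code in load.py: `LazyProxy`, `FakeTensor`, and the `loader` callable are hypothetical, and only the `_load_tensor` name and the `contiguous` special case come from the snippet under review.

```python
class FakeTensor:
    """Hypothetical stand-in for a real tensor."""

    def __init__(self, values):
        self.values = list(values)

    def contiguous(self):
        return self


class LazyProxy:
    # Attribute names that trigger materialization when accessed on the proxy.
    _MATERIALIZING_ATTRS = {"contiguous"}

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable that produces the tensor
        self.load_count = 0

    def _load_tensor(self):
        self.load_count += 1
        return self._loader()

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self._MATERIALIZING_ATTRS:
            return getattr(self._load_tensor(), name)
        raise AttributeError(f"{type(self).__name__!r} object has no attribute {name!r}")


proxy = LazyProxy(lambda: FakeTensor([1.0, 2.0]))

# Option 1: the proxy forwards `contiguous`, so calling code needs no changes.
via_forwarding = proxy.contiguous()

# Option 2: the calling code (e.g. a quantization routine) materializes explicitly.
via_explicit_load = proxy._load_tensor().contiguous()
```

Forwarding keeps downstream code unchanged but hides the point at which data is read; explicit materialization makes that point visible at the cost of touching every call site.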

Labels
data handling Generic data-related topic fabric lightning.fabric.Fabric feature Is an improvement or enhancement ready PRs ready to be merged
3 participants