Move torchmetrics to device when using FSDP #18954

Merged on Nov 8, 2023 (14 commits into master)

@awaelchli commented on Nov 6, 2023

What does this PR do?

Fixes #18888

FSDP doesn't move modules that have no parameters to the device. TorchMetrics metrics have no parameters, but they need to be moved to the right device so that their state updates happen on that device. This PR checks whether the LightningModule contains Metric modules and moves them automatically. This ensures a seamless integration, since metrics are a core feature in Lightning.

Corresponding issue on PyTorch: pytorch/pytorch#113113
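
For illustration, the idea described above can be sketched roughly as follows. This is not the code from this PR: `move_submodules_of_type`, `ToyMetric`, and `ToyModel` are hypothetical names, and `ToyMetric` is only a stand-in for a `torchmetrics.Metric` (a module with state but no parameters) so that the sketch depends on PyTorch alone. In real use one would presumably pass `torchmetrics.Metric` as the type to look for.

```python
import torch
from torch import nn


class ToyMetric(nn.Module):
    """Hypothetical stand-in for a torchmetrics.Metric: state, but no parameters."""

    def __init__(self) -> None:
        super().__init__()
        # State is kept in buffers, so `.parameters()` is empty for this module.
        self.register_buffer("total", torch.tensor(0.0))
        self.register_buffer("count", torch.tensor(0))

    def update(self, value: torch.Tensor) -> None:
        # The state update runs on whatever device the buffers live on.
        self.total += value.sum()
        self.count += value.numel()


class ToyModel(nn.Module):
    """Hypothetical model that owns a metric next to a regular layer."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = nn.Linear(4, 2)
        self.train_metric = ToyMetric()


def move_submodules_of_type(module: nn.Module, module_type: type, device: torch.device) -> list:
    """Move every submodule of the given type to `device` and return their names."""
    moved = []
    for name, submodule in module.named_modules():
        if isinstance(submodule, module_type):
            submodule.to(device)  # nn.Module.to() modifies the module in place
            moved.append(name)
    return moved


model = ToyModel()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
moved = move_submodules_of_type(model, ToyMetric, device)
```

The helper walks the module tree explicitly instead of relying on the wrapper to place everything, which is the behavior this PR works around for modules without parameters.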


📚 Documentation preview 📚: https://pytorch-lightning--18954.org.readthedocs.build/en/18954/

cc @Borda @carmocca @justusschock @awaelchli

@github-actions bot added the "pl" label (Generic label for PyTorch Lightning package) on Nov 6, 2023
@awaelchli added the labels "bug" (Something isn't working), "strategy: fsdp" (Fully Sharded Data Parallel), and "fun" (Staff contributions outside working hours - to differentiate from the "community" label) on Nov 6, 2023
@awaelchli added this to the 2.1.x milestone on Nov 6, 2023
@awaelchli marked this pull request as ready for review on November 6, 2023 03:02
github-actions bot commented Nov 6, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.12, oldest) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.10, 2.1) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.12, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1) success
pl-cpu (windows-2022, lightning, 3.8, 1.12, oldest) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.10, 2.1) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.1) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.1) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.1) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/pytorch/strategies/fsdp.py, tests/tests_pytorch/strategies/test_fsdp.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) (testing Lightning | latest) success
pytorch-lightning (GPUs) (testing PyTorch | latest) success

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py, tests/tests_pytorch/strategies/test_fsdp.py, src/lightning/fabric/strategies/fsdp.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/pytorch/strategies/fsdp.py.

🟢 fabric: Docs
Check ID Status
docs-make (fabric, doctest) success
docs-make (fabric, html) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/strategies/fsdp.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.12, oldest) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.11, 2.1) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.12, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.1) success
fabric-cpu (windows-2022, lightning, 3.8, 1.12, oldest) success
fabric-cpu (windows-2022, lightning, 3.9, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.11, 2.1) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success
fabric-cpu (macOS-12, fabric, 3.11, 2.0) success
fabric-cpu (macOS-12, fabric, 3.11, 2.1) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.0) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.1) success
fabric-cpu (windows-2022, fabric, 3.11, 2.0) success
fabric-cpu (windows-2022, fabric, 3.11, 2.1) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) (testing Fabric | latest) success
lightning-fabric (GPUs) (testing Lightning | latest) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/pytorch/strategies/fsdp.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/fabric/strategies/fsdp.py, src/lightning/pytorch/strategies/fsdp.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

codecov bot commented Nov 6, 2023

Codecov Report

Merging #18954 (41fc4bd) into master (62771f3) will decrease coverage by 24%.
Report is 6 commits behind head on master.
The diff coverage is 78%.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #18954     +/-   ##
=========================================
- Coverage      75%      51%    -24%     
=========================================
  Files         450      445      -5     
  Lines       36150    36145      -5     
=========================================
- Hits        27247    18476   -8771     
- Misses       8903    17669   +8766     

@carmocca left a comment

Shouldn't this be done in Fabric too?

@awaelchli

I would prefer to first get a comment on the approach #18954 (comment) before doing it.

@mergify bot added the "has conflicts" label on Nov 6, 2023
@github-actions bot added the "fabric" label (lightning.fabric.Fabric) on Nov 7, 2023
@mergify bot removed the "has conflicts" label on Nov 7, 2023
Review thread on tests/tests_pytorch/strategies/test_fsdp.py (resolved)
Review thread on src/lightning/fabric/strategies/fsdp.py (outdated, resolved)
Review thread on src/lightning/fabric/strategies/fsdp.py (outdated, resolved)
@mergify bot added the "ready" label (PRs ready to be merged) on Nov 8, 2023
@awaelchli merged commit 964364b into master on Nov 8, 2023
115 of 117 checks passed
@awaelchli deleted the bugfix/metrics-fsdp branch on November 8, 2023 20:29
Borda pushed a commit that referenced this pull request Nov 14, 2023
lantiga pushed a commit that referenced this pull request Nov 15, 2023
Labels
bug (Something isn't working)
fabric (lightning.fabric.Fabric)
fun (Staff contributions outside working hours - to differentiate from the "community" label)
pl (Generic label for PyTorch Lightning package)
ready (PRs ready to be merged)
strategy: fsdp (Fully Sharded Data Parallel)
Development

Successfully merging this pull request may close these issues.

When Using FSDP Strategy, Lightning Does not Move TorchMetrics to Device (Torch 2.1.0)
3 participants