Fix num_nodes not set for DDPFullyShardedNativeStrategy #17160

Merged
Borda merged 8 commits into release/LTS from bugfix/ddp-sharded-num-nodes on Mar 29, 2023

@awaelchli (Contributor) commented Mar 21, 2023

What does this PR do?

Fixes #17028

This bug affects only DDPFullyShardedNativeStrategy: every other strategy defines num_nodes as a public property setter that assigns the protected variable internally. Only the 1.9.x release line is affected.
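
For readers unfamiliar with the pattern, here is a minimal sketch of how this kind of mismatch can arise. It is not Lightning's code: `SetterStrategy`, `PlainStrategy`, `configure_protected`, and `configure_public` are hypothetical names, and the assumption that the affected strategy kept `num_nodes` as a plain attribute is mine, made for illustration.

```python
class SetterStrategy:
    """Hypothetical strategy: num_nodes is a property backed by _num_nodes."""

    def __init__(self) -> None:
        self._num_nodes = 1

    @property
    def num_nodes(self) -> int:
        return self._num_nodes

    @num_nodes.setter
    def num_nodes(self, value: int) -> None:
        # The public setter assigns the protected variable internally.
        self._num_nodes = value


class PlainStrategy:
    """Hypothetical strategy: num_nodes is a plain attribute (assumption)."""

    def __init__(self) -> None:
        self.num_nodes = 1


def configure_protected(strategy, num_nodes: int) -> None:
    # Writes the protected name directly; only strategies whose public
    # property reads _num_nodes will observe the new value.
    strategy._num_nodes = num_nodes


def configure_public(strategy, num_nodes: int) -> None:
    # Writes through the public name; works for both kinds of strategy.
    strategy.num_nodes = num_nodes


setter_based = SetterStrategy()
configure_protected(setter_based, 4)
print(setter_based.num_nodes)  # the property reads _num_nodes

plain = PlainStrategy()
configure_protected(plain, 4)
print(plain.num_nodes)  # the plain attribute was never touched

configure_public(plain, 4)
print(plain.num_nodes)  # assignment via the public name takes effect
```

Under these assumptions, assigning through the public name is the safer choice because it behaves the same whether or not the strategy defines a setter.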

cc @Borda

@awaelchli added the bug and strategy: fairscale fsdp (removed) labels on Mar 21, 2023
@github-actions bot added the pl label on Mar 21, 2023
@awaelchli added this to the v1.9.x milestone on Mar 21, 2023

github-actions bot (Contributor) commented Mar 21, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, pytorch, 3.8, 1.11) success
pl-cpu (macOS-11, pytorch, 3.9, 1.12) success
pl-cpu (macOS-11, pytorch, 3.10, 1.13) success
pl-cpu (macOS-11, pytorch, 3.8, 1.10, oldest) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.10) success
pl-cpu (ubuntu-20.04, pytorch, 3.9, 1.11) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.12) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.7, 1.10, oldest) success
pl-cpu (windows-2022, pytorch, 3.9, 1.11) success
pl-cpu (windows-2022, pytorch, 3.10, 1.12) success
pl-cpu (windows-2022, pytorch, 3.10, 1.13) success
pl-cpu (windows-2022, pytorch, 3.7, 1.10, oldest) success
pl-cpu (slow, macOS-11, pytorch, 3.7, 1.11) success
pl-cpu (slow, ubuntu-20.04, pytorch, 3.7, 1.11) success
pl-cpu (slow, windows-2022, pytorch, 3.7, 1.11) success
pl-cpu (macOS-11, lightning, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.13) success
pl-cpu (windows-2022, lightning, 3.8, 1.13) success

These checks are required after the changes to requirements/fabric/base.txt, requirements/pytorch/base.txt, src/pytorch_lightning/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to requirements/pytorch/base.txt, src/pytorch_lightning/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, requirements/fabric/base.txt.

🟢 pytorch_lightning: Benchmarks
Check ID Status
pytorch-lightning.Benchmark success

These checks are required after the changes to requirements/pytorch/base.txt.

🟢 pytorch_lightning: Azure HPU
Check ID Status
pytorch-lightning (HPUs) success

These checks are required after the changes to requirements/fabric/base.txt, requirements/pytorch/base.txt, src/pytorch_lightning/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/pytorch_lightning/trainer/connectors/accelerator_connector.py, requirements/pytorch/base.txt.

🟢 pytorch_lightning: Docker
Check ID Status
build-cuda (3.8, 1.10, 11.3.1) success
build-cuda (3.8, 1.11, 11.3.1) success
build-cuda (3.8, 1.12, 11.3.1) success
build-cuda (3.8, 1.13, 11.6.1) success
build-hpu (1.5.0, 1.11.0) success
build-ipu (3.9, 1.10) success
build-NGC success
build-pl (3.8, 1.10, 11.3.1) success
build-pl (3.8, 1.11, 11.3.1) success
build-pl (3.8, 1.12, 11.3.1) success
build-pl (3.8, 1.13, 11.6.1) success
build-xla (3.7, 1.12) success

These checks are required after the changes to requirements/pytorch/base.txt, requirements/fabric/base.txt.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, fabric, 3.8, 1.11) success
fabric-cpu (macOS-11, fabric, 3.9, 1.12) success
fabric-cpu (macOS-11, fabric, 3.10, 1.13) success
fabric-cpu (macOS-11, fabric, 3.7, 1.10, oldest) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.10) success
fabric-cpu (ubuntu-20.04, fabric, 3.9, 1.11) success
fabric-cpu (ubuntu-20.04, fabric, 3.10, 1.12) success
fabric-cpu (ubuntu-20.04, fabric, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.7, 1.10, oldest) success
fabric-cpu (windows-2022, fabric, 3.9, 1.11) success
fabric-cpu (windows-2022, fabric, 3.10, 1.12) success
fabric-cpu (windows-2022, fabric, 3.10, 1.13) success
fabric-cpu (windows-2022, fabric, 3.7, 1.10, oldest) success
fabric-cpu (macOS-11, lightning, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.13) success
fabric-cpu (windows-2022, lightning, 3.8, 1.13) success

These checks are required after the changes to requirements/fabric/base.txt.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) success

These checks are required after the changes to requirements/fabric/base.txt.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to requirements/fabric/base.txt, requirements/pytorch/base.txt, src/pytorch_lightning/trainer/connectors/accelerator_connector.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.7) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.7) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.7) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.7) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.7) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.7) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.7) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.7) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.7) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.7) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.7) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.7) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.7) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.7) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.7) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/pytorch_lightning/trainer/connectors/accelerator_connector.py, requirements/fabric/base.txt, requirements/pytorch/base.txt.

🟢 link-check
Check ID Status
markdown-link-check success

These checks are required after the changes to src/pytorch_lightning/CHANGELOG.md.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli force-pushed the bugfix/ddp-sharded-num-nodes branch from c5a5bb1 to a2021fc on March 21, 2023 10:30
@awaelchli force-pushed the bugfix/ddp-sharded-num-nodes branch from 34ee803 to 37e4e91 on March 21, 2023 10:53
@mergify bot added the ready label on Mar 24, 2023
@github-actions bot added the fabric label on Mar 26, 2023

codecov bot commented Mar 26, 2023

Codecov Report

Merging #17160 (3e123a3) into release/LTS (571ffd8) will decrease coverage by 19%.
The diff coverage is 100%.

Additional details and impacted files
@@              Coverage Diff               @@
##           release/LTS   #17160     +/-   ##
==============================================
- Coverage           82%      62%    -19%     
==============================================
  Files              476      435     -41     
  Lines            35264    34773    -491     
==============================================
- Hits             28743    21564   -7179     
- Misses            6521    13209   +6688     

@awaelchli force-pushed the bugfix/ddp-sharded-num-nodes branch from c048fba to ac246fa on March 27, 2023 08:21

@Borda (Member) commented Mar 27, 2023

fixing links in #17197

@Borda enabled auto-merge (squash) on March 27, 2023 08:30
@Borda merged commit b8887f6 into release/LTS on Mar 29, 2023
@Borda deleted the bugfix/ddp-sharded-num-nodes branch on March 29, 2023 12:10