Make DDP subprocess the default launcher for multi-device #16780

Merged
awaelchli merged 16 commits into master from feature/ddp-default
Feb 20, 2023

Conversation

@awaelchli (Contributor) commented Feb 16, 2023

What does this PR do?

When using multiple devices, the strategy now defaults to "ddp" instead of "ddp_spawn" when none is set.

Before:

# in non-interactive env, implies strategy=ddp_spawn
# in Jupyter notebooks, implies strategy=ddp_notebook
trainer = Trainer(devices=2)

Now:

# in non-interactive env, implies strategy=ddp
# in Jupyter notebooks, implies strategy=ddp_notebook
trainer = Trainer(devices=2)

# Still works:
trainer = Trainer(devices=2, strategy="ddp_spawn")
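
The selection rule described above can be sketched as a small pure function. This is only an illustration of the behavior stated in this description, not the code in Lightning's accelerator connector; the function name `resolve_multi_device_strategy` and its `interactive` and `legacy` parameters are hypothetical, and the sketch assumes more than one device was requested.

```python
from typing import Optional


def resolve_multi_device_strategy(
    strategy: Optional[str] = None,
    interactive: bool = False,
    legacy: bool = False,
) -> str:
    """Hypothetical sketch of the default strategy choice for multiple devices.

    strategy:    the value the user passed, or None if nothing was set
    interactive: True when running in a Jupyter notebook
    legacy:      True reproduces the behavior before this PR
    """
    # An explicit choice always wins, so strategy="ddp_spawn" still works.
    if strategy is not None:
        return strategy
    # Notebooks keep their dedicated default, before and after this PR.
    if interactive:
        return "ddp_notebook"
    # Non-interactive environments: this is the default that changed.
    return "ddp_spawn" if legacy else "ddp"
```

Calling it with no arguments returns "ddp", while `legacy=True` returns "ddp_spawn", which mirrors the Before/Now snippets.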

DDP-spawn is useful for debugging and testing, but it has "invisible" process boundaries that beginner users in particular are often unaware of. Because the worker processes join after Trainer.fit() etc. returns, state changed inside them is not updated in the main process (only the model params are transferred back), and this can lead to unexpected behavior. Very early in Lightning's life, spawn also worked in notebooks, which was the primary reason it was chosen as the default. Support for that was later dropped in PyTorch, and we subsequently introduced ddp-fork as an alternative.
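
The "invisible" process boundary can be reproduced without Lightning. The following is a minimal sketch using only the standard library's `multiprocessing`; `ToyTrainer`, `_worker`, and `fit_in_child` are hypothetical names, and the class is a stand-in, not Lightning's Trainer. The child mutates its own copy of the object, and only what is sent back explicitly (here the "weights") reaches the parent.

```python
import multiprocessing as mp


class ToyTrainer:
    """Hypothetical stand-in for a trainer object; not Lightning's Trainer."""

    def __init__(self):
        self.weights = [0.0, 0.0]  # plays the role of the model parameters
        self.batches_seen = 0      # plays the role of any other state


def _worker(trainer, queue):
    # Runs in the child process and mutates the child's copy only.
    trainer.weights = [w + 1.0 for w in trainer.weights]
    trainer.batches_seen = 100
    # Only what is sent back explicitly crosses the process boundary.
    queue.put(trainer.weights)


def fit_in_child(trainer, start_method="spawn"):
    """Run the toy 'fit' in a separate process and copy back the weights."""
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(trainer, queue))
    proc.start()
    # Read the result before joining so the child can flush its queue.
    trainer.weights = queue.get(timeout=60)
    proc.join()
    return trainer
```

After `fit_in_child(ToyTrainer())` returns, the weights are updated but `batches_seen` is still 0 in the parent, which is the kind of surprise described above. With the "spawn" start method, the call should sit under an `if __name__ == "__main__":` guard in a script, since the child re-imports the main module.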

Over time, we started discouraging ddp-spawn in docs, tutorials, etc., and always promoted explicitly setting strategy=ddp. With Lightning 2.0, we are now ready to switch fully to ddp as the default multi-device strategy for small- to medium-sized models.

cc @Borda @justusschock @carmocca @awaelchli

@awaelchli added the breaking change, strategy: ddp, and strategy: ddp spawn labels Feb 16, 2023
@awaelchli added this to the 2.0 milestone Feb 16, 2023
@github-actions bot added the fabric and pl labels Feb 16, 2023
@awaelchli self-assigned this Feb 16, 2023
@awaelchli marked this pull request as ready for review February 16, 2023 11:20

@github-actions bot (Contributor) commented Feb 16, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.9, 1.11) success
pl-cpu (windows-2022, lightning, 3.10, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py, src/lightning/fabric/connector.py.

🟢 pytorch_lightning: Azure HPU
Check ID Status
pytorch-lightning (HPUs) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure IPU
Check ID Status
pytorch-lightning (IPUs) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, docs/source-pytorch/accelerators/gpu_faq.rst, docs/source-pytorch/accelerators/gpu_intermediate.rst, docs/source-pytorch/common/trainer.rst, docs/source-pytorch/extensions/strategy.rst, docs/source-pytorch/starter/style_guide.rst.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.11) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
fabric-cpu (windows-2022, lightning, 3.9, 1.11) success
fabric-cpu (windows-2022, lightning, 3.10, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success

These checks are required after the changes to src/lightning/fabric/connector.py, tests/tests_fabric/test_connector.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) success

These checks are required after the changes to src/lightning/fabric/connector.py, tests/tests_fabric/test_connector.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 link-check
Check ID Status
markdown-link-check success

These checks are required after the changes to src/lightning/fabric/CHANGELOG.md, src/lightning/pytorch/CHANGELOG.md.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@codecov bot commented Feb 16, 2023

Codecov Report

Merging #16780 (d988f24) into master (6950a07) will decrease coverage by 22%.
The diff coverage is 100%.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #16780     +/-   ##
=========================================
- Coverage      82%      59%    -22%     
=========================================
  Files         437      412     -25     
  Lines       31583    31280    -303     
=========================================
- Hits        25784    18534   -7250     
- Misses       5799    12746   +6947     

@mergify bot added the ready label and removed the has conflicts and ready labels Feb 20, 2023
@awaelchli enabled auto-merge (squash) February 20, 2023 09:58
@awaelchli merged commit 81b7c30 into master Feb 20, 2023
@awaelchli deleted the feature/ddp-default branch February 20, 2023 11:20
Labels
breaking change, fabric, pl, ready, strategy: ddp

4 participants