Make DDP subprocess the default launcher for multi-device #16780

awaelchli · 2023-02-16T03:36:37Z

What does this PR do?

When using multiple devices, the strategy now defaults to "ddp" instead of "ddp_spawn" when none is set.

Before:

# in non-interactive env, implies strategy=ddp_spawn
# in Jupyter notebooks, implies strategy=ddp_notebook
trainer = Trainer(devices=2)

Now:

# in non-interactive env, implies strategy=ddp
# in Jupyter notebooks, implies strategy=ddp_notebook
trainer = Trainer(devices=2)

# Still works:
trainer = Trainer(devices=2, strategy="ddp_spawn")
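
The before/after behavior above can be summarized as a small selection rule. The function below is a minimal sketch of that rule, not Lightning's connector code: `resolve_default_strategy` and its `interactive` flag are hypothetical names, and returning `None` for a single device is an assumption made only to keep the example self-contained.

```python
from typing import Optional


def resolve_default_strategy(
    devices: int,
    interactive: bool,
    strategy: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper mirroring the defaults described in this PR."""
    # An explicitly requested strategy always wins, e.g. strategy="ddp_spawn".
    if strategy is not None:
        return strategy
    # Assumption: with a single device there is no multi-device strategy to pick.
    if devices <= 1:
        return None
    # Notebooks keep ddp_notebook; everything else now gets ddp (was ddp_spawn).
    return "ddp_notebook" if interactive else "ddp"


print(resolve_default_strategy(devices=2, interactive=False))
print(resolve_default_strategy(devices=2, interactive=True))
print(resolve_default_strategy(devices=2, interactive=False, strategy="ddp_spawn"))
```

The only branch that changes with this PR is the last one: the non-interactive default moves from "ddp_spawn" to "ddp".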

DDP-spawn is good for debugging and testing, but it has "invisible" process boundaries that beginner users in particular are often unaware of. Since the processes join after Trainer.fit() etc., state changed inside the worker processes is not propagated back to the main process (only the model params are transferred), and this can lead to unexpected behavior. Very early on in Lightning's life, spawn also used to work in notebooks, which was the primary reason it was chosen as the default. Support for that was later dropped in PyTorch, and we introduced ddp-fork as an alternative.
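
To make the "invisible" process boundary concrete, here is a minimal sketch that simulates it without starting real processes. It assumes the boundary can be modeled as a pickle round-trip, which is how spawn-based multiprocessing hands objects to workers. `run_fit_in_worker` and the `state` dictionary are hypothetical stand-ins, not Lightning APIs.

```python
import pickle


def run_fit_in_worker(pickled_state: bytes) -> bytes:
    """Stand-in for a spawned worker: it only ever sees a copy of the state."""
    state = pickle.loads(pickled_state)  # the worker's private copy
    state["weights"] = [w + 1.0 for w in state["weights"]]  # pretend training
    state["best_score"] = 0.9  # extra state set during fit
    # Only the weights travel back, mirroring "only the model params are transferred".
    return pickle.dumps(state["weights"])


main_state = {"weights": [0.0, 0.0], "best_score": None}
returned = run_fit_in_worker(pickle.dumps(main_state))
main_state["weights"] = pickle.loads(returned)

print(main_state)  # {'weights': [1.0, 1.0], 'best_score': None}
```

The weights are updated in the main process, but `best_score` is still `None`, which is the kind of surprise described above.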

Over time we started discouraging ddp-spawn in docs, tutorials, etc., and always promoted explicitly setting strategy=ddp. With Lightning 2.0, we are now ready to switch fully over to ddp as the default multi-device strategy for small- to medium-sized models.

cc @Borda @justusschock @carmocca @awaelchli

github-actions · 2023-02-16T11:20:38Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow

Check ID	Status
pl-cpu (macOS-11, lightning, 3.8, 1.11)	success	✅
pl-cpu (macOS-11, lightning, 3.9, 1.12)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 1.13)	success	✅
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.11)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.12)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (windows-2022, lightning, 3.9, 1.11)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 1.12)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 1.13)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (macOS-11, pytorch, 3.8, 1.13)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13)	success	✅
pl-cpu (windows-2022, pytorch, 3.8, 1.13)	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure GPU

Check ID	Status
pytorch-lightning (GPUs)	success	✅

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py, src/lightning/fabric/connector.py.

🟢 pytorch_lightning: Azure HPU

Check ID	Status
pytorch-lightning (HPUs)	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure IPU

Check ID	Status
pytorch-lightning (IPUs)	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/accelerators/test_common.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/models/test_amp.py, tests/tests_pytorch/models/test_gpu.py, tests/tests_pytorch/strategies/test_ddp.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py, tests/tests_pytorch/strategies/test_ddp_strategy.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, tests/tests_pytorch/trainer/test_supporters.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Docs

Check ID	Status
make-doctest (pytorch)	success	✅
make-html (pytorch)	success	✅

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, docs/source-pytorch/accelerators/gpu_faq.rst, docs/source-pytorch/accelerators/gpu_intermediate.rst, docs/source-pytorch/common/trainer.rst, docs/source-pytorch/extensions/strategy.rst, docs/source-pytorch/starter/style_guide.rst.

🟢 lightning_fabric: CPU workflow

Check ID	Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11)	success	✅
fabric-cpu (macOS-11, lightning, 3.9, 1.12)	success	✅
fabric-cpu (macOS-11, lightning, 3.10, 1.13)	success	✅
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.11)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.12)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest)	success	✅
fabric-cpu (windows-2022, lightning, 3.9, 1.11)	success	✅
fabric-cpu (windows-2022, lightning, 3.10, 1.12)	success	✅
fabric-cpu (windows-2022, lightning, 3.10, 1.13)	success	✅
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest)	success	✅
fabric-cpu (macOS-11, fabric, 3.8, 1.13)	success	✅
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13)	success	✅
fabric-cpu (windows-2022, fabric, 3.8, 1.13)	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, tests/tests_fabric/test_connector.py.

🟢 lightning_fabric: Azure GPU

Check ID	Status
lightning-fabric (GPUs)	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, tests/tests_fabric/test_connector.py.

🟢 mypy

Check ID	Status
mypy	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 install

Check ID	Status
install-pkg (ubuntu-22.04, app, 3.8)	success	✅
install-pkg (ubuntu-22.04, app, 3.10)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.8)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.10)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.8)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.10)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.8)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.10)	success	✅
install-pkg (ubuntu-22.04, notset, 3.8)	success	✅
install-pkg (ubuntu-22.04, notset, 3.10)	success	✅
install-pkg (macOS-12, app, 3.8)	success	✅
install-pkg (macOS-12, app, 3.10)	success	✅
install-pkg (macOS-12, fabric, 3.8)	success	✅
install-pkg (macOS-12, fabric, 3.10)	success	✅
install-pkg (macOS-12, pytorch, 3.8)	success	✅
install-pkg (macOS-12, pytorch, 3.10)	success	✅
install-pkg (macOS-12, lightning, 3.8)	success	✅
install-pkg (macOS-12, lightning, 3.10)	success	✅
install-pkg (macOS-12, notset, 3.8)	success	✅
install-pkg (macOS-12, notset, 3.10)	success	✅
install-pkg (windows-2022, app, 3.8)	success	✅
install-pkg (windows-2022, app, 3.10)	success	✅
install-pkg (windows-2022, fabric, 3.8)	success	✅
install-pkg (windows-2022, fabric, 3.10)	success	✅
install-pkg (windows-2022, pytorch, 3.8)	success	✅
install-pkg (windows-2022, pytorch, 3.10)	success	✅
install-pkg (windows-2022, lightning, 3.8)	success	✅
install-pkg (windows-2022, lightning, 3.10)	success	✅
install-pkg (windows-2022, notset, 3.8)	success	✅
install-pkg (windows-2022, notset, 3.10)	success	✅

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 link-check

Check ID	Status
markdown-link-check	success	✅

These checks are required after the changes to src/lightning/fabric/CHANGELOG.md, src/lightning/pytorch/CHANGELOG.md.

Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

codecov · 2023-02-16T11:37:14Z

Codecov Report

Merging #16780 (d988f24) into master (6950a07) will decrease coverage by 22%.
The diff coverage is 100%.

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #16780     +/-   ##
=========================================
- Coverage      82%      59%    -22%     
=========================================
  Files         437      412     -25     
  Lines       31583    31280    -303     
=========================================
- Hits        25784    18534   -7250     
- Misses       5799    12746   +6947

docs/source-pytorch/accelerators/gpu_intermediate.rst

src/lightning/fabric/connector.py

src/lightning/pytorch/trainer/connectors/accelerator_connector.py

switch the default ddp launcher

899b16c

awaelchli added breaking change Includes a breaking change strategy: ddp DistributedDataParallel strategy: ddp spawn labels Feb 16, 2023

awaelchli added this to the 2.0 milestone Feb 16, 2023

github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels Feb 16, 2023

awaelchli self-assigned this Feb 16, 2023

awaelchli and others added 12 commits February 16, 2023 04:51

fix

333f211

fix

a360f12

fixes

0189f1d

Merge branch 'master' into feature/ddp-default

262d72b

update

78ed33d

update

32b745f

changelog

7d933ae

strategies

72fa826

[pre-commit.ci] auto fixes from pre-commit.com hooks

7f227ea

for more information, see https://pre-commit.ci

update docs

9dbf88c

Merge remote-tracking branch 'origin/feature/ddp-default' into featur…

977ef03

…e/ddp-default

[pre-commit.ci] auto fixes from pre-commit.com hooks

25af4f7

for more information, see https://pre-commit.ci

awaelchli marked this pull request as ready for review February 16, 2023 11:20

awaelchli requested review from carmocca, justusschock, williamFalcon, edenlightning, lantiga and Borda as code owners February 16, 2023 11:20

carmocca approved these changes Feb 16, 2023

changelog

3de55b8

mergify bot added the has conflicts label Feb 17, 2023

justusschock approved these changes Feb 17, 2023

Merge branch 'master' into feature/ddp-default

8f6dc9f

Borda approved these changes Feb 20, 2023

Merge branch 'master' into feature/ddp-default

d988f24

mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Feb 20, 2023

awaelchli enabled auto-merge (squash) February 20, 2023 09:58

awaelchli merged commit 81b7c30 into master Feb 20, 2023

awaelchli deleted the feature/ddp-default branch February 20, 2023 11:20

awaelchli mentioned this pull request Feb 21, 2023

RFC: Make subprocess DDP the default when selecting multiple devices #14075

Closed

awaelchli removed the strategy: ddp spawn label Nov 4, 2023
