Make DDP subprocess the default launcher for multi-device #16780
Conversation
Codecov Report

| | master | #16780 | +/- |
|---|---|---|---|
| Coverage | 82% | 59% | -22% |
| Files | 437 | 412 | -25 |
| Lines | 31583 | 31280 | -303 |
| Hits | 25784 | 18534 | -7250 |
| Misses | 5799 | 12746 | +6947 |
What does this PR do?
When using multiple devices, the strategy now defaults to "ddp" instead of "ddp_spawn" when none is set.
Before: with multiple devices and no strategy set, the Trainer defaulted to `"ddp_spawn"`.
Now: it defaults to `"ddp"`.
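The new default can be sketched with a small hypothetical helper (illustrative only, not Lightning's actual resolution code; the function name and signature are made up for this example):

```python
from typing import Optional

def default_strategy(num_devices: int, strategy: Optional[str] = None) -> str:
    """Hypothetical stand-in for Lightning's strategy resolution."""
    # An explicitly requested strategy always wins.
    if strategy is not None:
        return strategy
    # Lightning 2.0: multi-device runs default to "ddp" (subprocess
    # launcher) instead of "ddp_spawn" (multiprocessing spawn launcher).
    if num_devices > 1:
        return "ddp"
    return "single_device"

print(default_strategy(4))               # -> ddp
print(default_strategy(4, "ddp_spawn"))  # explicit choice is still honored
print(default_strategy(1))               # -> single_device
```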
DDP-spawn is useful for debugging and testing, but it introduces "invisible" process boundaries that beginner users in particular are not aware of. Since the spawned processes join after `Trainer.fit()` etc. return, the state in the main process is not updated (only the model parameters are transferred back), which can lead to unexpected behavior. Very early in Lightning's life, spawn also worked in notebooks, which was the primary reason it was chosen as the default. Support for that was later dropped in PyTorch, and we introduced ddp-fork as an alternative. Over time we started discouraging ddp-spawn in docs, tutorials, etc., and always promoted explicitly setting `strategy="ddp"`. With Lightning 2.0, we are now ready to switch fully over to ddp as the default multi-device strategy for small to medium sized models.
cc @Borda @justusschock @carmocca @awaelchli