Create barrier without timeout in prepare_data()
#19448
Conversation
⚡ Required checks status: All passing 🟢

Groups summary:
🟢 pytorch_lightning: Tests workflow
🟢 pytorch_lightning: Azure GPU
🟢 pytorch_lightning: Benchmarks
🟢 fabric: Docs
🟢 pytorch_lightning: Docs
🟢 lightning_fabric: CPU workflow
🟢 lightning_fabric: Azure GPU
🟢 mypy
🟢 install
These checks are required after the changes.

Thank you for your contribution! 💜
Codecov Report
Additional details and impacted files:

@@            Coverage Diff            @@
##           master   #19448      +/-  ##
=========================================
- Coverage      83%      53%      -30%
=========================================
  Files         452      446        -6
  Lines       38136    37967      -169
=========================================
- Hits        31784    20268    -11516
- Misses       6352    17699    +11347
What does this PR do?
Fixes #19266
LightningModule and LightningDataModule have a prepare_data() hook that can be used to run preprocessing and data downloads in multiprocessing/multi-GPU settings, where the hook only runs on local rank 0 to avoid race conditions. A long-standing issue has been that this hook is subject to the collective timeout of the world process group (30 minutes by default). If your preprocessing code takes longer than 30 minutes to complete, you cannot use the prepare_data() mechanism. The equivalent in Fabric is the Fabric.rank_zero_first() context manager, which has the same problem. This PR introduces an "infinite" barrier that will not time out and is used exclusively around the prepare_data() hook (and rank_zero_first() in Fabric); a sketch of the idea follows the comparison below.
What the Trainer did before:
What it does now:
What it does now:
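In words: before, the barrier around prepare_data() ran on the default process group and inherited its 30-minute collective timeout; now, a dedicated barrier without a timeout is used. As a hedged sketch of the approach (an illustration, not necessarily the PR's exact code, and the class name is an assumption), such a barrier can be built by creating a side gloo process group with a practically unbounded timeout:

```python
from datetime import timedelta

import torch.distributed as dist


class InfiniteBarrier:
    """Sketch: a barrier that does not time out, backed by a dedicated
    gloo side group instead of the (possibly NCCL) world group."""

    def __enter__(self):
        self.group = None
        if dist.is_available() and dist.is_initialized():
            # gloo runs on CPU, and a huge timeout makes the barrier
            # effectively infinite without touching the main group.
            self.group = dist.new_group(backend="gloo", timeout=timedelta(days=10_000))
        return self

    def __call__(self) -> None:
        if self.group is not None:
            dist.barrier(group=self.group)

    def __exit__(self, *exc) -> None:
        if self.group is not None:
            # All ranks meet here once rank 0 finishes its long-running work.
            self()
            dist.destroy_process_group(self.group)
```

Hypothetical usage around the hook (variable names assumed):

```python
with InfiniteBarrier():
    if local_rank == 0:
        datamodule.prepare_data()
# Exiting the context synchronizes all ranks with no 30-minute limit.
```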
I have verified this works in multi-node jobs with Lightning Studio by taking a standard Trainer example and implementing prepare_data() with a 40-minute sleep on rank 0. Using the main branch, the jobs time out and fail after ~30 minutes:
Whereas with the implementation in this branch, the 40-minute sleep finishes, the processes meet at the barrier, and training starts:
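For reference, a minimal sketch of that kind of reproduction (module names and Trainer arguments are assumptions, not the exact Studio script):

```python
import time

from torch.utils.data import DataLoader

from lightning.pytorch import LightningDataModule, Trainer
from lightning.pytorch.demos.boring_classes import BoringModel, RandomDataset


class SlowPrepareDataModule(LightningDataModule):
    def prepare_data(self):
        # Runs on local rank 0 only; under the default 30-minute collective
        # timeout this sleep previously caused the other ranks to fail.
        time.sleep(40 * 60)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)


if __name__ == "__main__":
    trainer = Trainer(accelerator="gpu", devices=2, num_nodes=2, max_steps=10)
    trainer.fit(BoringModel(), datamodule=SlowPrepareDataModule())
```

With the infinite barrier in place, the non-zero ranks simply wait at the barrier for the full 40 minutes instead of hitting the collective timeout.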
📚 Documentation preview 📚: https://pytorch-lightning--19448.org.readthedocs.build/en/19448/
cc @Borda @awaelchli @carmocca @justusschock