Create barrier without timeout in prepare_data()
#19448
Conversation
⚡ Required checks status: All passing 🟢

Groups summary:
🟢 pytorch_lightning: Tests workflow
🟢 pytorch_lightning: Azure GPU
🟢 pytorch_lightning: Benchmarks
🟢 fabric: Docs
🟢 pytorch_lightning: Docs
🟢 lightning_fabric: CPU workflow
🟢 lightning_fabric: Azure GPU
🟢 mypy
🟢 install
These checks are required after the changes.

Thank you for your contribution! 💜
Codecov Report
Additional details and impacted files:

@@            Coverage Diff            @@
##           master   #19448      +/-  ##
=========================================
- Coverage      83%      53%      -30%
=========================================
  Files         452      446        -6
  Lines       38136    37967      -169
=========================================
- Hits        31784    20268    -11516
- Misses       6352    17699    +11347
What does this PR do?
Fixes #19266
LightningModule and LightningDataModule have a prepare_data() hook that can be used to run preprocessing and data downloads in multiprocessing/multi-GPU settings, where the hook only runs on local rank 0 to avoid race conditions. A long-standing issue has been that this hook is subject to the collective timeout of the world process group (30 minutes by default). If your preprocessing code takes longer than 30 minutes to complete, you cannot use the prepare_data() mechanism. The equivalent in Fabric is the Fabric.rank_zero_first() context manager, which has the same problem. This PR introduces an "infinite" barrier that will not time out and is used exclusively around the prepare_data() hook (and rank_zero_first() in Fabric); a sketch of the idea follows the comparison below.
What the Trainer did before:
What it does now:
What it does now:
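In words: before, the barrier around prepare_data() ran on the default process group and inherited its 30-minute collective timeout; now, a dedicated barrier without a timeout is used. As a hedged sketch of the approach (an illustration, not necessarily the PR's exact code, and the class name is an assumption), such a barrier can be built by creating a side gloo process group with a practically unbounded timeout:

```python
from datetime import timedelta

import torch.distributed as dist


class InfiniteBarrier:
    """Sketch: a barrier that does not time out, backed by a dedicated
    gloo side group instead of the (possibly NCCL) world group."""

    def __enter__(self):
        self.group = None
        if dist.is_available() and dist.is_initialized():
            # gloo runs on CPU, and a huge timeout makes the barrier
            # effectively infinite without touching the main group.
            self.group = dist.new_group(backend="gloo", timeout=timedelta(days=10_000))
        return self

    def __call__(self) -> None:
        if self.group is not None:
            dist.barrier(group=self.group)

    def __exit__(self, *exc) -> None:
        if self.group is not None:
            # All ranks meet here once rank 0 finishes its long-running work.
            self()
            dist.destroy_process_group(self.group)
```

Hypothetical usage around the hook (variable names assumed):

```python
with InfiniteBarrier():
    if local_rank == 0:
        datamodule.prepare_data()
# Exiting the context synchronizes all ranks with no 30-minute limit.
```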
I have verified this works in multi-node jobs with Lightning Studio by taking a standard Trainer example and implementing prepare_data() with a 40-minute sleep on rank 0. Using the main branch, the jobs time out and fail after ~30 minutes:
Whereas with the implementation in this branch, the 40-minute sleep finishes, the processes meet at the barrier, and training starts:
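For reference, a minimal sketch of that kind of reproduction (module names and Trainer arguments are assumptions, not the exact Studio script):

```python
import time

from torch.utils.data import DataLoader

from lightning.pytorch import LightningDataModule, Trainer
from lightning.pytorch.demos.boring_classes import BoringModel, RandomDataset


class SlowPrepareDataModule(LightningDataModule):
    def prepare_data(self):
        # Runs on local rank 0 only; under the default 30-minute collective
        # timeout this sleep previously caused the other ranks to fail.
        time.sleep(40 * 60)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)


if __name__ == "__main__":
    trainer = Trainer(accelerator="gpu", devices=2, num_nodes=2, max_steps=10)
    trainer.fit(BoringModel(), datamodule=SlowPrepareDataModule())
```

With the infinite barrier in place, the non-zero ranks simply wait at the barrier for the full 40 minutes instead of hitting the collective timeout.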
📚 Documentation preview 📚: https://pytorch-lightning--19448.org.readthedocs.build/en/19448/
cc @Borda @awaelchli @carmocca @justusschock