Improve the suggested num_workers warning #18591
```diff
-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    upper_bound = suggested_max_num_workers(trainer.num_devices)
+    if dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
         # TODO
         # if changed, update the `filterwarnings` snippet in 'speed.html#num-workers'
         rank_zero_warn(
```
Do you think you could assert that the dataloader warning is not raised?
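For reference, a minimal sketch of how such an assertion could look (the call that triggers the check is a placeholder, not the actual test added in this PR):

```python
import warnings

def test_num_workers_warning_not_raised():
    # `run_worker_check()` is a placeholder for whatever exercises the warning path under test
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        run_worker_check()
    assert not any("num_workers" in str(w.message) for w in caught)
```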
```python
    if local_world_size < 1:
        raise ValueError(f"`local_world_size` should be >= 1, got {local_world_size}.")
    cpu_count = _num_cpus_available()
    return max(1, cpu_count // local_world_size)
```
You need at least one cpu-core per gpu-bound process, so I think this would make a more sensible recommendation:
return max(1, (cpu_count // local_world_size) - 1)
i.e. if you have 48 cpu-cores and 8 gpus, you need at least 8 cores for the main processes, so only 40 remain - so 40/8 = 5.
And then there is the OS as well, which needs at least a core or two for its own functioning.
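A sketch of that variant (using `os.cpu_count()` as a stand-in for the internal CPU-count helper; not the code this PR merges):

```python
import os

def suggested_max_num_workers_reserving_main(local_world_size: int) -> int:
    """Variant discussed above: reserve one core per GPU-bound main process."""
    if local_world_size < 1:
        raise ValueError(f"`local_world_size` should be >= 1, got {local_world_size}.")
    cpu_count = os.cpu_count() or 1
    # e.g. 48 cores and 8 GPUs: 48 // 8 = 6, minus 1 for the main process -> 5 workers per rank
    return max(1, cpu_count // local_world_size - 1)
```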
```diff
@@ -442,14 +442,11 @@ def _worker_check(dataloader: object, using_spawn: bool, name: str) -> None:
             "strategy=ddp_spawn and num_workers=0 may result in data loading bottlenecks."
             " Consider setting num_workers>0 and persistent_workers=True"
         )
-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    elif dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
```
The first part is pretty much certain to be True on the majority of setups, I think, since the default in many frameworks is usually 2 workers, most modern desktop machines have lots of cpu-cores, and any serious GPU node will have lots of cpu-cores.
Would you consider that num_workers == 2 is actually a reasonable setting?
If I read you correctly, you suggest simplifying the condition by dropping 2 < upper_bound because it is essentially always true on most DL systems. I think I agree. So we would be left with only this check:
dataloader.num_workers <= 2
Now the question is whether this should become a different limit. The history here is that when this check was added, back when PL was being developed, the majority of users were doing computer vision, and num_workers > 2 was almost always mandatory for applying augmentations.
Today, if we do small- to medium-size LLM training on small pre-tokenized datasets that fit in a machine, we're not going to require that many workers. So num_workers=2 is not that unreasonable. We could consider dropping the condition to
dataloader.num_workers < 2
to emit the warning.
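A sketch of what that simplified check could look like (illustrative only; the message is a placeholder and this is not necessarily what the PR ends up merging):

```python
upper_bound = suggested_max_num_workers(trainer.num_devices)
if dataloader.num_workers < 2:
    # illustrative message; the real warning text lives in `_worker_check`
    rank_zero_warn(f"Consider setting `num_workers` up to {upper_bound} in your DataLoader.")
```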
We are in agreement, Adrian:
- language models, especially where data is preprocessed ahead of time, a la Megatron-LM/nemo: 2 workers is the norm in my experience
- vision models are a different story: if the amount of transforms is huge, data loading can be very slow and does warrant more workers
Ideally, the general solution would be to provide users with a breakdown of the time spent in [DL][fwd+bwd][logging] spans, so that they can see whether their DL is a bottleneck. I include logging as well since it can also be an issue if one does blocking logging to slow or remote IO.
Now you're switching to measuring the real impact of num_workers on the efficiency of the compute, as compared to the guesswork that is implemented now.
So, for example, you could warn a user if you see that dl_time > 10% of compute time, or something like that. I don't have a good general recommendation on the threshold %, since an LLM/VLM workload can be very different from a single-GPU workload. But the principle is the same: compute is the most expensive resource, especially if it's rented per hour, so the user wants to detect and remove the bottlenecks to pay less and, of course, reach the finish line earlier.
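A minimal sketch of such a load-based check (all names and the 10% threshold are illustrative, not an existing Lightning API):

```python
import time

DL_FRACTION_THRESHOLD = 0.10  # illustrative: flag the DataLoader if it takes >10% of compute time

def timed_step(dataloader_iter, step_fn):
    """Measure how long a step waits on the DataLoader versus on compute (fwd+bwd+logging)."""
    t0 = time.perf_counter()
    batch = next(dataloader_iter)   # time spent waiting for data
    t1 = time.perf_counter()
    step_fn(batch)                  # time spent on compute
    t2 = time.perf_counter()
    dl_time, compute_time = t1 - t0, t2 - t1
    if dl_time > DL_FRACTION_THRESHOLD * compute_time:
        print(f"DataLoader may be a bottleneck: {dl_time:.3f}s waiting vs {compute_time:.3f}s compute")
    return dl_time, compute_time
```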
Now, with the rise of cloud storage, there is a new problem emerging: the data may not even be present on the compute node at the time of compute. So not only might the DL need to do transforms, it also needs to fetch remote data from the cloud. Here 2 workers are almost never enough. During IDEFICS-80B training we had this issue, but we couldn't raise the number of workers, not because we didn't have the spare cores, but because WebDataset, which we were using for just-in-time prefetch, led to huge processes: the 2x8 workers were consuming the lion's share of 1+TB of CPU memory, and with even 3 workers we would get cgroups killing the training.
I hope that WebDataset and other streaming DL solutions will get better over time, but this was a very painful experience 6 months ago, since we did want more workers but couldn't have them.
I think H100 nodes are coming with at least 2TB of CPU memory in some clouds, so this should help. But then H100s run 2-6x faster than A100s, so one needs to feed the fire even faster, which means more workers will be needed, and the rule of at least 2 might no longer apply again.
In other words, load-based heuristics (like the one I proposed above) are needed to really help users optimize their setup, and one-size-fits-all guessing heuristics will break down even more.
What does this PR do?
This PR improves the warning logic that suggests the number of workers to use in a DataLoader. Previously, the warning would be raised if the num_workers parameter of the DataLoader was smaller than or equal to 2. This check is reasonable when running on a single GPU; however, the suggested value did not take the multi-GPU training use case into account. The ideal number of workers for a dataloader per rank is in the range 0 < num_workers < cpu_count // local_world_size. This PR also ensures we never recommend more workers than there are visible CPU cores (#15572).
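For illustration, a per-rank choice following that rule might look like this (a sketch, not code from this PR; `LOCAL_WORLD_SIZE` and `my_dataset` are placeholders for the user's own setup):

```python
import os
from torch.utils.data import DataLoader

local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", 1))  # processes on this node
max_workers = max(1, (os.cpu_count() or 1) // local_world_size)

loader = DataLoader(my_dataset, batch_size=32, num_workers=max_workers)
```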
Fixes #15572
Fixes #2196 (btw this is the second oldest issue we have open atm lol)
📚 Documentation preview 📚: https://pytorch-lightning--18591.org.readthedocs.build/en/18591/
cc @Borda @carmocca @justusschock @awaelchli