
Improve the suggested num_workers warning #18591

Merged

awaelchli merged 32 commits into master from feature/num-workers on Sep 21, 2023

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Sep 19, 2023

What does this PR do?

This PR improves the warning logic that suggests the number of workers to use in a DataLoader. Previously, the warning was raised whenever the num_workers parameter in the DataLoader was less than or equal to 2. This check is reasonable when running on a single GPU; however, the suggested value did not take the multi-GPU training use case into account.

The ideal choice for the number of dataloader workers per rank lies in the range 0 < num_workers < cpu_count // local_world_size. This PR also ensures we never recommend more workers than there are visible CPU cores (#15572).
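
For illustration, here is a minimal usage sketch of the new utility, assuming it is exported from lightning.fabric.utilities as the touched files suggest; treat the exact import path and environment handling as assumptions rather than the definitive API:

```python
import os

from torch.utils.data import DataLoader

# Assumed import path, based on the files touched by this PR.
from lightning.fabric.utilities import suggested_max_num_workers

dataset = list(range(1000))  # any map-style dataset works here

# Number of training processes on this machine, e.g. one per GPU;
# torchrun sets LOCAL_WORLD_SIZE, fall back to 1 for single-process runs.
local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", 1))

# At most cpu_count // local_world_size workers, never fewer than 1.
max_workers = suggested_max_num_workers(local_world_size)

train_loader = DataLoader(dataset, batch_size=32, num_workers=max_workers)
```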

Fixes #15572
Fixes #2196 (btw this is the second oldest issue we have open atm lol)


📚 Documentation preview 📚: https://pytorch-lightning--18591.org.readthedocs.build/en/18591/

cc @Borda @carmocca @justusschock @awaelchli

@awaelchli awaelchli added the feature (Is an improvement or enhancement) and performance labels Sep 19, 2023
@awaelchli awaelchli added this to the 2.1 milestone Sep 19, 2023
@github-actions github-actions bot added the pl (Generic label for PyTorch Lightning package) label Sep 19, 2023
@github-actions github-actions bot added the fabric (lightning.fabric.Fabric) label Sep 19, 2023
@awaelchli awaelchli marked this pull request as ready for review September 19, 2023 16:03
@github-actions
Contributor

github-actions bot commented Sep 19, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, src/lightning/pytorch/trainer/connectors/data_connector.py, src/lightning/pytorch/utilities/__init__.py, tests/tests_pytorch/trainer/connectors/test_data_connector.py, tests/tests_pytorch/trainer/test_dataloaders.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
[pytorch-lightning (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=175465&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success
[pytorch-lightning (GPUs) (testing PyTorch latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=175465&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/data_connector.py, src/lightning/pytorch/utilities/__init__.py, tests/tests_pytorch/trainer/connectors/test_data_connector.py, tests/tests_pytorch/trainer/test_dataloaders.py, src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, src/lightning/pytorch/trainer/connectors/data_connector.py, src/lightning/pytorch/utilities/__init__.py.

🟢 fabric: Docs
Check ID Status
docs-make (fabric, doctest) success
docs-make (fabric, html) success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, docs/source-fabric/api/utilities.rst.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/data_connector.py, src/lightning/pytorch/utilities/__init__.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11) success
fabric-cpu (windows-2022, lightning, 3.9, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success
fabric-cpu (macOS-12, fabric, 3.11, 2.0) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.0) success
fabric-cpu (windows-2022, fabric, 3.11, 2.0) success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, tests/tests_fabric/utilities/test_data.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
[lightning-fabric (GPUs) (testing Fabric latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=175467&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success
[lightning-fabric (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=175467&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, tests/tests_fabric/utilities/test_data.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, src/lightning/pytorch/trainer/connectors/data_connector.py, src/lightning/pytorch/utilities/__init__.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/fabric/utilities/__init__.py, src/lightning/fabric/utilities/data.py, src/lightning/pytorch/trainer/connectors/data_connector.py, src/lightning/pytorch/utilities/__init__.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates every 180 seconds for up to 60 minutes. If you have any other questions, contact carmocca for help.

src/lightning/pytorch/trainer/connectors/data_connector.py (outdated, resolved)

-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    upper_bound = suggested_max_num_workers(trainer.num_devices)
+    if dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
+        # TODO
+        # if changed, update the `filterwarnings` snippet in 'speed.html#num-workers'
+        rank_zero_warn(
Contributor

Do you think you could assert that the dataloader warning is not raised?
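
One hedged way to do that with pytest's built-in recwarn fixture; the run_training helper below is hypothetical, not part of Lightning's test suite:

```python
def test_no_num_workers_warning(recwarn):
    run_training()  # hypothetical helper that builds a Trainer and calls fit()

    # Assert that no num_workers suggestion warning was recorded.
    messages = [str(w.message) for w in recwarn.list]
    assert not any("num_workers" in m for m in messages)
```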

src/lightning/fabric/CHANGELOG.md (outdated, resolved)
@mergify mergify bot removed the has conflicts label Sep 21, 2023
@mergify mergify bot added the ready (PRs ready to be merged) label Sep 21, 2023
@awaelchli awaelchli merged commit 57f5268 into master Sep 21, 2023
114 checks passed
@awaelchli awaelchli deleted the feature/num-workers branch September 21, 2023 13:38
# Merged implementation; the signature line is reconstructed from the surrounding discussion.
def suggested_max_num_workers(local_world_size: int) -> int:
    if local_world_size < 1:
        raise ValueError(f"`local_world_size` should be >= 1, got {local_world_size}.")
    cpu_count = _num_cpus_available()
    return max(1, cpu_count // local_world_size)
Contributor

@stas00 stas00 commented Oct 5, 2023

You need at least one CPU core per GPU-bound process, so I think this would make a more sensible recommendation:

return max(1, (cpu_count // local_world_size) - 1)

i.e., if you have 48 CPU cores and 8 GPUs, you need at least 8 cores for the main processes, so only 40 remain, and 40 / 8 = 5.

And then there is the OS as well, which needs at least a core or two for its own functioning.
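
A quick sketch contrasting the merged formula with this proposal on the 48-core, 8-GPU example above; the function names are illustrative, not the library's API:

```python
def merged_suggestion(cpu_count: int, local_world_size: int) -> int:
    # Formula merged in this PR: an even share of the cores per local process.
    return max(1, cpu_count // local_world_size)


def proposed_suggestion(cpu_count: int, local_world_size: int) -> int:
    # stas00's variant: reserve one core per rank for the main process itself.
    return max(1, (cpu_count // local_world_size) - 1)


print(merged_suggestion(48, 8))    # 6
print(proposed_suggestion(48, 8))  # 5
```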

@@ -442,14 +442,11 @@ def _worker_check(dataloader: object, using_spawn: bool, name: str) -> None:
             "strategy=ddp_spawn and num_workers=0 may result in data loading bottlenecks."
             " Consider setting num_workers>0 and persistent_workers=True"
         )
-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    elif dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
Contributor

The first part is pretty much certain to be True on the majority of setups, I think: the default in many frameworks is usually 2 workers, most modern desktop machines have lots of CPU cores, and any serious GPU node will have plenty as well.

Would you consider that num_workers == 2 is actually a reasonable setting?

Contributor Author

@awaelchli awaelchli commented Oct 6, 2023

If I read you correctly, you suggest simplifying the condition by dropping `2 < upper_bound` because it is effectively always true on deep learning systems. I think I agree. So we would be left with only this check:

dataloader.num_workers <= 2

Now the question is whether this should become a different upper limit. The history here is that when this check was added, early in PL's development, the majority of users were doing computer vision, where num_workers > 2 was almost always mandatory for applying augmentations.

Today, if we do small- to medium-size LLM training on small pre-tokenized datasets that fit on a machine, we don't require that many workers, so num_workers=2 is not that unreasonable. We could consider dropping the condition to

dataloader.num_workers < 2

to emit the warning.
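
A minimal sketch of what that simplified check could look like; the function name and message text are paraphrased for illustration, not the code that was merged:

```python
import warnings


def worker_check_sketch(num_workers: int, upper_bound: int, using_spawn: bool, name: str) -> None:
    # Simplified condition discussed above: warn only when fewer than
    # two workers are configured (and not using ddp_spawn).
    if num_workers < 2 and not using_spawn:
        warnings.warn(
            f"The dataloader '{name}' does not have many workers, which may be a bottleneck."
            f" Consider increasing `num_workers` (try at most {upper_bound})."
        )
```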

Contributor

@stas00 stas00 commented Oct 6, 2023

We are in agreement, Adrian.

  • Language models, especially where data is preprocessed ahead of time, a la Megatron-LM/NeMo: 2 workers is the norm in my experience.
  • Vision models are a different story: if the amount of transforms is large, loading can be very slow and does warrant more workers.

Ideally, the general solution would be to give users a breakdown of the [DL][fwd+bwd][logging] time spans, so that they can see whether their DL is a bottleneck. I include logging as well, since it can also be an issue if one does blocking logging to slow or remote IO.

This switches to measuring the real impact of num_workers on the efficiency of the compute, as compared to the guesswork that is implemented now.

So, for example, you could warn a user if you see that dl_time > 10% of compute time, or something like that. I don't have a good general recommendation for the threshold %, since an LLM/VLM workload can be very different from a single-GPU workload. But the principle is the same: compute is the most expensive resource, especially if it's rented per hour, so the user wants to detect and remove bottlenecks to pay less and, of course, reach the finish line earlier.

Now, with the rise of cloud storage there is a new problem emerging: the data may not even be present on the compute node at the time of compute. So the DL might not only need to do transforms, it also needs to fetch remote data from the cloud. Here 2 workers are almost never enough. During IDEFICS-80B training we had this issue, but we couldn't raise the number of workers, not because we lacked spare cores, but because WebDataset, which we were using for just-in-time prefetch, led to huge processes: the 2x8 workers were consuming the lion's share of 1+TB of CPU memory, and with even 3 workers cgroups would kill the training.

I hope that WebDataset and other streaming DL solutions will get better over time, but this was a very painful experience 6 months ago, since we did want more workers but couldn't have them. I think H100 nodes are coming with at least 2TB of CPU memory in some clouds, so this should help. But then H100s run 2-6x faster than A100s, so one needs to feed the fire even faster, which means more workers will be needed, and the rule of "at least 2" might no longer apply again.

In other words, load-based heuristics (like the one I proposed above) are needed to really help users optimize their setup, and one-size-fits-all guessing heuristics will break even more.
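
For illustration, a minimal sketch of the load-based heuristic described above: time dataloader fetches separately from the training step and warn when data loading dominates. All names and the 10% threshold are illustrative assumptions, not Lightning code:

```python
import time
import warnings


def timed_epoch(dataloader, train_step, dl_fraction_threshold=0.10):
    """Run one epoch, separately accumulating data-loading and compute time."""
    dl_time = compute_time = 0.0
    it = iter(dataloader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting on the dataloader
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # fwd + bwd + optimizer step
        # NOTE: on GPU, a torch.cuda.synchronize() would be needed here
        # for accurate compute timings.
        t2 = time.perf_counter()
        dl_time += t1 - t0
        compute_time += t2 - t1
    if compute_time > 0 and dl_time > dl_fraction_threshold * compute_time:
        warnings.warn(
            f"Data loading took {dl_time:.1f}s vs. {compute_time:.1f}s of compute;"
            " consider increasing num_workers."
        )
    return dl_time, compute_time
```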

Labels
fabric (lightning.fabric.Fabric), feature (Is an improvement or enhancement), performance, pl (Generic label for PyTorch Lightning package), ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • Incorrect num_workers warning
  • Let's add a suggested_num_workers() method?
4 participants