Improve the suggested num_workers warning #18591
```diff
-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    upper_bound = suggested_max_num_workers(trainer.num_devices)
+    if dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
         # TODO
         # if changed, update the `filterwarnings` snippet in 'speed.html#num-workers'
         rank_zero_warn(
```
Do you think you could assert that the dataloader warning is not raised?
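For reference, a minimal sketch of how such an assertion could look (the call that triggers the check is a placeholder, not the actual test added in this PR):

```python
import warnings

def test_num_workers_warning_not_raised():
    # `run_worker_check()` is a placeholder for whatever exercises the warning path under test
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        run_worker_check()
    assert not any("num_workers" in str(w.message) for w in caught)
```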
```python
    if local_world_size < 1:
        raise ValueError(f"`local_world_size` should be >= 1, got {local_world_size}.")
    cpu_count = _num_cpus_available()
    return max(1, cpu_count // local_world_size)
```
You need at least one cpu-core per gpu-bound process, so I think this would make a more sensible recommendation:
return max(1, (cpu_count // local_world_size) - 1)
i.e. if you have 48 cpu-cores and 8 gpus, you need at least 8 cores for the main processes, so only 40 remain - so 40/8 = 5.
And then there is the OS as well, which needs at least a core or two for its own functioning.
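A sketch of that variant (using `os.cpu_count()` as a stand-in for the internal CPU-count helper; not the code this PR merges):

```python
import os

def suggested_max_num_workers_reserving_main(local_world_size: int) -> int:
    """Variant discussed above: reserve one core per GPU-bound main process."""
    if local_world_size < 1:
        raise ValueError(f"`local_world_size` should be >= 1, got {local_world_size}.")
    cpu_count = os.cpu_count() or 1
    # e.g. 48 cores and 8 GPUs: 48 // 8 = 6, minus 1 for the main process -> 5 workers per rank
    return max(1, cpu_count // local_world_size - 1)
```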
```diff
@@ -442,14 +442,11 @@ def _worker_check(dataloader: object, using_spawn: bool, name: str) -> None:
             "strategy=ddp_spawn and num_workers=0 may result in data loading bottlenecks."
             " Consider setting num_workers>0 and persistent_workers=True"
         )
-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    elif dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
```
The first part is pretty much certain to be True on the majority of setups, I think, since the default in many frameworks is usually 2 workers, most modern desktop machines have lots of cpu-cores, and any serious GPU node will have lots of cpu-cores.
Would you consider that num_workers == 2 is actually a reasonable setting?
If I read you correctly, you suggest simplifying the condition by dropping 2 < upper_bound because it is essentially always true on most DL systems. I think I agree. So we would be left with only this check:
dataloader.num_workers <= 2
Now the question is whether this should become a different limit. The history here is that when this check was added, back when PL was being developed, the majority of users were doing computer vision, and num_workers > 2 was almost always mandatory for applying augmentations.
Today, if we do small- to medium-size LLM training on small pre-tokenized datasets that fit in a machine, we're not going to require that many workers. So num_workers=2 is not that unreasonable. We could consider dropping the condition to
dataloader.num_workers < 2
to emit the warning.
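A sketch of what that simplified check could look like (illustrative only; the message is a placeholder and this is not necessarily what the PR ends up merging):

```python
upper_bound = suggested_max_num_workers(trainer.num_devices)
if dataloader.num_workers < 2:
    # illustrative message; the real warning text lives in `_worker_check`
    rank_zero_warn(f"Consider setting `num_workers` up to {upper_bound} in your DataLoader.")
```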
We are in agreement, Adrian:
- language models, especially where data is preprocessed ahead of time, a la Megatron-LM/nemo: 2 workers is the norm in my experience
- vision models are a different story: if the amount of transforms is huge, data loading can be very slow and does warrant more workers
Ideally, the general solution would be to provide users with a breakdown of the time spent in [DL][fwd+bwd][logging] spans, so that they can see whether their DL is a bottleneck. I include logging as well since it can also be an issue if one does blocking logging to slow or remote IO.
Now you're switching to measuring the real impact of num_workers on the efficiency of the compute, as compared to the guesswork that is implemented now.
So, for example, you could warn a user if you see that dl_time > 10% of compute time, or something like that. I don't have a good general recommendation on the threshold %, since an LLM/VLM workload can be very different from a single-GPU workload. But the principle is the same: compute is the most expensive resource, especially if it's rented per hour, so the user wants to detect and remove the bottlenecks to pay less and, of course, reach the finish line earlier.
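A minimal sketch of such a load-based check (all names and the 10% threshold are illustrative, not an existing Lightning API):

```python
import time

DL_FRACTION_THRESHOLD = 0.10  # illustrative: flag the DataLoader if it takes >10% of compute time

def timed_step(dataloader_iter, step_fn):
    """Measure how long a step waits on the DataLoader versus on compute (fwd+bwd+logging)."""
    t0 = time.perf_counter()
    batch = next(dataloader_iter)   # time spent waiting for data
    t1 = time.perf_counter()
    step_fn(batch)                  # time spent on compute
    t2 = time.perf_counter()
    dl_time, compute_time = t1 - t0, t2 - t1
    if dl_time > DL_FRACTION_THRESHOLD * compute_time:
        print(f"DataLoader may be a bottleneck: {dl_time:.3f}s waiting vs {compute_time:.3f}s compute")
    return dl_time, compute_time
```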
Now, with the rise of cloud storage, there is a new problem emerging: the data may not even be present on the compute node at the time of compute. So not only might the DL need to do transforms, it also needs to fetch remote data from the cloud. Here 2 workers are almost never enough. During IDEFICS-80B training we had this issue, but we couldn't raise the number of workers, not because we didn't have the spare cores, but because WebDataset, which we were using for just-in-time prefetch, led to huge processes: the 2x8 workers were consuming the lion's share of 1+TB of CPU memory, and with even 3 workers we would get cgroups killing the training.
I hope that WebDataset and other streaming DL solutions will get better over time, but this was a very painful experience 6 months ago, since we did want more workers but couldn't have them.
I think H100 nodes are coming with at least 2TB of CPU memory in some clouds, so this should help. But then H100s run 2-6x faster than A100s, so one needs to feed the fire even faster, which means more workers will be needed, and the rule of at least 2 might no longer apply again.
In other words, load-based heuristics (like the one I proposed above) are needed to really help users optimize their setup, and one-size-fits-all guessing heuristics will break down even more.
What does this PR do?
This PR improves the warning logic that suggests the number of workers to use in a DataLoader. Previously, the warning would be raised if the num_workers parameter of the DataLoader was smaller than or equal to 2. This check is reasonable when running on a single GPU; however, the suggested value did not take the multi-GPU training use case into account. The ideal number of workers for a dataloader per rank is in the range 0 < num_workers < cpu_count // local_world_size. This PR also ensures we never recommend more workers than there are visible CPU cores (#15572).
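For illustration, a per-rank choice following that rule might look like this (a sketch, not code from this PR; `LOCAL_WORLD_SIZE` and `my_dataset` are placeholders for the user's own setup):

```python
import os
from torch.utils.data import DataLoader

local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", 1))  # processes on this node
max_workers = max(1, (os.cpu_count() or 1) // local_world_size)

loader = DataLoader(my_dataset, batch_size=32, num_workers=max_workers)
```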
Fixes #15572
Fixes #2196 (btw this is the second oldest issue we have open atm lol)
📚 Documentation preview 📚: https://pytorch-lightning--18591.org.readthedocs.build/en/18591/
cc @Borda @carmocca @justusschock @awaelchli