Set an upper limit on CPU threads in distributed training #18677
Conversation
⚡ Required checks status: All passing 🟢

Groups summary:
- 🟢 pytorch_lightning: Tests workflow
- 🟢 pytorch_lightning: Azure GPU
- 🟢 pytorch_lightning: Benchmarks
- 🟢 fabric: Docs
- 🟢 pytorch_lightning: Docs
- 🟢 lightning_fabric: CPU workflow
- 🟢 lightning_fabric: Azure GPU
- 🟢 mypy
- 🟢 install

These checks are required after the changes. Thank you for your contribution! 💜
Have you tried some extensive data preprocessing? This looks like you are hitting an IO bottleneck.
Codecov Report

Additional details and impacted files:

|          | master | #18677 |   +/- |
|----------|-------:|-------:|------:|
| Coverage |    84% |    55% |  -29% |
| Files    |    428 |    423 |    -5 |
| Lines    |  33550 |  33478 |   -72 |
| Hits     |  28070 |  18329 | -9741 |
| Misses   |   5480 |  15149 | +9669 |
@Borda I don't see how I'm hitting an IO bottleneck. The number of dataloading workers is the same between the comparison experiments, so the IO load is identical. The only thing that changes between these runs is the number of threads used for intra-op parallelism. The experiment simply shows that when too many threads run in parallel and compete with each other, the effective throughput degrades.
I am not saying that your comparison is wrong or unfair. I am just saying that when your preprocessing takes much longer, for example image augmentation that often runs on the CPU, you can see some advantage in using a small number of workers... By the IO bottleneck I meant, for example, that if your IO speed is 10 MB/s, that speed is shared by all workers rather than available to each one, so when you divide 10 MB/s across 5 workers they will be slower, because each would have at most 2 MB/s minus some switching overhead...
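To make the shared-bandwidth point concrete (using the hypothetical 10 MB/s figure from the comment above), the disk throughput is divided among the dataloader workers rather than multiplied by their count:

```python
# Hypothetical numbers from the comment above: the total IO bandwidth is shared
# by all dataloader workers, it does not scale with their count.
total_io_mb_per_s = 10
num_workers = 5
per_worker_mb_per_s = total_io_mb_per_s / num_workers  # 2 MB/s per worker,
# before subtracting any context-switching overhead
print(per_worker_mb_per_s)
```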
What does this PR do?
Fixes #16737
Aligns our default launcher more closely with torchrun's settings regarding the number of threads:
https://github.com/pytorch/pytorch/blob/e0be9ebc181fbf7b2ed4e641cd24a0dadd063f27/torch/distributed/run.py#L702-L714
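For illustration, here is a minimal sketch of the kind of default a launcher can apply: torchrun sets `OMP_NUM_THREADS=1` per process when the user hasn't set it, while the cap proposed here divides the available cores by the number of local processes instead. The helper name and structure below are hypothetical, not the actual code added in this PR:

```python
import os


def _default_omp_num_threads(num_processes: int) -> None:
    # Hypothetical helper, for illustration only.
    # Respect an explicit user setting and only fill in a default otherwise.
    if "OMP_NUM_THREADS" not in os.environ:
        # Cap intra-op threads per process at (CPU cores) / (local processes),
        # instead of torchrun's fixed default of 1.
        num_threads = max(1, (os.cpu_count() or 1) // max(1, num_processes))
        os.environ["OMP_NUM_THREADS"] = str(num_threads)
```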
Benchmarks
All experiments were run on an 8xA100 40GB machine with 256 CPU cores, using PyTorch 2.2.0.dev20230920+cu121 (nightly).
Computer Vision Example
Single GPU
DDP with 2 processes/GPUs
Transformer Example
DDP with 2 processes/GPUs
DDP with 6 processes/GPUs
Discussion
In cases where the CPU load is high, like in the CV example where images need to be resized, the system can be overloaded if too many threads are launched. By restricting the upper bound on the number of threads per GPU to num-cpus / num-procs, we ensure that the total number of threads across all processes on a machine is not significantly larger than the number of available cores. This setting won't affect tasks that are CPU-light, as seen in the transformer example above. Torchrun/elastic sets the limit to 1 thread per DDP process but warns that the value should be tuned. We could do the same, but in this PR we opt for a higher default and in return don't emit warnings. A worked example of the cap is sketched below.
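As a worked example, assuming the 8-GPU, 256-core benchmark machine above: 256 / 8 = 32 intra-op threads per process, so the whole machine runs at most roughly 256 threads instead of 8 x 256 = 2048 with the framework default. The snippet below only illustrates the arithmetic and how a user could still tune the value themselves; it is not the code in this PR:

```python
import torch

# Worked example with the benchmark machine from above: 256 CPU cores, 8 DDP processes.
cpu_cores = 256
num_processes = 8
threads_per_process = cpu_cores // num_processes  # 32 intra-op threads per process

# Inside each worker process the intra-op pool can also be capped at runtime.
# Users who prefer torchrun's behaviour can pass 1 here, or export
# OMP_NUM_THREADS=1 before launching, instead of relying on the default.
torch.set_num_threads(threads_per_process)
print(torch.get_num_threads())  # -> 32
```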
cc @carmocca @justusschock @awaelchli @Borda