Added Horovod distributed backend #1529
Conversation
Hello @tgaddair! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-04-22 21:21:47 UTC
@@ -219,6 +220,13 @@ def set_distributed_mode(self, distributed_backend):
            self.use_ddp = True
            self.data_parallel_device_ids = None
            self.on_gpu = False
        elif distributed_backend == 'horovod':
it would be nice to be transparent to the user.
can we automate setting this? this way the abstraction doesn’t bleed?
(the mpirun thing)
Just to make sure I understand you correctly: is the idea that when running via `horovodrun` or `mpirun`, if the user has not specified `distributed_backend`, then we will automatically set `distributed_backend='horovod'` here?

We could certainly do that when running with `horovodrun` + our Gloo backend, as we have special environment variables we can check (`HOROVOD_RANK`, for example). Doing so with `mpirun` is more tricky, because different MPI implementations have different environment variables. Also, in the future, there might be another distributed backend other than Horovod that uses MPI.

So maybe we could automate it for `horovodrun` but still require them to set it explicitly for `mpirun`? (Let me know if I misunderstood your suggestion.)
> Just to make sure I understand you correctly: is the idea that when running via `horovodrun` or `mpirun`, if the user has not specified `distributed_backend`, then we will automatically set `distributed_backend='horovod'` here?

Yes!

> So maybe we could automate it for `horovodrun` but still require them to set it explicitly for `mpirun`? (Let me know if I misunderstood your suggestion.)

Let's do this for now (v1), and for v2 maybe we set it explicitly for `mpirun`? I just don't know enough about `mpirun` yet, but if `mpirun` can run any backend, then the user should be forced to set it.
Sounds good! I added a `has_horovodrun()` check in `distrib_data_parallel.py` that checks for Gloo or OpenMPI environment variables set by `horovodrun`. Also added a test. Let me know if that aligns with what you were thinking.
This pull request is now in conflict... :(
@tgaddair i love this! wondering if we can automate the comment I added so the user can use horovod without remembering anything other than turning on the flag
This pull request is now in conflict... :(
        set_proc_rank(self.proc_rank)

        if hvd.rank() != 0:
            self.logger = None
is this needed? Loggers shall already have `rank_zero_only`; in such a case see #1408 and setting the global rank:
https://github.com/PyTorchLightning/pytorch-lightning/blob/a22a8142ac65668781a6e6f76d3c4e55ea7c249a/pytorch_lightning/trainer/distrib_parts.py#L494
The errors come from a race condition where different ranks will attempt to `mkdir` the same directory, leading to an exception being raised on one of the workers. For example, this can happen when creating a `SummaryWriter`, which is why in Horovod we only do so on rank 0.
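To illustrate the race, a minimal standalone sketch of the rank-0 guard described above (assuming a plain TensorBoard `SummaryWriter`; this is not the PR's code):

```python
import horovod.torch as hvd
from torch.utils.tensorboard import SummaryWriter

hvd.init()

# Only rank 0 creates the log directory and writer, so the other workers
# never race on the same mkdir call.
writer = SummaryWriter(log_dir='lightning_logs') if hvd.rank() == 0 else None
```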
in lightning we already handle setting loggers, etc only to rank=0 btw
I see. I updated to set the logger ranks to `hvd.rank()` instead of deleting them outside of rank 0. Let me know if that makes more sense.
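Roughly, the change amounts to something like the following sketch (assuming the logger exposes the `rank` attribute that Lightning's `rank_zero_only` machinery relied on at the time; `configure_logger_rank` is an illustrative helper, not the PR's code):

```python
import horovod.torch as hvd


def configure_logger_rank(logger):
    # Rather than dropping the logger on non-zero ranks, tell it which rank it
    # lives on so its rank_zero_only-decorated methods become no-ops there.
    if logger is not None:
        logger.rank = hvd.rank()
```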
parser.add_argument('--trainer-options', required=True)


def test(trainer_options):
test what?
Renamed for clarity and added a docstring at the top of the file to explain usage.
@@ -0,0 +1,36 @@
import argparse
is this meant to be a (unit) test? Because by this name it won't be found.
Why is there `data/horovod/`? Would it rather be `tests/models/script_train_horovod.py`?
Added a docstring at the top for clarity. This script is meant to be executed from `test_horovod.py`. The reason for this is to test driving the training via `horovodrun` using multiple parallel worker processes.
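As a rough illustration of that setup (the path and helper name here are placeholders, not the exact test code), `test_horovod.py` can shell out to `horovodrun` and have it execute the training script in several worker processes:

```python
import json
import subprocess
import sys

# hypothetical location of the training script discussed above
TEST_SCRIPT = 'tests/models/data/horovod/train_default_model.py'


def run_test_from_config(trainer_options):
    """Drive the training script with two parallel workers via horovodrun."""
    cmdline = [
        'horovodrun', '-np', '2',
        sys.executable, TEST_SCRIPT,
        '--trainer-options', json.dumps(trainer_options),
    ]
    subprocess.run(cmdline, check=True)
```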
        # Horovod: wrap optimizers to perform gradient aggregation via allreduce
        self.optimizers = [
            hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
what happens when i do:

    def configure_optimizers(self):
        return Adam(self.generator.parameters()), Adam(self.discriminator.parameters())
Wouldn't this line here break?

    model.named_parameters()

Should we not instead do:

    [hvd.DistributedOptimizer(opt, named_parameters=opt.named_parameters()) for opt in self.optimizers]

This might be a silly question as I don't know the details of DistributedOptimizer.
Yes, good catch! This was an oversight on my part. I added a fix, and added a unit test specifically for GAN / multi-optimizers.
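For context, the fix boils down to wrapping each optimizer with only the parameters it actually owns, along these lines (a sketch; `filter_named_parameters` and `wrap_optimizers` are illustrative helpers, not necessarily the exact code that landed):

```python
import horovod.torch as hvd


def filter_named_parameters(model, optimizer):
    """Keep only the (name, param) pairs that belong to this optimizer's param groups."""
    opt_params = {p for group in optimizer.param_groups for p in group.get('params', [])}
    return [(name, p) for name, p in model.named_parameters() if p in opt_params]


def wrap_optimizers(model, optimizers):
    # One DistributedOptimizer per optimizer (e.g. generator and discriminator
    # in a GAN), each restricted to the parameters it actually updates.
    return [
        hvd.DistributedOptimizer(opt, named_parameters=filter_named_parameters(model, opt))
        for opt in optimizers
    ]
```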
Codecov Report

    @@           Coverage Diff           @@
    ##           master   #1529    +/-   ##
    ========================================
      Coverage      89%      89%
    ========================================
      Files          68       68
      Lines        3811     3906    +95
    ========================================
    + Hits         3385     3471    +86
    - Misses        426      435     +9
Hey @williamFalcon looks like there is an incompatibility between PyTorch Lightning and PyTorch 1.5.0 (released last night) that's causing the CI failures:

Is someone on your end looking into this? Happy to file an issue.
probably an issue is not needed, we are already working on it: #1552
This pull request is now in conflict... :(
@tgaddair fixed on master. want to rebase so we can merge this?
force-pushed from 96a190e to a76736a
we do not use tox anymore...
I see, I mistook a failure due to a corrupt pip cache for a tox issue. Is there a way to refresh the pip cache? I just commented out that step for now to get tests to pass; not sure what will happen when I restore that line.
I tried dropping the cache some time ago and didn't find a way to do it...
Docker images would be much better, I agree. Looks like I was able to refresh the cache by running
🎆
Hey @Borda @williamFalcon looks like the Drone GPU test timeout was recently changed from 30 minutes to 15 minutes. Before this PR, those tests took about 4:30 minutes to run, and were taking about 18 minutes with this PR. However, 10 minutes of that was attributed to the time to build Apex.

As I mentioned in a previous comment, it looks like Apex was failing to install correctly before due to the lack of the nvcc compiler in the image you were using. The new image has nvcc and can successfully build Apex, but takes a very long time. I just ran a test where I removed the line to install Apex, and the tests now pass in about 8:30 minutes (less than the time for CircleCI to finish).

I believe this is consistent with the current test behavior, but I wanted to get your thoughts on Apex: do you feel it's worth building it in these tests and waiting the extra 10 minutes? If so, we can restore it in a follow-up and bump up the test timeout.
Thanks for merging!
In my opinion it's just another reason to create our own test image and use it for all CI, as we do not want to spend most of the machine time on repetitive building/installing of dependencies. Maybe I am missing something, but without Apex there is no AMP support, right? So the test shall fail...?
Re #1561 (comment), after talking to @tgaddair and Meet Shah on pytorch slack: when using Horovod's DistributedOptimizer + native Amp, you need to ensure grads are synced across processes before the unscaling + inf-checking. In other words, you need the following pattern:

    scaler.scale(loss).backward()
    opt.synchronize()

    # if a separate scaler.unscale_(optimizer) is
    # needed, e.g. to allow clipping unscaled gradients,
    # it should come here, after opt.synchronize()

    with opt.skip_synchronize():
        scaler.step(opt)
    scaler.update()

I think a similar pattern was needed with apex.
@mcarilli mind sending a PR? ❤️
@Borda I contacted @williamFalcon on pytorch slack, he said he was refactoring the horovod integration already and would ping me for review.
Hey @mcarilli, I think the AMP integration should already be in place with the Horovod backend. Are you seeing issues when trying to use it?
I haven't seen any issues, just wanted to remind about the synchronize() pattern. If it's already taken care of, ignore me.
Thanks for clarifying and raising the issue. We should definitely double check!
Fixes #1518.
Make the following change to your Trainer to run on GPU (single or multiple) with Horovod:
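The original snippet did not survive extraction; a minimal sketch of what the GPU configuration presumably looks like, using the `distributed_backend` flag this PR adds (the surrounding values are illustrative):

```python
from pytorch_lightning import Trainer

# one GPU per worker process; horovodrun controls how many processes are launched
trainer = Trainer(distributed_backend='horovod', gpus=1)
```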
Or to run on CPU:
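Presumably the CPU variant simply omits the GPU count (again a sketch, not the original snippet):

```python
from pytorch_lightning import Trainer

trainer = Trainer(distributed_backend='horovod')
```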
Then the training script can be launched via the horovodrun command-line tool, where the host/GPU allocation is specified:
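For example, something along these lines (host names, process counts, and the script name are placeholders):

```
# 4 worker processes in total: 2 on host1 and 2 on host2
horovodrun -np 4 -H host1:2,host2:2 python train.py
```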