Fix Horovod distributed backend to set the root_gpu property #1669
Conversation
Hello @tgaddair! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-05-01 15:46:13 UTC
Codecov Report
@@           Coverage Diff           @@
##           master   #1669    +/-  ##
=======================================
- Coverage      88%     88%     -0%
=======================================
  Files          69      69
  Lines        4129    4133     +4
=======================================
+ Hits         3653    3656     +3
- Misses        476     477     +1
@@ -570,8 +570,9 @@ def horovod_train(self, model):

         if torch.cuda.is_available() and self.on_gpu:
             # Horovod: pin GPU to local rank
             torch.cuda.set_device(hvd.local_rank())
             model.cuda(hvd.local_rank())
+            self.root_gpu = hvd.local_rank()
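For readers unfamiliar with Horovod's device handling, here is a standalone sketch of the pinning pattern used in this hunk. It is illustrative only, outside of any Trainer, and not the Lightning source:

```python
# Each Horovod worker process pins its CUDA device to its local rank so that
# its tensors and collective operations use the GPU it owns; the PR additionally
# records that rank (as self.root_gpu) for later device placement.
import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin this process to one GPU
    root_gpu = hvd.local_rank()              # remember which GPU this worker owns
```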
Shall it rather be a protected self._root_gpu?
Also add it to the Trainer init.
Moved the logic to the init method. It seems there isn't a protected _root_gpu; I was using the existing public root_gpu property set elsewhere, following this pattern: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/distrib_data_parallel.py#L348
Let me know if the new approach looks good.
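For context, a rough sketch of what recording root_gpu at Trainer construction time could look like for the Horovod backend. The surrounding Trainer skeleton and argument names below are assumptions for illustration, not the merged pytorch-lightning code:

```python
# Illustrative skeleton only: record the root GPU for the Horovod backend up
# front, so later stages (e.g. testing after a restore) know which device the
# model belongs on.
import torch


class Trainer:
    def __init__(self, distributed_backend=None, gpus=None):
        self.on_gpu = bool(gpus) and torch.cuda.is_available()
        self.root_gpu = None
        if distributed_backend == "horovod" and self.on_gpu:
            import horovod.torch as hvd
            hvd.init()
            # each Horovod worker owns the GPU matching its local rank
            self.root_gpu = hvd.local_rank()
```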
LGTM 🦝
This pull request is now in conflict... :(
I saw it with other PRs too, no idea what changed...
Seems this is a problem on master: https://circleci.com/gh/PyTorchLightning/pytorch-lightning/24909?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
Looking at the build history, it looks like #1498 was when the failure started occurring. Maybe @williamFalcon knows more.
I agree that it is not linked to this PR, but the PRs you mentioned don't seem to have touched it either, and their last tests were passing...
Remove the LBFGS accuracy requirement but leave the test; we need it for closure testing with LBFGS.
See #1678
This pull request is now in conflict... :(
Hey @Borda, ready to land?
Yes, sure, I just need to collect other approvals from @PyTorchLightning/core-contributors or @williamFalcon
It was discovered that when restoring a model from a checkpoint (trainer.restore(path, on_gpu=True)) and then performing testing, the model will not be placed on the correct GPU device, because root_gpu was not set when using the Horovod backend.
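For illustration, a rough sketch of that scenario. The module is trimmed down (no test_step or dataloaders), the checkpoint path is a placeholder, and trainer.restore/on_gpu are used as described above; this is a sketch of the reported situation, not a verified reproduction:

```python
# Restore a checkpoint on GPU under the Horovod backend, then run testing.
# Before this fix, root_gpu stayed unset for Horovod, so the restored model
# could end up on the wrong device at this point.
import torch
import horovod.torch as hvd
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    # minimal placeholder module, just enough to construct a Trainer run
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


hvd.init()
model = TinyModel()
trainer = pl.Trainer(distributed_backend="horovod", gpus=1)

trainer.restore("path/to/checkpoint.ckpt", on_gpu=True)  # restore as in the report
trainer.test(model)  # expects the model to sit on root_gpu
```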