
Fix Horovod distributed backend to set the root_gpu property #1669

Merged
9 commits merged into Lightning-AI:master from horovod_fix on May 1, 2020

Conversation

tgaddair
Contributor

It was discovered that when restoring a model from a checkpoint (trainer.restore(path, on_gpu=True)) and then running testing, the model is not placed on the correct GPU device, because root_gpu was not set when using the Horovod backend.
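A minimal sketch of the reported scenario, for context (the LightningModule and checkpoint path below are placeholders, and the Trainer arguments follow the 0.7.x API as I understand it):

```python
# Illustrative sketch only -- not a test from the repository.
# `MyModel` and the checkpoint path are hypothetical placeholders.
from pytorch_lightning import Trainer

model = MyModel()  # some LightningModule defined elsewhere
trainer = Trainer(distributed_backend='horovod', gpus=1)

# Restore from a checkpoint, then run testing on GPU. Before this fix,
# root_gpu was never set under the Horovod backend, so the restored
# weights could be placed on the wrong CUDA device during testing.
trainer.restore('path/to/checkpoint.ckpt', on_gpu=True)
trainer.test(model)
```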

@pep8speaks

pep8speaks commented Apr 29, 2020

Hello @tgaddair! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-01 15:46:13 UTC

@mergify mergify bot requested a review from a team April 29, 2020 23:27
@codecov

codecov bot commented Apr 29, 2020

Codecov Report

Merging #1669 into master will decrease coverage by 0%.
The diff coverage is 55%.

@@          Coverage Diff           @@
##           master   #1669   +/-   ##
======================================
- Coverage      88%     88%   -0%     
======================================
  Files          69      69           
  Lines        4129    4133    +4     
======================================
+ Hits         3653    3656    +3     
- Misses        476     477    +1     

@Borda Borda added the bug Something isn't working label Apr 30, 2020
@Borda Borda added this to the 0.7.6 milestone Apr 30, 2020
@@ -570,8 +570,9 @@ def horovod_train(self, model):

        if torch.cuda.is_available() and self.on_gpu:
            # Horovod: pin GPU to local rank
            torch.cuda.set_device(hvd.local_rank())
            model.cuda(hvd.local_rank())
+           self.root_gpu = hvd.local_rank()
Member

Shall it rather be a protected self._root_gpu? Also add it to the Trainer's init.

Contributor Author

Moved the logic to the init method. It seems there isn't a protected _root_gpu; I was using the existing public root_gpu property that is set elsewhere, following this pattern: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/distrib_data_parallel.py#L348

Let me know if the new approach looks good.
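For illustration, a rough sketch of what setting root_gpu during trainer setup could look like for the Horovod backend; the names and placement here are assumptions, not the actual code in this PR:

```python
# Hedged sketch -- the real Trainer wires this differently.
# Assumes horovod.torch has been initialized (hvd.init()) before use.
import horovod.torch as hvd

class Trainer:
    def __init__(self, on_gpu=False, use_horovod=False):
        self.on_gpu = on_gpu
        self.use_horovod = use_horovod
        self.root_gpu = None
        if self.use_horovod and self.on_gpu:
            # Mirror the DDP pattern: the root device for this process
            # is the GPU matching its Horovod local rank.
            self.root_gpu = hvd.local_rank()
```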

@mergify mergify bot requested a review from a team April 30, 2020 13:20
@Borda Borda added the priority: 0 High priority task label Apr 30, 2020
Member

@Borda Borda left a comment


LGTM 🦝

@mergify mergify bot requested a review from a team April 30, 2020 20:34
@Borda Borda added the ready PRs ready to be merged label Apr 30, 2020
@mergify
Contributor

mergify bot commented Apr 30, 2020

This pull request is now in conflict... :(

@Borda
Member

Borda commented Apr 30, 2020

I saw it with other PRs as well, no idea what changed...
AssertionError: This model is expected to get > 0.5 in test set (it got 0.28125)

@tgaddair
Contributor Author

> I saw it with other PRs as well, no idea what changed...
> AssertionError: This model is expected to get > 0.5 in test set (it got 0.28125)

Seems this is a problem on master: https://circleci.com/gh/PyTorchLightning/pytorch-lightning/24909?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Looking at the build history, it looks like #1498 was when the failure started occurring. Maybe @williamFalcon knows more.

@Borda
Member

Borda commented Apr 30, 2020

> Looking at the build history, it looks like #1498 was when the failure started occurring. Maybe @williamFalcon knows more.

I agree that it is not linked to this PR, but the PRs you mention do not seem to touch it either, and their last tests were passing...
Do we even need this test_lbfgs_cpu_model test?

@williamFalcon
Contributor

Remove the LBFGS accuracy requirement but leave the test; we need it for closure testing with LBFGS.
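For background on why the closure test matters, here is a plain-PyTorch sketch (not the Lightning test itself) showing that LBFGS requires a closure so it can re-evaluate the loss several times per optimizer step:

```python
# Standalone illustration of LBFGS closure usage in plain PyTorch.
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)
x, y = torch.randn(16, 4), torch.randn(16, 1)

def closure():
    # LBFGS may call this several times per step, so it must
    # zero gradients, recompute the loss, and backprop each time.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)
```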

@Borda
Member

Borda commented May 1, 2020

See #1678

@Borda Borda changed the title Fix Horovod distributed backend to set the root_gpu property [blocked by #1678] Fix Horovod distributed backend to set the root_gpu property May 1, 2020
@mergify
Contributor

mergify bot commented May 1, 2020

This pull request is now in conflict... :(

@Borda Borda changed the title [blocked by #1678] Fix Horovod distributed backend to set the root_gpu property Fix Horovod distributed backend to set the root_gpu property May 1, 2020
@tgaddair
Contributor Author

tgaddair commented May 1, 2020

Hey @Borda, ready to land?

@Borda
Member

Borda commented May 1, 2020

> Hey @Borda, ready to land?

Yes, sure; I just need to collect the other approvals from @PyTorchLightning/core-contributors or @williamFalcon.

@williamFalcon williamFalcon merged commit 2950f66 into Lightning-AI:master May 1, 2020
@tgaddair tgaddair deleted the horovod_fix branch May 1, 2020 18:15