Fix Horovod distributed backend to set the root_gpu property #1669
Conversation
Hello @tgaddair! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-05-01 15:46:13 UTC
Codecov Report
@@           Coverage Diff           @@
##           master   #1669    +/-  ##
=======================================
- Coverage      88%     88%     -0%
=======================================
  Files          69      69
  Lines        4129    4133     +4
=======================================
+ Hits         3653    3656     +3
- Misses        476     477     +1
@@ -570,8 +570,9 @@ def horovod_train(self, model):

         if torch.cuda.is_available() and self.on_gpu:
             # Horovod: pin GPU to local rank
             torch.cuda.set_device(hvd.local_rank())
             model.cuda(hvd.local_rank())
+            self.root_gpu = hvd.local_rank()
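For readers unfamiliar with Horovod's device handling, here is a standalone sketch of the pinning pattern used in this hunk. It is illustrative only, outside of any Trainer, and not the Lightning source:

```python
# Each Horovod worker process pins its CUDA device to its local rank so that
# its tensors and collective operations use the GPU it owns; the PR additionally
# records that rank (as self.root_gpu) for later device placement.
import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin this process to one GPU
    root_gpu = hvd.local_rank()              # remember which GPU this worker owns
```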
Shall it rather be a protected self._root_gpu?
Also add it to the Trainer init.
Moved the logic to the init method. It seems there isn't a protected _root_gpu; I was using the existing public root_gpu property set elsewhere, following this pattern: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/distrib_data_parallel.py#L348
Let me know if the new approach looks good.
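For context, a rough sketch of what recording root_gpu at Trainer construction time could look like for the Horovod backend. The surrounding Trainer skeleton and argument names below are assumptions for illustration, not the merged pytorch-lightning code:

```python
# Illustrative skeleton only: record the root GPU for the Horovod backend up
# front, so later stages (e.g. testing after a restore) know which device the
# model belongs on.
import torch


class Trainer:
    def __init__(self, distributed_backend=None, gpus=None):
        self.on_gpu = bool(gpus) and torch.cuda.is_available()
        self.root_gpu = None
        if distributed_backend == "horovod" and self.on_gpu:
            import horovod.torch as hvd
            hvd.init()
            # each Horovod worker owns the GPU matching its local rank
            self.root_gpu = hvd.local_rank()
```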
LGTM 🦝
This pull request is now in conflict... :(
I saw it with other PRs too, no idea what changed...
Seems this is a problem on master: https://circleci.com/gh/PyTorchLightning/pytorch-lightning/24909?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
Looking at the build history, it looks like #1498 was when the failure started occurring. Maybe @williamFalcon knows more.
I agree that it is not linked to this PR, but the PRs you mentioned don't seem to have touched it either, and their last tests were passing...
Remove the LBFGS accuracy requirement but leave the test; we need it for closure testing with LBFGS.
See #1678
This pull request is now in conflict... :(
Hey @Borda, ready to land?
Yes, sure, I just need to collect other approvals from @PyTorchLightning/core-contributors or @williamFalcon
It was discovered that when restoring a model from a checkpoint (trainer.restore(path, on_gpu=True)) and then performing testing, the model will not be placed on the correct GPU device, because root_gpu was not set when using the Horovod backend.
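For illustration, a rough sketch of that scenario. The module is trimmed down (no test_step or dataloaders), the checkpoint path is a placeholder, and trainer.restore/on_gpu are used as described above; this is a sketch of the reported situation, not a verified reproduction:

```python
# Restore a checkpoint on GPU under the Horovod backend, then run testing.
# Before this fix, root_gpu stayed unset for Horovod, so the restored model
# could end up on the wrong device at this point.
import torch
import horovod.torch as hvd
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    # minimal placeholder module, just enough to construct a Trainer run
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


hvd.init()
model = TinyModel()
trainer = pl.Trainer(distributed_backend="horovod", gpus=1)

trainer.restore("path/to/checkpoint.ckpt", on_gpu=True)  # restore as in the report
trainer.test(model)  # expects the model to sit on root_gpu
```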