
fix selecting GPUs using CUDA_VISIBLE_DEVICES #2739

Merged: 2 commits into Lightning-AI:master on Aug 2, 2020

Conversation

ibeltagy (Contributor)

What does this PR do?

Fixes #2407

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team July 28, 2020 15:34
@Borda Borda added the bug Something isn't working label Jul 28, 2020
```diff
@@ -528,7 +528,7 @@ def ddp_train(self, process_idx, q, model, is_master=False, proc_offset=0):
         if is_master:
             # source of truth is cuda for gpu idx
             gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
-            gpu_idx = int(gpus[self.local_rank])
+            gpu_idx = self.local_rank
```
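For context, here is a minimal standalone sketch (plain Python, not Lightning code; the variable names mirror the diff above) of what the removed and added lines compute for the master process. Which value is the right one depends on whether CUDA_VISIBLE_DEVICES has already been applied to the process, which is what the discussion below is about.

```python
import os

# assume the job was launched with CUDA_VISIBLE_DEVICES=4,5,6,7
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6,7'
local_rank = 0  # the master process has local rank 0

gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')

old_gpu_idx = int(gpus[local_rank])  # 4 -- indexes into the mask, a physical device id
new_gpu_idx = local_rank             # 0 -- uses the local rank directly, a logical id
print(old_gpu_idx, new_gpu_idx)      # prints: 4 0
```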
Contributor

This won't work, because if you have access to GPUs 4,5,6,7 and you request "2,3", you're actually asking for GPUs "6,7".

ibeltagy (Contributor, Author) commented Jul 28, 2020

In my view, having PL run your model on GPUs 6,7 is the expected behavior in this case.

ibeltagy (Contributor, Author) commented Jul 28, 2020

This fixes another problem with ddp. If gpus=3 and CUDA_VISIBLE_DEVICES=4,5,6,7, ddp will run only two jobs on GPUs 5,6, and the job on GPU4 won't work.
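For illustration, a rough sketch of the index arithmetic in that scenario (an assumed failure mode, not a reproduction): if the user-set mask is already in effect, only four logical devices (0-3) are visible, so the ordinals the old code asks for are out of range, which is the kind of "invalid device ordinal" error reported in the linked issue #2407.

```python
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6,7'        # 4 visible devices, numbered 0-3
gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')

for local_rank in range(3):                            # gpus=3 -> local ranks 0, 1, 2
    gpu_idx = int(gpus[local_rank])                    # old code asks for ordinals 4, 5, 6
    in_range = gpu_idx < len(gpus)                     # but only ordinals 0-3 exist
    print(f"rank {local_rank}: set_device({gpu_idx}) valid={in_range}")
```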

Contributor

I agree, but here's what happens:

GPUs available: 0, 1, 2, 3, 4, 5
indices:        0, 1, 2, 3, 4, 5
gpus[2] = 2

When you set CUDA_VISIBLE_DEVICES, your numbering changes:
CUDA_VISIBLE_DEVICES='2,4,5'
now your indices are 0, 1, 2

So once you set visible devices, the mapping changes:
gpus[0] = 2
gpus[2] = 5
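To make the remapping concrete, here is a small standalone snippet (not Lightning code; it assumes a machine where physical GPUs 2, 4 and 5 exist) showing that once the mask is set, torch only sees the listed devices and renumbers them from 0.

```python
import os
# must be set before CUDA is initialised in this process
os.environ['CUDA_VISIBLE_DEVICES'] = '2,4,5'

import torch

print(torch.cuda.device_count())   # 3, regardless of how many physical GPUs the box has
torch.cuda.set_device(0)           # logical 0 -> physical GPU 2
torch.cuda.set_device(2)           # logical 2 -> physical GPU 5
# torch.cuda.set_device(4)         # would fail: only logical ordinals 0-2 exist
```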

Contributor

Can you share code that breaks so I can reproduce and verify? Your fix might fix this problem, but it's likely to break other DDP settings.

@mergify mergify bot requested a review from a team July 28, 2020 19:55
@ibeltagy ibeltagy requested review from williamFalcon and removed request for a team July 31, 2020 05:18
@mergify mergify bot requested a review from a team July 31, 2020 05:19
@mergify mergify bot requested a review from a team July 31, 2020 12:16
williamFalcon (Contributor)

I suspect that although this will fix the problem you mentioned, it will break other setups. Mind adding a test first to show that it fails and then a test showing that it passes with the fix?

Our CI uses 2 GPUs, so you can base the test off of that.

Borda (Member) left a comment

I think it is correct; just please add a test for this case. Our tests run on a machine with 2x K80.

@mergify mergify bot requested a review from a team July 31, 2020 12:22
ibeltagy (Contributor, Author)

Can you point me to a similar unit test that I can follow?

Borda (Member) commented Jul 31, 2020

Can you point me to a similar unit test that I can follow?

Have a look at tests/models/test_gpu.py.

@ananyahjha93 ananyahjha93 self-requested a review July 31, 2020 15:44

codecov bot commented Jul 31, 2020

Codecov Report

Merging #2739 into master will increase coverage by 0%.
The diff coverage is 0%.

@@          Coverage Diff           @@
##           master   #2739   +/-   ##
======================================
  Coverage      91%     91%           
======================================
  Files          76      76           
  Lines        6787    6786    -1     
======================================
  Hits         6150    6150           
+ Misses        637     636    -1     

Borda (Member) left a comment

Please add a test for this case 🐰

@mergify mergify bot requested a review from a team August 1, 2020 07:18
williamFalcon (Contributor) commented Aug 2, 2020

Ok, I tested this in a simple case and it worked. Then I merged it and tested on a more complicated node, and my worries came true.

This PR introduces a new bug: with ddp, local rank will always be 0 on the master node... so when CUDA_VISIBLE_DEVICES is set to something else, the master should pull the 0th device index and NOT always run on device 0.

Here's the issue:

python train.py --gpus '4,5' --distributed_backend 'ddp'

With the fix in this PR, the GPUs used will actually be 0 and 5, since local_rank=0 always for the master process.

In the PR where I fix this issue, #2796, I will actually use GPUs 4, 5 and NOT 0, 5. Since the master has local_rank=0, it pulls the 0th GPU index, which is 4, and training starts correctly.
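A rough sketch of the behaviour described above for #2796 (an illustration, not the actual code in that PR): every process indexes CUDA_VISIBLE_DEVICES by its local rank, so the master (local_rank=0) lands on GPU 4 instead of GPU 0.

```python
import os

# python train.py --gpus '4,5' --distributed_backend 'ddp'
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5'
gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')

for local_rank in range(len(gpus)):
    gpu_idx = int(gpus[local_rank])   # master (rank 0) -> GPU 4, worker (rank 1) -> GPU 5
    print(f"local_rank={local_rank} -> GPU {gpu_idx}")
```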

@williamFalcon williamFalcon merged commit 38fce2e into Lightning-AI:master Aug 2, 2020
@williamFalcon williamFalcon mentioned this pull request Aug 2, 2020
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cuda runtime error (101) : invalid device ordinal
3 participants