fix selecting GPUs using CUDA_VISIBLE_DEVICES #2739
Conversation
@@ -528,7 +528,7 @@ def ddp_train(self, process_idx, q, model, is_master=False, proc_offset=0):
        if is_master:
            # source of truth is cuda for gpu idx
            gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
-           gpu_idx = int(gpus[self.local_rank])
+           gpu_idx = self.local_rank
this won't work... because if you have access to GPUs 4,5,6,7 and you request "2,3", you're actually asking for "6,7"
In my view, having PL run your model on GPUs 6,7 is the expected behavior in this case.
This also fixes another problem with ddp: if gpus=3 and CUDA_VISIBLE_DEVICES=4,5,6,7, ddp runs only two jobs, on GPUs 5 and 6, and the job on GPU 4 won't work.
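To make that concrete, here's a small sketch (not the actual Lightning code path) of what the old line computes in that setup; the spawned process only sees devices indexed 0..3, yet the old line feeds the physical ids back in as indices:

# sketch only: assumes CUDA_VISIBLE_DEVICES='4,5,6,7' and gpus=3
gpus = '4,5,6,7'.split(',')          # what CUDA_VISIBLE_DEVICES exposes
for local_rank in range(3):          # gpus=3 -> local ranks 0, 1, 2
    gpu_idx = int(gpus[local_rank])  # old code computes 4, 5, 6
    # but the process only knows device indices 0..3, so 4/5/6 point at
    # the wrong devices (or are out of range), matching the failure above
    print(local_rank, '->', gpu_idx)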
i agree, but here's what happens:

gpus available: 0, 1, 2, 3, 4, 5
index:          0, 1, 2, 3, 4, 5
gpus[2] = 2

When you set CUDA_VISIBLE_DEVICES, your numbering changes:

CUDA_VISIBLE_DEVICES='2,4,5'
now your indexes are 0, 1, 2

So once you set visible devices, the mapping changes:

gpus[0] = 2
gpus[2] = 5
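A quick sketch of that remapping (assuming the variable is set before CUDA is initialized, so the process only sees the listed devices):

import os

# e.g. CUDA_VISIBLE_DEVICES='2,4,5' -> cuda:0 is physical 2, cuda:1 is 4, cuda:2 is 5
visible = os.environ.get('CUDA_VISIBLE_DEVICES', '2,4,5')
mapping = {local: int(phys) for local, phys in enumerate(visible.split(','))}
print(mapping)  # {0: 2, 1: 4, 2: 5}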
Can you share code that breaks so I can reproduce and verify? Your fix might solve this problem, but it's likely to break other DDP settings.

I suspect that although this will fix the problem you mentioned, it will break other setups. Mind adding a test first to show that it fails, and then a test showing that it passes with the fix? Our CI uses 2 GPUs, so you can base the test off of that.
I think it is correct, just pls add a test for this case; our tests run on a machine with 2x K80.

Can you point me to a similar unit test that I can follow?

Have a look at tests/models/test_gpu.py
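For example, something along these lines (just a rough sketch, not an existing test in that file; it spawns a fresh interpreter because CUDA_VISIBLE_DEVICES is only read when the CUDA context is first created):

import os
import subprocess
import sys

import pytest
import torch


def device_count_with_visible(devices):
    # run in a fresh process so the env var is read before CUDA init
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    out = subprocess.check_output(
        [sys.executable, '-c', 'import torch; print(torch.cuda.device_count())'],
        env=env,
    )
    return int(out.strip())


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason='needs the 2-GPU CI machine')
def test_visible_devices_remap():
    # masking to a single physical GPU leaves exactly one visible device,
    # which torch then indexes as cuda:0
    assert device_count_with_visible('1') == 1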
Co-authored-by: Jirka Borovec <[email protected]>
Codecov Report
@@           Coverage Diff           @@
##           master   #2739   +/-  ##
======================================
  Coverage      91%     91%
======================================
  Files          76      76
  Lines        6787    6786     -1
======================================
  Hits         6150    6150
+ Misses        637     636     -1
pls add test for this case 🐰
Ok, tested in a simple case and it worked. Merged and tested on a more complicated node, and my worries came true. This PR introduces a new bug: with ddp, local_rank will always be 0 on the master node... so when CUDA_VISIBLE_DEVICES is set to something else, the master should pull the 0th device index and NOT always run on GPU 0.

Here's the issue:

python train.py --gpus '4,5' --distributed_backend 'ddp'

With the fix in this PR, the GPUs used will actually be 0 and 5, since local_rank=0 always for the master process. With the PR where I fix this issue, #2796, GPUs 4 and 5 will actually be used, NOT 0 and 5: since master has local_rank=0, it pulls the 0th GPU index, which is 4, and thus training starts correctly.
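A minimal sketch of the difference described above (assuming CUDA_VISIBLE_DEVICES='4,5' and the master process, where local_rank is 0):

gpus = '4,5'.split(',')  # contents of CUDA_VISIBLE_DEVICES
local_rank = 0           # master process

gpu_idx_this_pr = local_rank              # -> 0, the wrong physical GPU
gpu_idx_followup = int(gpus[local_rank])  # -> 4, the intended GPU (#2796)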
What does this PR do?
Fixes #2407
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃