Set correct device ids in DDP [wip] #4297
Conversation
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-24 21:33:15 UTC
Codecov Report

@@            Coverage Diff            @@
##           master    #4297    +/-   ##
=========================================
  Coverage      93%      93%
=========================================
  Files         111      111
  Lines        8021     7993    -28
=========================================
- Hits         7459     7434    -25
+ Misses        562      559     -3
LGTM :)
Please hold back on merging until I have @williamFalcon's review/approval, since it's his DDP code. Thanks
looking now. this is great!
@ananthsub maybe you want to have a look here too. The more eyes on ddp the better :)
@@ -206,7 +206,7 @@ def select_accelerator(self):

         # ddp script mode uses the same flags as TE
         # TODO: decouple from TE
-        if os.environ.get('PL_DDP_PID', False):
+        if os.environ.get('PL_IN_DDP_SUBPROCESS', False):
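For context, a minimal sketch of how the parent process could mark the re-launched children in DDP script mode; the variable names and launch logic here are illustrative, not the accelerator's actual code.

```python
import os
import subprocess
import sys

# Sketch only: the parent DDP process re-launches the training script once per
# GPU and flags the children, so select_accelerator() knows it is already
# inside a DDP subprocess and must not re-launch again.
if 'PL_IN_DDP_SUBPROCESS' not in os.environ:
    env = os.environ.copy()
    env['PL_IN_DDP_SUBPROCESS'] = '1'
    env['LOCAL_RANK'] = '1'  # e.g. the child's local rank (illustrative)
    subprocess.Popen([sys.executable] + sys.argv, env=env)
```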
is there documentation around what env vars lightning sets anywhere?
No, I am sure there is no documentation about these. They are used internally only and not meant to be modified by the user.
Should they be documented? I'm not sure.
ah I meant for contributors/developers primarily. I should probably read the code again too haha
ah, good point. Where could env variables be documented? They are like global variables. Maybe at the top of the accelerator base class file. I have no good suggestions :)
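As an illustration of that idea (hypothetical, not part of this PR), a docstring at the top of the accelerator base class could list the internal variables:

```python
"""Accelerator base class.

Environment variables used internally by Lightning's DDP script mode
(illustrative list, not exhaustive -- these are not meant to be set by users):

    PL_IN_DDP_SUBPROCESS: set to '1' in child processes re-launched by the
        parent DDP process, so accelerator selection skips another re-launch.
    LOCAL_RANK: index of the current process on its node.
"""
```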
@@ -180,7 +173,7 @@ def set_world_ranks(self, process_idx):

         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx):
do we need process_idx at all then?
yes, we can probably remove it. The device the model is on isn't related to the process index anyway, which was actually the root cause of the bug here.
I added a few n00b questions but LGTM!
@@ -162,7 +162,7 @@ def set_world_ranks(self, process_idx):

         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx, is_master):
-        gpu_idx = process_idx
+        gpu_idx = self.trainer.data_parallel_device_ids[self.trainer.local_rank]
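A minimal sketch of why the old mapping was wrong (the values are illustrative, not taken from the PR):

```python
# Suppose the user requests gpus=[1, 3] with the ddp backend.
data_parallel_device_ids = [1, 3]

for local_rank in range(2):
    process_idx = local_rank
    old_gpu_idx = process_idx                           # 0, 1 -> wrong devices
    new_gpu_idx = data_parallel_device_ids[local_rank]  # 1, 3 -> requested devices
    print(f"rank {local_rank}: old cuda:{old_gpu_idx}, new cuda:{new_gpu_idx}")
```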
does this same fix apply to the ddp_torchelastic_accelerator? are there other DDP accelerators which need this change too?
yes, some of these accelerators, like torchelastic and ddp2, should also get this fix. The problem is I didn't want to touch them because I can't test the fix myself without a multi-node setup. It would be good to follow up on this.
sorry, I actually don't think the others need this fix.
The reason is that torchelastic and co derive their process numbers (local rank, global rank, etc.) from the environment, so they don't need to do this.
Our DDP is a special case where we act like torchelastic ourselves, so we need to set this up manually.
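A rough sketch of the difference, with illustrative variable names (the real accelerators read and set more than this):

```python
import os

# torchelastic-style: the launcher already exported the ranks, so just read them.
local_rank = int(os.environ.get('LOCAL_RANK', 0))
global_rank = int(os.environ.get('RANK', local_rank))
world_size = int(os.environ.get('WORLD_SIZE', 1))

# Lightning's own DDP script mode: the parent process has to export these
# values itself before re-launching the script, which is why it also has to
# map local_rank -> the requested GPU id explicitly instead of relying on
# the process index.
```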
This pull request is now in conflict... :(
ok, just verified that this works!
@ananthsub, just pushed an rc. Mind checking on slurm and TE to make sure this is fine so we don't run into issues during the patch on Tuesday? |
Fixes #3791
Fixes #4171
Fixes #3865 (maybe, need more info)
All of these commands work now and run on the correct devices:
running on gpu 1 and 2:
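(The original command isn't reproduced above; a roughly equivalent Trainer config, as an illustration:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=[1, 2], distributed_backend='ddp')  # illustrative only
```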
running on gpu 1 and 2 with CUDA_VISIBLE_DEVICES:
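(Again illustrative: the shell would export CUDA_VISIBLE_DEVICES="1,2" and the Trainer then selects both visible devices:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, distributed_backend='ddp')  # with CUDA_VISIBLE_DEVICES="1,2"
```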
non-consecutive ids also work
running on gpu 1 and 3:
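(Illustrative equivalent:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=[1, 3], distributed_backend='ddp')  # illustrative only
```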
equivalent with CUDA_VISIBLE_DEVICES
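(Illustrative equivalent, with CUDA_VISIBLE_DEVICES="1,3" exported in the shell:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, distributed_backend='ddp')  # with CUDA_VISIBLE_DEVICES="1,3"
```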