Set correct device ids in DDP [wip] #4297
Conversation
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-24 21:33:15 UTC
Codecov Report

@@            Coverage Diff            @@
##           master    #4297    +/-   ##
=========================================
  Coverage      93%      93%
=========================================
  Files         111      111
  Lines        8021     7993    -28
=========================================
- Hits         7459     7434    -25
+ Misses        562      559     -3
LGTM :)
Please hold back on merging until I have @williamFalcon's review/approval, since it's his DDP code. Thanks
looking now. this is great!
@ananthsub maybe you want to have a look here too. The more eyes on ddp the better :)
@@ -206,7 +206,7 @@ def select_accelerator(self):

         # ddp script mode uses the same flags as TE
         # TODO: decouple from TE
-        if os.environ.get('PL_DDP_PID', False):
+        if os.environ.get('PL_IN_DDP_SUBPROCESS', False):
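For context, a minimal sketch of how the parent process could mark the re-launched children in DDP script mode; the variable names and launch logic here are illustrative, not the accelerator's actual code.

```python
import os
import subprocess
import sys

# Sketch only: the parent DDP process re-launches the training script once per
# GPU and flags the children, so select_accelerator() knows it is already
# inside a DDP subprocess and must not re-launch again.
if 'PL_IN_DDP_SUBPROCESS' not in os.environ:
    env = os.environ.copy()
    env['PL_IN_DDP_SUBPROCESS'] = '1'
    env['LOCAL_RANK'] = '1'  # e.g. the child's local rank (illustrative)
    subprocess.Popen([sys.executable] + sys.argv, env=env)
```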
is there documentation around what env vars lightning sets anywhere?
No, I am sure there is no documentation about these. They are used internally only and not meant to be modified by the user.
Should they be documented? I'm not sure.
ah I meant for contributors/developers primarily. I should probably read the code again too haha
ah, good point. Where could env variables be documented? They are like global variables. Maybe at the top of the accelerator base class file. I have no good suggestions :)
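As an illustration of that idea (hypothetical, not part of this PR), a docstring at the top of the accelerator base class could list the internal variables:

```python
"""Accelerator base class.

Environment variables used internally by Lightning's DDP script mode
(illustrative list, not exhaustive -- these are not meant to be set by users):

    PL_IN_DDP_SUBPROCESS: set to '1' in child processes re-launched by the
        parent DDP process, so accelerator selection skips another re-launch.
    LOCAL_RANK: index of the current process on its node.
"""
```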
@@ -180,7 +173,7 @@ def set_world_ranks(self, process_idx):

         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx):
do we need process_idx at all then?
yes, we can probably remove it. The device the model is on isn't related to the process index anyway, which was actually the root cause of the bug here.
I added a few n00b questions but LGTM!
@@ -162,7 +162,7 @@ def set_world_ranks(self, process_idx):

         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx, is_master):
-        gpu_idx = process_idx
+        gpu_idx = self.trainer.data_parallel_device_ids[self.trainer.local_rank]
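A minimal sketch of why the old mapping was wrong (the values are illustrative, not taken from the PR):

```python
# Suppose the user requests gpus=[1, 3] with the ddp backend.
data_parallel_device_ids = [1, 3]

for local_rank in range(2):
    process_idx = local_rank
    old_gpu_idx = process_idx                           # 0, 1 -> wrong devices
    new_gpu_idx = data_parallel_device_ids[local_rank]  # 1, 3 -> requested devices
    print(f"rank {local_rank}: old cuda:{old_gpu_idx}, new cuda:{new_gpu_idx}")
```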
does this same fix apply to the ddp_torchelastic_accelerator? are there other DDP accelerators which need this change too?
yes, some of these accelerators, like torchelastic and ddp2, should also get this fix. The problem is I didn't want to touch them because I can't test the fix myself without a multi-node setup. It would be good to follow up on this.
sorry, I actually don't think the others need this fix.
The reason is that torchelastic and co derive their process numbers (local rank, global rank, etc.) from the environment, so they don't need to do this.
Our DDP is a special case where we act like torchelastic ourselves, so we need to set this up manually.
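A rough sketch of the difference, with illustrative variable names (the real accelerators read and set more than this):

```python
import os

# torchelastic-style: the launcher already exported the ranks, so just read them.
local_rank = int(os.environ.get('LOCAL_RANK', 0))
global_rank = int(os.environ.get('RANK', local_rank))
world_size = int(os.environ.get('WORLD_SIZE', 1))

# Lightning's own DDP script mode: the parent process has to export these
# values itself before re-launching the script, which is why it also has to
# map local_rank -> the requested GPU id explicitly instead of relying on
# the process index.
```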
This pull request is now in conflict... :(
ok, just verified that this works!
@ananthsub, just pushed an rc. Mind checking on slurm and TE to make sure this is fine so we don't run into issues during the patch on Tuesday? |
Fixes #3791
Fixes #4171
Fixes #3865 (maybe, need more info)
All of these commands work now and run on the correct devices:
running on gpu 1 and 2:
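(The original command isn't reproduced above; a roughly equivalent Trainer config, as an illustration:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=[1, 2], distributed_backend='ddp')  # illustrative only
```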
running on gpu 1 and 2 with CUDA_VISIBLE_DEVICES:
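(Again illustrative: the shell would export CUDA_VISIBLE_DEVICES="1,2" and the Trainer then selects both visible devices:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, distributed_backend='ddp')  # with CUDA_VISIBLE_DEVICES="1,2"
```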
non-consecutive ids also work
running on gpu 1 and 3:
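(Illustrative equivalent:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=[1, 3], distributed_backend='ddp')  # illustrative only
```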
equivalent with CUDA_VISIBLE_DEVICES
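(Illustrative equivalent, with CUDA_VISIBLE_DEVICES="1,3" exported in the shell:)

```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, distributed_backend='ddp')  # with CUDA_VISIBLE_DEVICES="1,3"
```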