
Conversation

@jeffra (Collaborator) commented Jun 25, 2020

Allow users to initialize torch distributed outside of deepspeed.initialize. This is specifically needed for MPI discovery (e.g., AML) and other models. It also removes the need for the --deepspeed_mpi flag: we will automatically attempt to discover MPI world info if we are not launched with a torch.distributed or deepspeed launcher.

TODO: update documentation, remove the --deepspeed_mpi flag, and add unit tests.
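
For reference, a minimal sketch of the usage pattern this change enables: the user initializes torch.distributed themselves (for example from environment variables exported by an MPI/AML job) before calling deepspeed.initialize. The env:// variables, the toy model, and the argument parsing shown here are illustrative assumptions, not the PR's code.

import argparse
import os

import torch
import torch.distributed as dist
import deepspeed

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed and --deepspeed_config
args = parser.parse_args()  # pass --deepspeed_config <path> on the command line

# Assumes RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK have been
# exported by the scheduler (or derived from MPI) before this script runs.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Linear(10, 10)  # toy model standing in for the real network
engine, optimizer, _, _ = deepspeed.initialize(
    args=args,  # args.deepspeed_config points at the DeepSpeed JSON config
    model=model,
    model_parameters=model.parameters(),
)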

@jeffra marked this pull request as ready for review June 25, 2020 22:02
self.device = torch.device("cuda", self.local_rank)
self.world_size = dist.get_world_size()
self.global_rank = dist.get_rank()
logger.info("Set device to local rank {} within node.".format(
Contributor commented on the diff excerpt above:
Can we move this message to where we set self.local_rank in line 125?
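
For illustration, the suggestion would look roughly like the following sketch; how local_rank is obtained here is an assumption, not the PR's actual code.

# Hypothetical sketch of the reviewer's suggestion: emit the log message right
# where self.local_rank is assigned instead of where the CUDA device is set.
self.local_rank = local_rank  # assumed: local_rank discovered from the env/launcher
logger.info("Set device to local rank {} within node.".format(self.local_rank))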

@jeffra (Collaborator, Author) commented Dec 17, 2020

Closing and moving to #608

@jeffra closed this Dec 17, 2020
@jeffra deleted the jeffra/ds_dist_init branch September 24, 2021 04:41