Conversation

@jeffra (Collaborator) commented on Dec 17, 2020

  • Refactor the MPI and AML patching into deepspeed.utils.distributed.
  • Support calling deepspeed.init_distributed() outside of the deepspeed.initialize() runtime. This is required for building models that need torch.distributed at construction time (e.g., for model parallelism). Currently, running a model like Megatron breaks when using an MPI launcher or running on AML. A usage sketch follows this list.
  • Should fix the issue in #461 (AssertionError: MPI world size 16 does not match torch world size 8).
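Below is a minimal sketch (not from this PR) of the usage this change enables: calling deepspeed.init_distributed() before deepspeed.initialize() so torch.distributed is available while the model is being constructed. The backend choice, the toy model, and the config dict are assumptions for illustration only.

```python
import torch
import deepspeed

# Initialize the distributed backend before deepspeed.initialize().
# With this change, the MPI/AML patching lives in
# deepspeed.utils.distributed and is applied here as well.
# "nccl" is an assumed backend choice for a GPU cluster.
deepspeed.init_distributed(dist_backend="nccl")

# torch.distributed is now usable during model construction,
# e.g. to shard a model-parallel model such as Megatron.
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()

# A toy stand-in model; a real model-parallel model would use the
# rank/world_size obtained above to build its own shard.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hand off to the DeepSpeed runtime as usual; the config dict here
# is a minimal assumption, not part of the PR.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params={"train_batch_size": world_size},
)
```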

@jeffra changed the title from "Distributed init outside runtime" to "Initialize distributed backend outside deepspeed runtime" on Dec 17, 2020
@jeffra changed the title from "Initialize distributed backend outside deepspeed runtime" to "Ability to initialize distributed backend outside deepspeed runtime" on Dec 17, 2020
Development

Successfully merging this pull request may close these issues:

WarmupDecayLR.params.total_num_steps - total or per gpu?
