Conversation

@jeffra (Collaborator) commented on Dec 17, 2020

  • Refactor the MPI and AML patching into deepspeed.utils.distributed.
  • Support calling deepspeed.init_distributed() outside of the deepspeed.initialize() runtime. This is required for building models that need torch.distributed at construction time (e.g., for model parallelism). Currently, running a model like Megatron breaks when using an MPI launcher or running on AML. A usage sketch follows this list.
  • Should fix the issue in #461 (AssertionError: MPI world size 16 does not match torch world size 8).
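Below is a minimal sketch (not from this PR) of the usage this change enables: calling deepspeed.init_distributed() before deepspeed.initialize() so torch.distributed is available while the model is being constructed. The backend choice, the toy model, and the config dict are assumptions for illustration only.

```python
import torch
import deepspeed

# Initialize the distributed backend before deepspeed.initialize().
# With this change, the MPI/AML patching lives in
# deepspeed.utils.distributed and is applied here as well.
# "nccl" is an assumed backend choice for a GPU cluster.
deepspeed.init_distributed(dist_backend="nccl")

# torch.distributed is now usable during model construction,
# e.g. to shard a model-parallel model such as Megatron.
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()

# A toy stand-in model; a real model-parallel model would use the
# rank/world_size obtained above to build its own shard.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hand off to the DeepSpeed runtime as usual; the config dict here
# is a minimal assumption, not part of the PR.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params={"train_batch_size": world_size},
)
```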

@jeffra changed the title from "Distributed init outside runtime" to "Initialize distributed backend outside deepspeed runtime" on Dec 17, 2020
@jeffra changed the title from "Initialize distributed backend outside deepspeed runtime" to "Ability to initialize distributed backend outside deepspeed runtime" on Dec 17, 2020
Development

Successfully merging this pull request may close these issues:

WarmupDecayLR.params.total_num_steps - total or per gpu?
