Merged
36 changes: 24 additions & 12 deletions README.md
@@ -286,27 +286,39 @@ doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html).
 {
   "train_batch_size": 8,
   "gradient_accumulation_steps": 1,
-  "steps_per_print": 1,
-  "zero_optimization": true,
-  "disable_allgather": true,
   "optimizer": {
     "type": "Adam",
     "params": {
-      "lr": 0.00015,
-      "max_grad_norm": 1.0
+      "lr": 0.00015
     }
   },
-
   "fp16": {
-    "enabled": true,
-    "loss_scale": 0,
-    "loss_scale_window": 1000,
-    "hysteresis": 2,
-    "min_loss_scale": 1
-  }
+    "enabled": true
+  },
+  "zero_optimization": true
 }
 ```

+## Multi-Node Environment Variables
+
+When training across multiple nodes we have found it useful to support
+propagating user-defined environment variables. By default DeepSpeed will
+propagate all set environment variables whose names contain `NCCL` or
+`PYTHONPATH`. If you would like to propagate additional variables you can
+specify them in a dot-file named `.deepspeed_env` that contains a
+newline-separated list of `VAR=VAL` entries. The DeepSpeed launcher will look
+in the local path you are executing from and also in your home directory (`~/`).
+
+As a concrete example, some clusters require special NCCL variables to be set
+prior to training. The user can simply add these variables to a
+`.deepspeed_env` file in their home directory that looks like this:
+```
+NCCL_IB_DISABLE=1
+NCCL_SOCKET_IFNAME=eth0
+```
+DeepSpeed will then make sure that these environment variables are set when
+launching each process on every node across their training job.
+
 # Launching DeepSpeed Training
 DeepSpeed installs the entry point `deepspeed` to launch distributed training.
 We illustrate an example usage of DeepSpeed with the following assumptions:
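To make the `.deepspeed_env` format from the README change concrete, here is a minimal Python sketch that parses such a file into a dictionary. The helper name `read_env_file` is hypothetical and not part of DeepSpeed, and skipping blank lines or lines without `=` is an assumption of this sketch rather than documented launcher behavior.

```python
import os
import tempfile


def read_env_file(path):
    """Parse newline-separated VAR=VAL entries into a dict (hypothetical helper)."""
    entries = {}
    with open(path, "r") as fd:
        for line in fd:
            line = line.strip()
            # Assumption of this sketch: ignore blank or malformed lines.
            if not line or "=" not in line:
                continue
            # Split on the first '=' only, so values may themselves contain '='.
            name, value = line.split("=", 1)
            entries[name] = value
    return entries


# Write the README's example file into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".deepspeed_env")
    with open(env_path, "w") as fd:
        fd.write("NCCL_IB_DISABLE=1\nNCCL_SOCKET_IFNAME=eth0\n")
    parsed = read_env_file(env_path)

print(parsed)
```

Note that the launcher change in this PR does not parse entries this way; it passes each line through to the shell as an `export` statement.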
9 changes: 9 additions & 0 deletions deepspeed/pt/deepspeed_run.py
@@ -17,6 +17,8 @@

 DLTS_HOSTFILE = "/job/hostfile"
 EXPORT_ENVS = ["NCCL", "PYTHONPATH"]
+DEEPSPEED_ENVIRONMENT_NAME = ".deepspeed_env"
+DEEPSPEED_ENVIRONMENT_PATHS = [os.path.expanduser("~"), '.']


 def parse_args(args=None):
@@ -317,6 +319,13 @@ def main(args=None):
             if any(map(lambda name: name in var, EXPORT_ENVS)):
                 exports += "export {}={}; ".format(var, env[var])

+        for environ_path in DEEPSPEED_ENVIRONMENT_PATHS:
+            environ_file = os.path.join(environ_path, DEEPSPEED_ENVIRONMENT_NAME)
+            if os.path.isfile(environ_file):
+                with open(environ_file, 'r') as fd:
+                    for var in fd.readlines():
+                        exports += "export {}; ".format(var.strip())
+
         deepspeed_launch = [
             exports,
             "cd {};".format(curr_path),
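The export logic in this change can be exercised in isolation. The sketch below is an approximation under stated assumptions, not the launcher's actual code: `build_exports` and `ENV_FILE_NAME` are hypothetical names, the search paths are passed in as an argument instead of being fixed to `~` and `.`, and blank lines are skipped (the loop in the diff would emit `export ; ` for a blank line).

```python
import os
import tempfile

EXPORT_ENVS = ["NCCL", "PYTHONPATH"]  # name fragments, as in the diff
ENV_FILE_NAME = ".deepspeed_env"      # hypothetical constant name for this sketch


def build_exports(env, search_paths):
    """Build a shell prefix of `export` statements (hypothetical helper)."""
    exports = ""
    # Step 1: forward variables whose names contain one of the fragments.
    for var in env:
        if any(name in var for name in EXPORT_ENVS):
            exports += "export {}={}; ".format(var, env[var])
    # Step 2: append every entry found in a .deepspeed_env file, in path order.
    for path in search_paths:
        env_file = os.path.join(path, ENV_FILE_NAME)
        if os.path.isfile(env_file):
            with open(env_file, "r") as fd:
                for line in fd:
                    entry = line.strip()
                    if entry:  # assumption of this sketch: skip blank lines
                        exports += "export {}; ".format(entry)
    return exports


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, ENV_FILE_NAME), "w") as fd:
        fd.write("NCCL_IB_DISABLE=1\n\nNCCL_SOCKET_IFNAME=eth0\n")
    fake_env = {"PYTHONPATH": "/opt/lib", "HOME": "/home/user"}
    exports = build_exports(fake_env, [tmp])

print(exports)
```

Because the statements are concatenated in search order, a variable defined in more than one file is exported more than once, and the last `export` takes effect when the shell evaluates the string. Values are inserted without quoting, so entries containing spaces or shell metacharacters would need care.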