Sync Horovod distributed training examples with latest changes (apache#14748)

* sync examples with latest changes in Horovod

* update README
yuxihu authored and haohuw committed Jun 23, 2019
1 parent e0362db commit 42eb502
Showing 4 changed files with 69 additions and 48 deletions.
22 changes: 11 additions & 11 deletions example/distributed_training-horovod/README.md
@@ -21,7 +21,7 @@ excellent scaling efficiency for dense models running on a large number of nodes
supports mainstream deep learning frameworks such as MXNet, TensorFlow, Keras, and PyTorch.
It was created at Uber and is currently hosted by the [Linux Foundation Deep Learning](https://lfdl.io) (LF DL).

MXNet is supported in Horovod 0.16.0 [release](https://eng.uber.com/horovod-pyspark-apache-mxnet-support/).
MXNet is supported starting from Horovod 0.16.0 [release](https://eng.uber.com/horovod-pyspark-apache-mxnet-support/).

## What's New?
Compared with the standard distributed training script in MXNet, which uses a parameter server to
@@ -35,7 +35,7 @@ there are a large number of workers and network bandwidth is the bottleneck.
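The result an averaging *allreduce* produces can be sketched in plain Python (no Horovod required): every worker contributes its local gradient and receives the same mean back, keeping all model replicas in sync. The gradient values below are made-up example numbers.

```python
# Toy illustration of the gradient averaging that allreduce performs.
# Real ring-allreduce exchanges chunks between neighboring workers, but
# the value each worker ends up with is simply the mean shown here.

def allreduce_average(worker_grads):
    """Return the per-worker result of an averaging allreduce."""
    mean = sum(worker_grads) / len(worker_grads)
    return [mean] * len(worker_grads)

grads = [1.0, 2.0, 3.0, 4.0]     # one local gradient per worker
print(allreduce_average(grads))  # -> [2.5, 2.5, 2.5, 2.5]
```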
```bash
$ pip install mxnet
```
**Note**: There is a [known issue](https://github.com/horovod/horovod/issues/884) when running Horovod with MXNet on a Linux system with GCC version 5.X and above. We recommend users to build MXNet from source following this [guide](https://mxnet.incubator.apache.org/install/build_from_source.html) as a workaround for now. Also mxnet-mkl package in 1.4.0 release does not support Horovod.
**Note**: The [known issue](https://github.com/horovod/horovod/issues/884) when running Horovod with MXNet on a Linux system with GCC version 5.X and above has been resolved. Use MXNet 1.4.1 or later together with Horovod 0.16.2 or later to avoid the GCC incompatibility. The MXNet 1.4.0 release works with Horovod 0.16.0 and 0.16.1, but remains affected by the GCC incompatibility.

## Install Horovod
```bash
@@ -66,8 +66,8 @@ To run MXNet with Horovod, make the following additions to your training script:
3. Scale the learning rate by the number of workers. The effective batch size in synchronous distributed training scales with
the number of workers, so an increased learning rate compensates for the increased batch size.

4. Wrap optimizer in `hvd.DistributedOptimizer`. The distributed optimizer delegates gradient computation
to the original optimizer, averages gradients using *allreduce* or *allgather*, and then applies those averaged
4. Create `hvd.DistributedTrainer` with the optimizer when using the Gluon API, or wrap the optimizer in `hvd.DistributedOptimizer` when using the Module API. The distributed trainer or optimizer delegates gradient computation
to the original optimizer, averages gradients using *allreduce*, and then applies those averaged
gradients.

5. Add `hvd.broadcast_parameters` to broadcast initial variable states from rank 0 to all other processes.
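The learning-rate scaling in step 3 can be tried in isolation. The sketch below uses plain Python with illustrative numbers; the actual example scripts express the same thing as `args.lr * hvd.size()`.

```python
# Step 3 in miniature: with synchronous data-parallel training, the
# effective batch size is the per-worker batch size times the number of
# workers, and the learning rate is scaled up by the same factor.
# The numbers below are illustrative only.

def scaled_lr(base_lr, num_workers):
    # Mirrors the `args.lr * hvd.size()` pattern in the example scripts.
    return base_lr * num_workers

base_lr, per_worker_batch, num_workers = 0.01, 64, 8
effective_batch = per_worker_batch * num_workers
print(effective_batch, scaled_lr(base_lr, num_workers))  # -> 512 0.08
```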
@@ -97,12 +97,13 @@ num_workers = hvd.size()
model = ...
model.hybridize()

# Define hyper parameters
optimizer_params = ...

# Add Horovod Distributed Optimizer
# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)
opt = hvd.DistributedOptimizer(opt)

# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Initialize parameters
model.initialize(initializer, ctx=context)
@@ -112,8 +113,7 @@ params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create trainer and loss function
trainer = gluon.Trainer(params, opt, kvstore=None)
# Create loss function
loss_fn = ...

# Train model
@@ -178,7 +178,7 @@ model.fit(train_data,

The example commands below show how to run distributed training. See the
[Running Horovod](https://github.com/horovod/horovod/blob/master/docs/running.md)
page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.
page for more instructions.

1. To run on a machine with 4 CPUs:

27 changes: 15 additions & 12 deletions example/distributed_training-horovod/gluon_mnist.py
@@ -39,10 +39,15 @@
                    help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.9,
                    help='SGD momentum (default: 0.9)')
parser.add_argument('--use-gpu', action='store_true', default=False,
                    help='run training on GPU (default: False)')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disable training on GPU (default: False)')
args = parser.parse_args()

if not args.no_cuda:
    # Disable CUDA if there are no GPUs.
    if not mx.test_utils.list_gpus():
        args.no_cuda = True

logging.basicConfig(level=logging.INFO)
logging.info(args)

@@ -113,7 +118,7 @@ def evaluate(model, data_iter, context):
hvd.init()

# Horovod: pin context to local rank
context = mx.gpu(hvd.local_rank()) if args.use_gpu else mx.cpu(hvd.local_rank())
context = mx.cpu(hvd.local_rank()) if args.no_cuda else mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Load training and validation data
@@ -124,27 +129,25 @@ def evaluate(model, data_iter, context):
model.cast(args.dtype)
model.hybridize()

# Define hyper parameters
# Create optimizer
optimizer_params = {'momentum': args.momentum,
                    'learning_rate': args.lr * hvd.size(),
                    'rescale_grad': 1.0 / args.batch_size}

# Add Horovod Distributed Optimizer
                    'learning_rate': args.lr * hvd.size()}
opt = mx.optimizer.create('sgd', **optimizer_params)
opt = hvd.DistributedOptimizer(opt)

# Initialize parameters
initializer = mx.init.Xavier(rnd_type='gaussian', factor_type="in",
                             magnitude=2)
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
# Horovod: fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create trainer, loss function and train metric
trainer = gluon.Trainer(params, opt, kvstore=None)
# Horovod: create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function and train metric
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
metric = mx.metric.Accuracy()

26 changes: 15 additions & 11 deletions example/distributed_training-horovod/module_mnist.py
@@ -94,7 +94,6 @@ def get_mnist_iterator(rank):
# Step 2: load data
train_iter, val_iter = get_mnist_iterator(hvd.rank())


# Step 3: define network
def conv_net():
    # placeholder for data
@@ -119,17 +118,10 @@ def conv_net():
    loss = mx.sym.SoftmaxOutput(data=fc2, name='softmax')
    return loss


# Step 4: fit the model
net = conv_net()
model = mx.mod.Module(symbol=net, context=context)
optimizer_params = {'learning_rate': args.lr * hvd.size(),
                    'rescale_grad': 1.0 / args.batch_size}
opt = mx.optimizer.create('sgd', **optimizer_params)

# Horovod: wrap optimizer with DistributedOptimizer
opt = hvd.DistributedOptimizer(opt)

# Step 4: initialize parameters
initializer = mx.init.Xavier(rnd_type='gaussian', factor_type="in",
                             magnitude=2)
model.bind(data_shapes=train_iter.provide_data,
@@ -144,15 +136,27 @@ def conv_net():
hvd.broadcast_parameters(aux_params, root_rank=0)
model.set_params(arg_params=arg_params, aux_params=aux_params)

# Step 5: create optimizer
optimizer_params = {'learning_rate': args.lr * hvd.size(),
                    'rescale_grad': 1.0 / args.batch_size}
opt = mx.optimizer.create('sgd', **optimizer_params)

# Horovod: wrap optimizer with DistributedOptimizer
opt = hvd.DistributedOptimizer(opt)

# Step 6: fit and train model
batch_cb = None
if hvd.rank() == 0:
    batch_cb = mx.callback.Speedometer(args.batch_size * hvd.size())
model.fit(train_iter,  # train data
          kvstore=None,  # no kvstore
          eval_data=val_iter,  # validation data
          optimizer=opt,  # use SGD to train
          eval_metric='acc',  # report accuracy during training
          batch_end_callback=mx.callback.Speedometer(args.batch_size),
          batch_end_callback=batch_cb,  # report training speed
          num_epoch=args.epochs)  # train for at most 10 dataset passes

# Step 5: evaluate model accuracy
# Step 7: evaluate model accuracy
acc = mx.metric.Accuracy()
model.score(val_iter, acc)

42 changes: 28 additions & 14 deletions example/distributed_training-horovod/resnet50_imagenet.py
@@ -279,18 +279,6 @@ def reset(self):
initializer = mx.init.Xavier(rnd_type='gaussian', factor_type="in",
                             magnitude=2)

# Create optimizer
optimizer_params = {'wd': args.wd,
                    'momentum': args.momentum,
                    'rescale_grad': 1.0 / batch_size,
                    'lr_scheduler': lr_sched}
if args.dtype == 'float16':
    optimizer_params['multi_precision'] = True
opt = mx.optimizer.create('sgd', **optimizer_params)

# Horovod: wrap optimizer with DistributedOptimizer
opt = hvd.DistributedOptimizer(opt)


def train_gluon():
    def evaluate(epoch):
@@ -320,8 +308,18 @@ def evaluate(epoch):
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create trainer, loss function and train metric
trainer = gluon.Trainer(params, opt, kvstore=None)
# Create optimizer
optimizer_params = {'wd': args.wd,
                    'momentum': args.momentum,
                    'lr_scheduler': lr_sched}
if args.dtype == 'float16':
    optimizer_params['multi_precision'] = True
opt = mx.optimizer.create('sgd', **optimizer_params)

# Horovod: create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function and train metric
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
metric = mx.metric.Accuracy()

@@ -410,6 +408,22 @@ def train_module():
hvd.broadcast_parameters(aux_params, root_rank=0)
mod.set_params(arg_params=arg_params, aux_params=aux_params)

# Create optimizer
# Note that when using Module API, we need to specify rescale_grad since
# we create optimizer first and wrap it with DistributedOptimizer. For
# Gluon API, it is handled in Trainer.step() function so there is no need
# to specify rescale_grad (see above train_gluon() function).
optimizer_params = {'wd': args.wd,
                    'momentum': args.momentum,
                    'rescale_grad': 1.0 / batch_size,
                    'lr_scheduler': lr_sched}
if args.dtype == 'float16':
    optimizer_params['multi_precision'] = True
opt = mx.optimizer.create('sgd', **optimizer_params)

# Horovod: wrap optimizer with DistributedOptimizer
opt = hvd.DistributedOptimizer(opt)

# Setup validation data and callback during training
eval_data = None
if args.eval_epoch:
