
example: Add MXNet Gluon training example of MNIST. #22

Merged
7 commits merged into bytedance:master on Jul 3, 2019

Conversation

haoxintong
Contributor

Hi,
I followed the Horovod code and added an example for MXNet Gluon.
Two changes here:

  1. DistributedTrainer for the MXNet Gluon API (a fuller runnable sketch follows after this list):
# create a distributed trainer
trainer = bps.DistributedTrainer(params, "sgd", optimizer_params)
...

for i, batch in enumerate(train_data):
    ...
    loss.backward()
    trainer.step(args.batch_size)
  2. An example MXNet Gluon training script:
cd byteps
bash example/mxnet-gluon/run_mnist_gluon.sh 
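
For reference, a self-contained sketch of how the pieces fit together. The model, data, and hyperparameters below are placeholders rather than the actual MNIST script, and bps.init, bps.local_rank, bps.broadcast_parameters, and bps.DistributedTrainer are assumed to follow the Horovod-style API of byteps.mxnet:

import mxnet as mx
from mxnet import autograd, gluon
import byteps.mxnet as bps

# Start BytePS and pin this worker to its local GPU.
bps.init()
ctx = mx.gpu(bps.local_rank())

# Toy model and data so the sketch runs end to end; the real example builds an MNIST net.
net = gluon.nn.Dense(10, in_units=784)
net.initialize(mx.init.Xavier(), ctx=ctx)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

batch_size = 64
dataset = gluon.data.dataset.ArrayDataset(
    mx.nd.random.uniform(shape=(256, 784)), mx.nd.zeros(256))
train_data = gluon.data.DataLoader(dataset, batch_size=batch_size)

params = net.collect_params()
optimizer_params = {'learning_rate': 0.01, 'momentum': 0.9}

# Make all workers start from the same weights, then create the distributed
# trainer, which synchronizes gradients through BytePS inside step().
bps.broadcast_parameters(params, root_rank=0)
trainer = bps.DistributedTrainer(params, "sgd", optimizer_params)

for data, label in train_data:
    data, label = data.as_in_context(ctx), label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(batch_size)

Note that this still has to be launched through the BytePS launcher (as run_mnist_gluon.sh does), since bps.init() expects the scheduler/server environment variables to be set.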

I tested the script with this environment:

  • Ubuntu 18.04
  • MXNet 1.5.0
  • CUDA 10.0 with 2x GTX 1070
  • Python 3.5.2 (Anaconda)

The expected output:

BytePS launching worker
INFO:root:Namespace(batch_size=64, dtype='float32', epochs=5, j=2, lr=0.01, momentum=0.9, no_cuda=False)
INFO:root:Namespace(batch_size=64, dtype='float32', epochs=5, j=2, lr=0.01, momentum=0.9, no_cuda=False)

...
INFO:root:[Epoch 0 Batch 900] Training: accuracy=0.961033
INFO:root:[Epoch 0 Batch 900] Training: accuracy=0.962351
INFO:root:Epoch[0]      Speed=17801.42 samples/s        Time cost=6.741037
INFO:root:Epoch[0]      Train: accuracy=0.963230        Validation: accuracy=0.985800
...

It seems OK, but I'm not sure whether it is actually training with BytePS, because even if I use gluon.Trainer instead, the script still runs fine, even though no parameters are broadcast to the other GPU in that case. Any ideas on how I can confirm this?

@bobzhuyb
Member

Thank you for the contribution! We will verify the correctness as soon as possible. I guess your question is whether BytePS really kicked in during the whole process.

The simplest way to check is to set BYTEPS_LOG_LEVEL=INFO, which prints all the initialization info, including the init of every tensor. The most verbose level is BYTEPS_LOG_LEVEL=TRACE, which prints everything during training, including every stage (local reduce, copy, push, pull, local broadcast) of every tensor. If you can see the verbose output all the way until training ends, BytePS is at least doing some work.
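
For example, assuming the variable is read from the process environment when BytePS initializes, you can export it in the shell before running run_mnist_gluon.sh, or set it from Python before init:

import os

# Assumption: BYTEPS_LOG_LEVEL is picked up from the environment during
# initialization, so setting it before bps.init() (or exporting it in the
# launch shell) should take effect.
os.environ["BYTEPS_LOG_LEVEL"] = "INFO"   # or "TRACE" for per-stage logs

import byteps.mxnet as bps
bps.init()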

@haoxintong
Contributor Author

@bobzhuyb Thanks for your reply.

I set BYTEPS_LOG_LEVEL=TRACE and got the following output:

...
[2019-06-28 23:03:30.152507: D byteps/common/global.cc:278] Declared tensor byteps.parameter_0, declared key (not PS key): 0 rank=1
[2019-06-28 23:03:30.152593: D byteps/common/global.cc:278] Declared tensor byteps.parameter_2, declared key (not PS key): 2 rank=0
[2019-06-28 23:03:30.152621: D byteps/common/operations.cc:237] byteps.parameter_2 partitioned to 1 part(s), total_len=200, key_range=[131072, 131072] rank=0
[2019-06-28 23:03:30.152789: D byteps/common/operations.cc:237] byteps.parameter_0 partitioned to 1 part(s), total_len=80, key_range=[0, 0] rank=1
[2019-06-28 23:03:30.152905: D byteps/common/global.cc:278] Declared tensor byteps.parameter_3, declared key (not PS key): 3 rank=0
[2019-06-28 23:03:30.152927: D byteps/common/operations.cc:237] byteps.parameter_3 partitioned to 1 part(s), total_len=100000, key_range=[196608, 196608] rank=0
[2019-06-28 23:03:30.153383: D byteps/common/global.cc:278] Declared tensor byteps.parameter_4, declared key (not PS key): 4 rank=0
[2019-06-28 23:03:30.153453: D byteps/common/global.cc:278] Declared tensor byteps.parameter_1, declared key (not PS key): 1 rank=1
...
...
[2019-06-28 23:33:00.643075: T byteps/common/scheduled_queue.cc:121] Queue COORDINATE_REDUCE getTask: byteps.parameter_3_0 key: 196608 rank: 0
[2019-06-28 23:33:00.643090: T byteps/common/core_loops.cc:66] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.parameter_3_0, key=196608; Passing to the next queue.
[2019-06-28 23:33:00.643099: T byteps/common/scheduled_queue.cc:86] Queue REDUCE addTask: byteps.parameter_3_0 key: 196608 rank: 0
[2019-06-28 23:33:00.643116: T byteps/common/core_loops.cc:126] byteps.parameter_3_0 send coordinate info: Signal=0, rank=0, key=196608
[2019-06-28 23:33:00.643124: T byteps/common/scheduled_queue.cc:121] Queue COORDINATE_REDUCE getTask: byteps.parameter_7_0 key: 458752 rank: 0
[2019-06-28 23:33:00.643133: T byteps/common/core_loops.cc:66] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.parameter_7_0, key=458752; Passing to the next queue.
[2019-06-28 23:33:00.643141: T byteps/common/scheduled_queue.cc:86] Queue REDUCE addTask: byteps.parameter_7_0 key: 458752 rank: 0
...

Does that mean BytePS is working well with the training script? 🤔

Btw, I have the same issue as #7: child processes are not killed when the main process exits. I have to kill them manually every time I finish testing my scripts.

@bobzhuyb
Member

Nice. I think it's working.

We'll have a look at the exit problem.

@bobzhuyb bobzhuyb (Member) left a comment

Would you rename all "allreduce" to "push_pull"? Just find and replace in your code editor.

@haoxintong
Contributor Author

Thanks for the review.
I changed "allreduce" in the docs, but the function name _allreduce_grads(self) is unchanged, since it is inherited from gluon.Trainer and is called in trainer.step():

def step(self, batch_size, ignore_stale_grad=False):
    rescale_grad = self._scale / batch_size
    self._check_and_rescale_grad(rescale_grad)

    ...

    self._allreduce_grads()
    self._update(ignore_stale_grad)
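
For context, a rough sketch of how the BytePS trainer can override that hook. This is illustrative only; the byteps_push_pull name, its signature, and the tensor names are assumptions based on the Horovod-style API, not necessarily the exact code in this PR:

from mxnet import gluon
import byteps.mxnet as bps

class DistributedTrainerSketch(gluon.Trainer):
    """Illustrative sketch: replace the kvstore allreduce with BytePS push-pull."""

    def _allreduce_grads(self):
        for i, param in enumerate(self._params):
            if param.grad_req != 'null':
                # Assumed API: push-pull (sum) the gradient across workers;
                # rescaling by batch size is handled by trainer.step().
                bps.byteps_push_pull(param.list_grad()[0],
                                     name="parameter_" + str(i),
                                     is_average=False)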

@bobzhuyb bobzhuyb merged commit 7a70ac5 into bytedance:master Jul 3, 2019
@bobzhuyb bobzhuyb self-assigned this Jul 3, 2019
@bobzhuyb
Member

bobzhuyb commented Jul 3, 2019

Merged. Thank you for your contribution!

ymjiang pushed a commit that referenced this pull request Jul 23, 2019
* mxnet: add DistributedTrainer for mxnet gluon API

* example & test: add mxnet gluon example of MNIST training scripts

* example & test: use the correct distributed trainer

* mxnet: fix description in DistributedTrainer doc
pleasantrabbit pushed a commit that referenced this pull request Jul 13, 2020
* 1bit: use double

* 1bit: fix

* misc: new line eof