example: Add MXNet Gluon training example of MNIST. #22
Conversation
Thank you for the contribution! We will verify the correctness as soon as possible. I guess your question is whether BytePS really kicked in during the whole process. The simplest way to check is to set BYTEPS_LOG_LEVEL=INFO, which prints all the initialization info, including the initialization of every tensor. The most verbose is BYTEPS_LOG_LEVEL=TRACE, which prints all the info during training, including every stage (local reduce, copy, push, pull, local broadcast) of every tensor. If you can see the verbose output all the time until the training ends, it means BytePS is at least doing some work.
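One way to turn that logging on from Python, before BytePS initializes, is sketched below; setting it this way is an assumption about where the variable must be read, and in practice it is usually just exported in the shell that launches the workers:

```python
import os

# Set before BytePS reads its configuration. "INFO" logs the initialization
# of every tensor; "TRACE" additionally logs every reduce/copy/push/pull/
# broadcast stage during training.
os.environ["BYTEPS_LOG_LEVEL"] = "TRACE"

import byteps.mxnet as bps  # import after the variable is set
bps.init()
```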
@bobzhuyb Thanks for your reply. I set BYTEPS_LOG_LEVEL=TRACE and got this output:
Does that mean BytePS is working well with the training script? 🤔 Btw, I have the same issue as #7: child processes are not killed when the main process exits. I have to kill them manually every time I finish testing my scripts.
Nice. I think it's working. We'll have a look at the exit problem.
Would you rename all "allreduce" to "push_pull"? It's just a find-and-replace in your code editor.
Thanks for the review. For context, `gluon.Trainer.step` calls `_allreduce_grads` by that name:

```python
def step(self, batch_size, ignore_stale_grad=False):
    rescale_grad = self._scale / batch_size
    self._check_and_rescale_grad(rescale_grad)
    ...
    self._allreduce_grads()
    self._update(ignore_stale_grad)
```
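A hedged sketch of how a distributed trainer can keep that hook name while swapping in BytePS communication; `byteps_push_pull` and its import path are assumptions based on the Horovod-style API, not a verified excerpt of this PR:

```python
import mxnet as mx
# Assumed location of the push_pull op; check byteps.mxnet for the real API.
from byteps.mxnet.ops import byteps_push_pull


class DistributedTrainer(mx.gluon.Trainer):
    def _allreduce_grads(self):
        # Keep the parent's method name because gluon.Trainer.step() calls it;
        # only the body changes: aggregate each gradient across workers.
        for i, param in enumerate(self._params):
            if param.grad_req != 'null':
                for grad in param.list_grad():
                    # Assumed signature: in-place push/pull of the gradient.
                    byteps_push_pull(grad, name="gradient_" + str(i))
```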
Merged. Thank you for your contribution!
* mxnet: add DistributedTrainer for mxnet gluon API
* example & test: add mxnet gluon example of MNIST training scripts
* example & test: use the correct distributed trainer
* mxnet: fix description in DistributedTrainer doc
* 1bit: use double
* 1bit: fix
* misc: new line eof
Hi,
I followed the code of Horovod and added an example for MXNet Gluon.
Two changes here:

* a `DistributedTrainer` for the MXNet Gluon API in `byteps.mxnet`
* an MNIST training example script for Gluon
To run the example:

```bash
cd byteps
bash example/mxnet-gluon/run_mnist_gluon.sh
```
I tested the script with these environment variables:
The expected output:
It seemed OK, but I'm not sure whether it was actually training with BytePS: even if I use plain `gluon.Trainer` instead, the script still works fine, although in that case no parameters are broadcast to the other GPUs. Any ideas on how I could confirm this?
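For reference, the Trainer swap that the example exercises looks roughly like the sketch below; the `DistributedTrainer` and `broadcast_parameters` calls follow the Horovod-style API and are assumptions, not a verified excerpt of the PR. With BYTEPS_LOG_LEVEL=TRACE set, the push/pull of every gradient should show up in the log at each `trainer.step`, which is one way to confirm BytePS is involved.

```python
import mxnet as mx
from mxnet import autograd, gluon
import byteps.mxnet as bps

bps.init()
ctx = mx.gpu(bps.local_rank())

net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)
params = net.collect_params()

# Plain version would be: gluon.Trainer(params, 'sgd', {...})
# BytePS version: gradients are push/pulled across workers each step.
trainer = bps.DistributedTrainer(params, 'sgd', {'learning_rate': 0.01})

# Start all workers from identical weights.
bps.broadcast_parameters(params, root_rank=0)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(32, 784), ctx=ctx)
label = mx.nd.zeros((32,), ctx=ctx)

with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(32)  # gradient push/pull happens here
```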