This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MNIST training example not exiting on CPU instances after training completion #5065

Closed
sampathchanda opened this issue Feb 20, 2017 · 2 comments


Hi,

I am running MNIST training on 2 CPU instances using MXNet for 10 epochs (the default script provided). However, even after training finishes (I can see that execution reaches the end of the train_mnist.py script), the process does not exit.

Can anyone help me with this issue?

Environment info

Operating System: Amazon Linux
Package used (Python/R/Scala/Julia): Python
MXNet installed from source
MXNet commit hash (git rev-parse HEAD): 266e439
Python version and distribution: Python 2.7.12

Steps to reproduce

  1. cd $HOME/mxnet/examples/image_classification/
  2. ../../tools/launch.py -n 1 -H hosts python train_mnist.py

The hosts file contains the worker hostnames (2 workers in my case):
deeplearning-worker1
deeplearning-worker2
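
For readers reproducing this setup, here is a minimal sketch of how the launch command could be assembled from the hosts file. The helper names `parse_hosts` and `build_launch_command` are hypothetical and not part of MXNet. Note that the command above passes `-n 1`; setting `-n` to the number of hosts listed is an assumption made for illustration, not something this thread confirms.

```python
import shlex


def parse_hosts(text):
    """Return the hostnames in a hosts file, one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_launch_command(hosts,
                         launcher="../../tools/launch.py",
                         hostfile="hosts",
                         script="train_mnist.py"):
    """Build the argument list for the launcher.

    Assumption: one worker per host, so -n is set to len(hosts).
    """
    return [launcher, "-n", str(len(hosts)), "-H", hostfile, "python", script]


# Example with the two workers from this report.
hosts_text = "deeplearning-worker1\ndeeplearning-worker2\n"
command = build_launch_command(parse_hosts(hosts_text))
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run` from the example directory; building it as a list rather than a single string avoids shell-quoting problems.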


Soonhwan-Kwon commented Apr 16, 2017

If you launched it with ../../tools/launch.py, you can use ../../tools/kill-mxnet.py to kill the processes.


sampathchanda commented Apr 17, 2017

@Soonhwan-Kwon Thanks for the comment. It turns out there was a problem with the way I set up my environment for running the example. I fixed it with help from @qiyuangong in issue #5094.

For anyone else facing this issue, please follow the steps described by @qiyuangong in issue #5094 to get the scripts working as expected. I am now able to run the example successfully on a multi-CPU cluster, so I am closing this issue.
