This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Running "example/image-classification" on two machines hangs with no response for a long time #2117

Closed
n080509 opened this issue May 12, 2016 · 6 comments

Comments

@n080509

n080509 commented May 12, 2016

I am running "example/image-classification" on two machines:
10.136.159.116
10.136.159.117

I ran the following command on 10.136.159.116, but nothing responds. Please help me. What should I do?

root@116:~/mxnet/example/image-classification# ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync

@qiaohaijun
Contributor

Does it work on a single machine?

@n080509
Author

n080509 commented May 12, 2016

Now I run it as follows. What should I do next?

root@116:~/mxnet/example/image-classification# export PS_VERBOSE=1; ../../tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
python train_mnist.py --kv-store dist_sync
2016-05-12 05:34:37,993 INFO rsync /home/slave/mxnet/example/image-classification/ -> 10.136.159.116:/tmp/mxnet
[05:34:38] src/van.cc:65: Bind to role=scheduler, id=1, ip=0.0.0.116, port=9098
2016-05-12 05:34:38,203 INFO rsync /home/slave/mxnet/example/image-classification/ -> 10.136.159.117:/tmp/mxnet
^CTraceback (most recent call last):
  File "../../tools/launch.py", line 79, in <module>
    main()
  File "../../tools/launch.py", line 64, in main
    ssh.submit(args)
  File "/home/slave/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 76, in submit
    pscmd=(' '.join(args.command)))
  File "/home/slave/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 422, in submit
    pserver.join()
  File "/home/slave/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 371, in join
    self.thread.join(100)
  File "/usr/lib/python2.7/threading.py", line 960, in join
    self.__block.wait(delay)
  File "/usr/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
KeyboardInterrupt

@n080509
Author

n080509 commented May 12, 2016

It does not work when running on two machines.

@n080509
Author

n080509 commented May 12, 2016

Later I tried multiple workers on a single machine, and that does not work either.
root@116:~/mxnet/example/image-classification# ../../tools/launch.py -n 2 -s 1 python train_mnist.py --kv-store dist_sync
python train_mnist.py --kv-store dist_sync
[06:29:44] src/van.cc:65: Bind to role=scheduler, id=1, ip=0.0.0.116, port=9102
[06:29:44] src/van.cc:65: Bind to role=server, ip=10.136.159.116, port=47526
[06:29:44] src/van.cc:65: Bind to role=worker, ip=10.136.159.116, port=45587
[06:29:44] src/van.cc:65: Bind to role=worker, ip=10.136.159.116, port=39873

^CTraceback (most recent call last):
  File "../../tools/launch.py", line 79, in <module>
    main()
  File "../../tools/launch.py", line 55, in main
    local.submit(args)
  File "/home/slave/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/local.py", line 83, in submit
    pscmd=(' '.join(args.command)))
  File "/home/slave/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 422, in submit
    pserver.join()
  File "/home/slave/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 371, in join
    self.thread.join(100)
  File "/usr/lib/python2.7/threading.py", line 960, in join
    self.__block.wait(delay)
  File "/usr/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
KeyboardInterrupt
root@116:~/mxnet/example/image-classification#
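
A note on the output above: the scheduler line reports ip=0.0.0.116, while the server and worker lines report 10.136.159.116. Nothing in this thread confirms that the mismatch causes the hang, but it is worth checking. The sketch below is only an illustration in Python. The names `roles_by_ip` and `suspicious_ips` are hypothetical, and the regular expression assumes the "Bind to role=..." line format exactly as quoted in this issue.

```python
import re
from collections import defaultdict

# Assumed line format, copied from the output quoted in this issue, e.g.
#   [06:29:44] src/van.cc:65: Bind to role=scheduler, id=1, ip=0.0.0.116, port=9102
# The "id=..." field is optional because server/worker lines do not carry it.
BIND_RE = re.compile(r"Bind to role=(\w+),(?: id=\d+,)? ip=([\d.]+), port=(\d+)")


def roles_by_ip(log_text):
    """Group the roles announced in the log by the IP address they report."""
    groups = defaultdict(list)
    for role, ip, _port in BIND_RE.findall(log_text):
        groups[ip].append(role)
    return dict(groups)


def suspicious_ips(log_text):
    """Return reported IPs whose first octet is 0 (not a usable host address)."""
    return sorted(ip for ip in roles_by_ip(log_text) if ip.split(".")[0] == "0")


# Sample input: the four bind lines from this issue, with the interleaved
# timestamps of the two worker lines tidied up.
LOG = """\
[06:29:44] src/van.cc:65: Bind to role=scheduler, id=1, ip=0.0.0.116, port=9102
[06:29:44] src/van.cc:65: Bind to role=server, ip=10.136.159.116, port=47526
[06:29:44] src/van.cc:65: Bind to role=worker, ip=10.136.159.116, port=45587
[06:29:44] src/van.cc:65: Bind to role=worker, ip=10.136.159.116, port=39873
"""

if __name__ == "__main__":
    for ip, roles in roles_by_ip(LOG).items():
        print(ip, roles)
    print("suspicious:", suspicious_ips(LOG))
```

If the scheduler address is flagged, one thing to look at (an assumption, not something established here) is how the machine's hostname, shown as `116` in the shell prompt, resolves to an IP address.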

@n080509
Author

n080509 commented May 12, 2016

Running python train_mnist.py --network lenet on a single machine works fine.

@qiaohaijun
Contributor

qiaohaijun commented May 12, 2016

Have you set up passwordless SSH login?
Single-machine multi-GPU training is not configured that way. You can contact me via QQ.
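
The passwordless-login question can be checked with a short script. This is a minimal sketch, not part of MXNet or its launcher. It assumes the `hosts` file passed to `-H` lists one host per line, and the function names `parse_hosts`, `ssh_check_command`, and `can_login_without_password` are hypothetical. It relies on the OpenSSH client options `BatchMode=yes` (never prompt for a password) and `ConnectTimeout`.

```python
import subprocess
import sys


def parse_hosts(text):
    """Return the non-empty lines of a hosts file (assumed: one host per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # disable password/passphrase prompts
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        host,
        "true",                     # trivial remote command
    ]


def can_login_without_password(host):
    """Return True if `ssh host true` succeeds without any interactive prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_ssh.py hosts
    with open(sys.argv[1]) as f:
        for host in parse_hosts(f.read()):
            status = "ok" if can_login_without_password(host) else "FAILED"
            print(host, status)
```

Run it from the machine that starts `launch.py`. Any host reported as FAILED would need its SSH keys set up before the ssh launcher can start processes there.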
