This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

distributed training notebook tests #12363

Open
eric-haibin-lin opened this issue Aug 26, 2018 · 12 comments
Comments

@eric-haibin-lin
Member

It is not straightforward to reuse the existing tutorial test suite for distributed training tutorials. Distributed training usually involves launcher scripts and multiple processes, as mentioned in #10955.

@indhub @ThomasDelteil @sandeep-krishnamurthy

@ankkhedia
Contributor

@mxnet-label-bot [Distributed]

@meanmee

meanmee commented Aug 27, 2018

Does MXNet have to be built from source with USE_DIST_KVSTORE=1? Does the pip package not support distributed training?

@eric-haibin-lin
Member Author

The pip package is already built with USE_DIST_KVSTORE=1. cc @szha in case I'm wrong.

@szha
Member

szha commented Aug 27, 2018

The dist kvstore is already supported in the pip package.

@szha
Member

szha commented Aug 27, 2018

Could you upgrade your pip installation to 1.2.1post1?

@sandeep-krishnamurthy
Contributor

@vishaalkapoor - Can we utilize your straightdope notebook testing setup here?

@vishaalkapoor
Contributor

vishaalkapoor commented Aug 28, 2018

@sandeep-krishnamurthy It looks like the tutorial/notebook executor will need to be wrapped by a launcher script that executes notebooks in parallel on several nodes.

I modified the existing tutorial/notebook executor so that it can run arbitrary notebooks, not just tutorials, on a single host. There is no scaffolding for multi-host execution. https://github.com/apache/incubator-mxnet/blob/ae5d60fa830090f4882a433d9b88c53c26c42b4f/tests/utils/notebook_test/__init__.py#L39
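
To make the launcher idea concrete, here is a rough sketch of what such a wrapper might do on a single machine: prepare one environment per scheduler, server, and worker process, then start the same command under each. The `DMLC_*` variables are the ones the distributed kvstore reads; the helper names `build_role_envs` and `launch_all`, the default address and port, and the command that each role should run are assumptions for illustration, not part of the existing notebook executor.

```python
import os
import subprocess


def build_role_envs(num_workers, num_servers, root_uri="127.0.0.1", root_port=9091):
    """Return a list of (role, env) pairs, one per process to launch.

    Hypothetical helper: the address and port defaults are placeholders.
    """
    shared = {
        "DMLC_PS_ROOT_URI": root_uri,
        "DMLC_PS_ROOT_PORT": str(root_port),
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
    }
    # One scheduler, then the servers, then the workers.
    roles = ["scheduler"] + ["server"] * num_servers + ["worker"] * num_workers
    envs = []
    for role in roles:
        env = dict(os.environ)  # inherit PATH and friends from the parent
        env.update(shared)
        env["DMLC_ROLE"] = role
        envs.append((role, env))
    return envs


def launch_all(command, envs):
    """Start `command` once per environment and return the exit codes in order."""
    procs = [subprocess.Popen(command, env=env) for _, env in envs]
    return [proc.wait() for proc in procs]
```

A test harness could then treat the notebook as passing only if every exit code returned by `launch_all` is zero. Running across several hosts would additionally need something like ssh to start each process remotely, which this sketch does not cover.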

@eric-haibin-lin
Member Author

@meanmee It looks like there is a problem connecting to your remote instance. Did you set up passwordless SSH? I suggest you move the question to discuss.mxnet.io, which is actively monitored. GitHub issues are more for bug reports and task/feature requests.

@meanmee

meanmee commented Aug 29, 2018

Hi, thank you guys, I solved the problem. Here is my solution:
https://shimo.im/docs/JwobIyIK8ucMgc3r/

@YaoC

YaoC commented Nov 23, 2018

# Create a distributed key-value store
store = kv.create('dist')

Hi, is 'dist' a proper parameter for kv.create? It is not in the documented parameter list.
https://mxnet.incubator.apache.org/versions/master/api/python/kvstore/kvstore.html#mxnet.kvstore.create

@eric-haibin-lin
Member Author

'dist' is equivalent to 'dist_sync'. Using the documented options is recommended.
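
For anyone updating tutorial code, a minimal sketch of preferring the documented name could look like this. The helper names `documented_kvstore_type` and `create_store` are hypothetical. `mxnet` is imported lazily because creating a distributed store only works inside a job started by a launcher, so the name mapping can be checked without it.

```python
def documented_kvstore_type(name):
    """Map the undocumented 'dist' shorthand to 'dist_sync'; pass other names through."""
    return "dist_sync" if name == "dist" else name


def create_store(name="dist_sync"):
    """Create a kvstore using the documented type name (hypothetical wrapper)."""
    import mxnet as mx  # imported here so the mapping above is usable without MXNet

    return mx.kv.create(documented_kvstore_type(name))
```

In the snippet quoted above, this amounts to writing `kv.create('dist_sync')` instead of `kv.create('dist')`.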

@pinaraws

@mxnet-label-bot add[Distributed]


9 participants