distributed training notebook tests #12363
Comments
@mxnet-label-bot [Distributed] |
Does MXNet have to be built from source with USE_DIST_KVSTORE=1? Does the pip package not support distributed training? |
pip package is already built with USE_DIST_KVSTORE=1. cc @szha in case I'm wrong |
dist kvstore is already supported in pip. |
could you upgrade your pip installation to 1.2.1post1? |
@vishaalkapoor - Can we utilize your straightdope notebook testing setup here? |
@sandeep-krishnamurthy It looks like the tutorial/notebook executor will need to be wrapped by a launcher script that executes notebooks in parallel on several nodes. I modified the existing tutorial/notebook executor so that it can run arbitrary notebooks, not just tutorials, on a single host. There is no scaffolding for multi-host execution. https://github.com/apache/incubator-mxnet/blob/ae5d60fa830090f4882a433d9b88c53c26c42b4f/tests/utils/notebook_test/__init__.py#L39 |
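A rough sketch of what such a launcher wrapper might look like, assuming passwordless ssh to every host and that `jupyter nbconvert` is installed on each of them. The function names, host names, and notebook path are hypothetical and are not part of the existing `notebook_test` utility. A real distributed run would also need the kvstore role environment set on each node.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_remote_command(host, notebook_path, timeout_s=600):
    """Build an ssh command that executes one notebook on one host."""
    remote = [
        "jupyter", "nbconvert", "--to", "notebook", "--execute",
        "--ExecutePreprocessor.timeout={}".format(timeout_s),
        notebook_path,
    ]
    # ssh sends the remote command as a single string, so quote each part.
    return ["ssh", host, " ".join(shlex.quote(part) for part in remote)]


def run_on_hosts(hosts, notebook_path, runner=subprocess.call):
    """Run the notebook on every host in parallel; return {host: exit code}."""
    commands = [build_remote_command(h, notebook_path) for h in hosts]
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        codes = list(pool.map(runner, commands))
    return dict(zip(hosts, codes))
```

The `runner` argument is injectable so the wrapper can be exercised without real remote hosts, for example `run_on_hosts(["node1", "node2"], "dist.ipynb", runner=print)`.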
@meanmee It looks like there is a problem connecting to your remote instance. Did you set up passwordless ssh? I suggest moving the question to discuss.mxnet.io, which is actively monitored. GitHub issues are meant for bug reports and task/feature requests. |
Hi, thank you all. I solved the problem; here is my solution: |
# Create a distributed key-value store
store = kv.create('dist')
hi, is it |
'dist' is equivalent to 'dist_sync'. Using the documented options is recommended. |
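A minimal sketch of following that recommendation: map the `'dist'` shorthand to the documented `'dist_sync'` name before creating the store. The helper names are hypothetical; only `mx.kv.create` is an MXNet call, and it is imported lazily so the mapping can be checked without MXNet installed.

```python
def normalize_kvstore_type(name):
    """Return 'dist_sync' for the 'dist' shorthand; pass other names through."""
    return "dist_sync" if name == "dist" else name


def create_store(name):
    """Create a KVStore using the documented type name (requires MXNet)."""
    import mxnet as mx  # lazy import: only needed when a store is created
    return mx.kv.create(normalize_kvstore_type(name))
```

Creating a `'dist_sync'` store only makes sense inside a process started by a distributed launcher, so `create_store` is not expected to work in a plain interactive session.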
@mxnet-label-bot add[Distributed] |
It is not straightforward to reuse the existing tutorial test suite for distributed training tutorials. Distributed training usually involves launcher scripts and multiple processes, as mentioned in #10955
@indhub @ThomasDelteil @sandeep-krishnamurthy
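To illustrate why multiple processes are involved, here is a hedged sketch of the per-process environment a single-host launcher might prepare. It assumes the `DMLC_*` variables used by the ps-lite backend; the helper names and the default port are arbitrary choices for illustration, and this is not the project's launcher.

```python
import os
import subprocess


def build_role_envs(num_workers, num_servers,
                    root_uri="127.0.0.1", root_port=9091):
    """Return one (role, env) pair per process in a single-host cluster."""
    shared = {
        "DMLC_PS_ROOT_URI": root_uri,          # address of the scheduler
        "DMLC_PS_ROOT_PORT": str(root_port),   # arbitrary free port (assumption)
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
    }
    roles = ["scheduler"] + ["server"] * num_servers + ["worker"] * num_workers
    return [(role, dict(shared, DMLC_ROLE=role)) for role in roles]


def launch_local(command, num_workers, num_servers):
    """Start one process per role with the same command; return exit codes."""
    procs = [
        subprocess.Popen(command, env=dict(os.environ, **env))
        for _role, env in build_role_envs(num_workers, num_servers)
    ]
    return [p.wait() for p in procs]
```

A test harness would therefore have to start and supervise several processes per notebook (here, one scheduler plus the servers and workers) rather than a single kernel.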