fix kvstore nightly failure #15156

roywei · 2019-06-05T06:48:36Z

Description

In summary:
root cause of failure is num of GPUs changed from 2 to 1. NODE_LINUX_GPU is G3.8x wiht 2 GPUs and NODE_LINUX_GPU_P3 is P3.2x with 1 GPU

so

contexts =[mx.gpu(0), mx.gpu(1)] -> [mx.gpu(0)]
b length changed from 2 to 1
b = [mx.nd.ones(shape, ctx) for ctx in contexts]

It seems in kvstore, when pushing a list, only len list lenght >1 , aggregation happens, and everything will be on the same context during update. But when lenght = 1, the sum/aggregation won't happen, causing update with ndarray on different context failed. User should make sure ndarrays are on the same device.

more details and reproduciable code at #15152 (comment)

roywei · 2019-06-05T18:08:12Z

update: root cause is GPU changed from 2 to 1. I mistook the instance type used in CI. see updates in #15152

piyushghai · 2019-06-05T18:37:12Z

@mxnet-label-bot Add [pr-awaiting-review, KVStore]

apeforest · 2019-06-05T22:11:17Z

docs/tutorials/python/kvstore.md

@@ -86,7 +86,7 @@ print(a.asnumpy())
 `[[ 4.  4.  4.],[ 4.  4.  4.]]`<!--notebook-skip-line-->

 ```python
-kv.push(3, mx.nd.ones(shape))
+kv.push(3, mx.nd.ones(shape, contexts[0]))


If there are multiple GPUs, will this change instruct user only storing NDArray to the first one?

Since this is an user guide page, shouldn't it be more generic to have a simple user experience? I feel it might be better to fix this in nightly pipeline than updating user guide.

I think we should make the tutorial simpler, just demonstrating pushing a list of cpu arrays to kvstore and pull the result.

@apeforest had another discussion with @eric-haibin-lin , making it to a list of 4 CPU arrays will provide consistent result. If users has 1/2/3 GPUs, it will produce diffrent results compared to the value on tutorial, and may cause confusion. Also if only 1 GPU, it defeats the purpose of the tutorial.
I have added a note saying sum will only happen if value list is larger than 1.

apeforest

blocking the change for now

roywei · 2019-06-06T04:48:11Z

@apeforest any more concerns? I have documented the sum/aggregation behavior(only happens for >1 lists).
I have also taken another look at API doc here: https://mxnet.incubator.apache.org/api/python/kvstore/kvstore.html
It says clearly Direct interactions with KVStore are dangerous and not recommended., so maybe we should not confuse user in the tutorial about more complicated/unstable use cases. Also in the API doc, it's always 4 GPU, detecting the number of GPUs dynamically and run the same code in this tutorial can be confusing for users.

I have tested the following nightly job passed.

ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 1500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu

log

----------------------------------------------------------------------
Ran 48 tests in 1725.072s

OK
build.py: 2019-06-06 00:18:09,787Z INFO Waiting for status of container 2af8c3bb1fdb for 600 s$
build.py: 2019-06-06 00:18:09,950Z INFO Container exit status: {'Error': None, 'StatusCode': 0$
build.py: 2019-06-06 00:18:09,950Z INFO Container exited with success 👍
build.py: 2019-06-06 00:18:09,950Z INFO Stopping container: 2af8c3bb1fdb
build.py: 2019-06-06 00:18:09,952Z INFO Removing container: 2af8c3bb1fdb

apeforest

LGTM. Thanks for the update

docs/tutorials/python/kvstore.md

apeforest

Sorry, missed one part. Please correct the gramma of a sentence.

Co-Authored-By: Lin Yuan <[email protected]>

* fix kvstore * fix * update tutorial * Update docs/tutorials/python/kvstore.md Co-Authored-By: Lin Yuan <[email protected]>

roywei requested a review from szha as a code owner June 5, 2019 06:48

marcoabreu added KVStore pr-awaiting-review PR is waiting for code review labels Jun 5, 2019

apeforest reviewed Jun 5, 2019

View reviewed changes

roywei added 2 commits June 5, 2019 15:18

fix kvstore

4b35bee

fix

6840779

apeforest suggested changes Jun 5, 2019

View reviewed changes

update tutorial

cbc076e

roywei force-pushed the fix_kvstore branch from 7f11f76 to cbc076e Compare June 5, 2019 23:49

eric-haibin-lin approved these changes Jun 6, 2019

View reviewed changes

apeforest approved these changes Jun 6, 2019

View reviewed changes

apeforest reviewed Jun 6, 2019

View reviewed changes

docs/tutorials/python/kvstore.md Outdated Show resolved Hide resolved

apeforest suggested changes Jun 6, 2019

View reviewed changes

Update docs/tutorials/python/kvstore.md

103505c

Co-Authored-By: Lin Yuan <[email protected]>

apeforest approved these changes Jun 6, 2019

View reviewed changes

apeforest merged commit 3f4f3d5 into apache:master Jun 6, 2019

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

fix kvstore nightly failure (apache#15156)

aceaf52

* fix kvstore * fix * update tutorial * Update docs/tutorials/python/kvstore.md Co-Authored-By: Lin Yuan <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix kvstore nightly failure #15156

fix kvstore nightly failure #15156

roywei commented Jun 5, 2019 •

edited

Loading

roywei commented Jun 5, 2019 •

edited

Loading

piyushghai commented Jun 5, 2019

apeforest Jun 5, 2019

apeforest Jun 5, 2019 •

edited

Loading

eric-haibin-lin Jun 5, 2019

roywei Jun 5, 2019

apeforest left a comment

roywei commented Jun 6, 2019 •

edited

Loading

apeforest left a comment

apeforest left a comment

fix kvstore nightly failure #15156

fix kvstore nightly failure #15156

Conversation

roywei commented Jun 5, 2019 • edited Loading

Description

roywei commented Jun 5, 2019 • edited Loading

piyushghai commented Jun 5, 2019

apeforest Jun 5, 2019

Choose a reason for hiding this comment

apeforest Jun 5, 2019 • edited Loading

Choose a reason for hiding this comment

eric-haibin-lin Jun 5, 2019

Choose a reason for hiding this comment

roywei Jun 5, 2019

Choose a reason for hiding this comment

apeforest left a comment

Choose a reason for hiding this comment

roywei commented Jun 6, 2019 • edited Loading

apeforest left a comment

Choose a reason for hiding this comment

apeforest left a comment

Choose a reason for hiding this comment

roywei commented Jun 5, 2019 •

edited

Loading

roywei commented Jun 5, 2019 •

edited

Loading

apeforest Jun 5, 2019 •

edited

Loading

roywei commented Jun 6, 2019 •

edited

Loading