Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix kvstore nightly failure #15156

Merged
merged 4 commits into from
Jun 6, 2019
Merged

fix kvstore nightly failure #15156

merged 4 commits into from
Jun 6, 2019

Conversation

roywei
Copy link
Member

@roywei roywei commented Jun 5, 2019

Description

fix #15152

In summary:
root cause of failure is num of GPUs changed from 2 to 1. NODE_LINUX_GPU is G3.8x wiht 2 GPUs and NODE_LINUX_GPU_P3 is P3.2x with 1 GPU

so

contexts =[mx.gpu(0), mx.gpu(1)] -> [mx.gpu(0)]
b length changed from 2 to 1
b = [mx.nd.ones(shape, ctx) for ctx in contexts]

It seems in kvstore, when pushing a list, only len list lenght >1 , aggregation happens, and everything will be on the same context during update. But when lenght = 1, the sum/aggregation won't happen, causing update with ndarray on different context failed. User should make sure ndarrays are on the same device.

more details and reproduciable code at #15152 (comment)

@roywei roywei requested a review from szha as a code owner June 5, 2019 06:48
@roywei
Copy link
Member Author

roywei commented Jun 5, 2019

update: root cause is GPU changed from 2 to 1. I mistook the instance type used in CI. see updates in #15152

@piyushghai
Copy link
Contributor

@mxnet-label-bot Add [pr-awaiting-review, KVStore]

@marcoabreu marcoabreu added KVStore pr-awaiting-review PR is waiting for code review labels Jun 5, 2019
@@ -86,7 +86,7 @@ print(a.asnumpy())
`[[ 4. 4. 4.],[ 4. 4. 4.]]`<!--notebook-skip-line-->

```python
kv.push(3, mx.nd.ones(shape))
kv.push(3, mx.nd.ones(shape, contexts[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple GPUs, will this change instruct user only storing NDArray to the first one?

Copy link
Contributor

@apeforest apeforest Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an user guide page, shouldn't it be more generic to have a simple user experience? I feel it might be better to fix this in nightly pipeline than updating user guide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make the tutorial simpler, just demonstrating pushing a list of cpu arrays to kvstore and pull the result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apeforest had another discussion with @eric-haibin-lin , making it to a list of 4 CPU arrays will provide consistent result. If users has 1/2/3 GPUs, it will produce diffrent results compared to the value on tutorial, and may cause confusion. Also if only 1 GPU, it defeats the purpose of the tutorial.
I have added a note saying sum will only happen if value list is larger than 1.

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking the change for now

@roywei
Copy link
Member Author

roywei commented Jun 6, 2019

@apeforest any more concerns? I have documented the sum/aggregation behavior(only happens for >1 lists).
I have also taken another look at API doc here: https://mxnet.incubator.apache.org/api/python/kvstore/kvstore.html
It says clearly Direct interactions with KVStore are dangerous and not recommended., so maybe we should not confuse user in the tutorial about more complicated/unstable use cases. Also in the API doc, it's always 4 GPU, detecting the number of GPUs dynamically and run the same code in this tutorial can be confusing for users.

I have tested the following nightly job passed.

ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 1500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu

log

----------------------------------------------------------------------
Ran 48 tests in 1725.072s

OK
build.py: 2019-06-06 00:18:09,787Z INFO Waiting for status of container 2af8c3bb1fdb for 600 s$
build.py: 2019-06-06 00:18:09,950Z INFO Container exit status: {'Error': None, 'StatusCode': 0$
build.py: 2019-06-06 00:18:09,950Z INFO Container exited with success 👍
build.py: 2019-06-06 00:18:09,950Z INFO Stopping container: 2af8c3bb1fdb
build.py: 2019-06-06 00:18:09,952Z INFO Removing container: 2af8c3bb1fdb

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the update

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed one part. Please correct the gramma of a sentence.

@apeforest apeforest merged commit 3f4f3d5 into apache:master Jun 6, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* fix kvstore

* fix

* update tutorial

* Update docs/tutorials/python/kvstore.md

Co-Authored-By: Lin Yuan <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
KVStore pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI][nightly] nightly test tutorial failure: test_tutorials.test_python_kvstore
5 participants