Update: the root cause is that the number of GPUs changed from 2 to 1. I mistook the instance type used in CI. See the updates in #15152.
@mxnet-label-bot Add [pr-awaiting-review, KVStore]
docs/tutorials/python/kvstore.md
@@ -86,7 +86,7 @@ print(a.asnumpy())
 `[[ 4. 4. 4.],[ 4. 4. 4.]]`<!--notebook-skip-line-->

 ```python
-kv.push(3, mx.nd.ones(shape))
+kv.push(3, mx.nd.ones(shape, contexts[0]))
If there are multiple GPUs, will this change instruct users to store the NDArray only on the first one?
Since this is a user guide page, shouldn't it be more generic, to keep the user experience simple? I feel it might be better to fix this in the nightly pipeline than to update the user guide.
I think we should make the tutorial simpler: just demonstrate pushing a list of CPU arrays to the kvstore and pulling the result.
@apeforest I had another discussion with @eric-haibin-lin. Making it a list of 4 CPU arrays will provide a consistent result. If users have 1, 2, or 3 GPUs, they will get results that differ from the values shown in the tutorial, which may cause confusion. Also, with only 1 GPU, it defeats the purpose of the tutorial.
I have added a note saying that the sum only happens if the value list has more than one element.
blocking the change for now
@apeforest any more concerns? I have documented the sum/aggregation behavior (it only happens for lists with more than one element). I have tested that the following nightly job passes.
log
LGTM. Thanks for the update
Sorry, missed one part. Please correct the grammar of a sentence.
Co-Authored-By: Lin Yuan <[email protected]>
* fix kvstore
* fix
* update tutorial
* Update docs/tutorials/python/kvstore.md

Co-Authored-By: Lin Yuan <[email protected]>
Description
fix #15152
In summary:
The root cause of the failure is that the number of GPUs changed from 2 to 1. NODE_LINUX_GPU is a G3.8x instance with 2 GPUs, and NODE_LINUX_GPU_P3 is a P3.2x instance with 1 GPU,
so
contexts = [mx.gpu(0), mx.gpu(1)] -> [mx.gpu(0)]
and the length of b changed from 2 to 1:
b = [mx.nd.ones(shape, ctx) for ctx in contexts]
It seems that in kvstore, when pushing a list, aggregation happens only when the list length is > 1, and everything is then on the same context during the update. When the length is 1, the sum/aggregation doesn't happen, so an update with NDArrays on different contexts fails. Users should make sure the NDArrays are on the same device.
More details and reproducible code are at #15152 (comment).
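
To make the aggregation behavior described above easier to follow, here is a toy model in plain Python. It is only a sketch of my understanding, not MXNet code and not the reproducer from the issue. `ToyKVStore`, its fixed `device`, the `(device, value)` pairs, and the add-on-update rule are all hypothetical names and assumptions made for illustration.

```python
class ToyKVStore:
    """Toy stand-in for the push behavior described in this PR. Not MXNet code."""

    def __init__(self, device):
        # Assumption: the store keeps its values on one fixed device.
        self.device = device
        self.store = {}  # key -> (device, value)

    def init(self, key, value):
        self.store[key] = (self.device, value)

    def push(self, key, values):
        """Push a list of (device, number) pairs for one key."""
        if len(values) > 1:
            # More than one value: sum them; the merged result lands on
            # the store's device, so the update sees a single context.
            merged = (self.device, sum(v for _, v in values))
        else:
            # Exactly one value: no aggregation, it keeps its own device.
            merged = values[0]

        stored_device, stored_value = self.store[key]
        if merged[0] != stored_device:
            raise RuntimeError(
                "update on mismatched devices: %s vs %s"
                % (stored_device, merged[0])
            )
        # Assumption: the update adds the pushed value to the stored one.
        self.store[key] = (stored_device, stored_value + merged[1])


kv = ToyKVStore("gpu(0)")
kv.init(3, 2.0)

# Two values from two devices: aggregated first, then the update succeeds.
kv.push(3, [("gpu(0)", 1.0), ("gpu(1)", 1.0)])
print(kv.store[3])  # ('gpu(0)', 4.0)

# One value from another device: no aggregation, so the update fails.
try:
    kv.push(3, [("cpu(0)", 1.0)])
except RuntimeError as err:
    print("update failed:", err)
```

In this model, a single-element push only succeeds when the value already sits on the same device as the stored one, which mirrors the advice that users should keep the NDArrays on the same device.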