This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon Trainer does not handle non-deterministic parameter order for distributed training #17056

Closed
eric-haibin-lin opened this issue Dec 13, 2019 · 0 comments · Fixed by #17068

Comments

@eric-haibin-lin (Member)

Description

Currently, Gluon Trainer iterates over the parameter dict and assigns indices for multi-machine training. Each index is used to identify a gradient/parameter pair. This relies on a deterministic order of param dict iteration and a deterministic order of parameter creation. However, that may not hold if the user's code defines parameters in a random order (e.g. https://github.com/dmlc/gluon-nlp/blob/v0.9.x/src/gluonnlp/model/attention_cell.py#L223)
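A minimal sketch of the failure mode, in plain Python (no MXNet required). The parameter names and the `assign_indices` helper are hypothetical illustrations, not Trainer internals; they only mimic assigning indices by enumerating a parameter dict:

```python
def assign_indices(param_names):
    # Mimics a trainer iterating over its parameter dict and assigning
    # index i to the i-th parameter it encounters; the index is later
    # used to identify the gradient/parameter in the kvstore.
    return {name: i for i, name in enumerate(param_names)}

# Worker A's user code happens to create parameters in one order...
worker_a = assign_indices(["dense0_weight", "dense0_bias", "attn_query_weight"])

# ...while worker B creates the same parameters in a different order
# (e.g. lazy or shuffled parameter creation, as in the linked code).
worker_b = assign_indices(["attn_query_weight", "dense0_weight", "dense0_bias"])

# The same parameter now maps to different indices on different workers,
# so a gradient pushed under index 0 refers to different tensors.
print(worker_a["dense0_weight"], worker_b["dense0_weight"])  # 0 1
```

Because both workers push and pull by index, this mismatch silently averages gradients of unrelated parameters rather than raising an error.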
