This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon Trainer does not handle non-deterministic parameter order for distributed training #17056

Closed
eric-haibin-lin opened this issue Dec 13, 2019 · 0 comments · Fixed by #17068

Comments

@eric-haibin-lin (Member)

Description

Currently, Gluon Trainer iterates over the parameter dict and assigns indices for multi-machine training. Each index is used to identify a gradient/parameter pair. This relies on a deterministic order of param dict iteration and a deterministic order of parameter creation. However, that may not hold if the user's code defines parameters in a random order (e.g. https://github.com/dmlc/gluon-nlp/blob/v0.9.x/src/gluonnlp/model/attention_cell.py#L223)
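A minimal sketch of the failure mode, in plain Python (no MXNet required). The parameter names and the `assign_indices` helper are hypothetical illustrations, not Trainer internals; they only mimic assigning indices by enumerating a parameter dict:

```python
def assign_indices(param_names):
    # Mimics a trainer iterating over its parameter dict and assigning
    # index i to the i-th parameter it encounters; the index is later
    # used to identify the gradient/parameter in the kvstore.
    return {name: i for i, name in enumerate(param_names)}

# Worker A's user code happens to create parameters in one order...
worker_a = assign_indices(["dense0_weight", "dense0_bias", "attn_query_weight"])

# ...while worker B creates the same parameters in a different order
# (e.g. lazy or shuffled parameter creation, as in the linked code).
worker_b = assign_indices(["attn_query_weight", "dense0_weight", "dense0_bias"])

# The same parameter now maps to different indices on different workers,
# so a gradient pushed under index 0 refers to different tensors.
print(worker_a["dense0_weight"], worker_b["dense0_weight"])  # 0 1
```

Because both workers push and pull by index, this mismatch silently averages gradients of unrelated parameters rather than raising an error.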
