distributed kvstore bug in MXNet #12713

eric-haibin-lin · 2018-10-01T23:16:49Z

I'm using distributed kvstore with the Gluon trainer. I found the following two bugs:

1. Initializing trainer = gluon.Trainer(update_on_kvstore=True) doesn't work. Inspecting trainer._update_on_kvstore shows that the value is still set to False.
2. When distributed kvstore is used, by default gluon.Trainer doesn't work with mx.optimizer.LRScheduler if a worker has more than 1 GPU. To be more specific, the trainer updates once per GPU, and the LRScheduler object is shared across GPUs, so it gets a wrong update count.

This means one cannot train imagenet classification using resnet with gluon trainer.
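
The update-count problem in the second bug can be illustrated without MXNet. The following is a minimal pure-Python sketch, not the actual Trainer code: `StepDecay`, `SharedOptimizer`, and `run` are hypothetical names, and the only assumption taken from the report is that one shared counter is advanced once per GPU in every training iteration.

```python
class StepDecay:
    """Hypothetical scheduler: scale base_lr by `factor` every `step` updates."""

    def __init__(self, base_lr, step, factor):
        self.base_lr = base_lr
        self.step = step
        self.factor = factor

    def __call__(self, num_update):
        return self.base_lr * self.factor ** (num_update // self.step)


class SharedOptimizer:
    """Hypothetical optimizer that counts every update call it receives."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.num_update = 0

    def update(self):
        # The counter is shared, so each per-GPU call advances the schedule.
        self.num_update += 1
        return self.scheduler(self.num_update)


def run(num_gpus, iterations):
    """Simulate training; return (final update count, final learning rate)."""
    opt = SharedOptimizer(StepDecay(base_lr=0.1, step=100, factor=0.5))
    lr = opt.scheduler.base_lr
    for _ in range(iterations):
        for _ in range(num_gpus):  # one update per GPU, as described above
            lr = opt.update()
    return opt.num_update, lr


single = run(num_gpus=1, iterations=100)
multi = run(num_gpus=4, iterations=100)
print("1 GPU :", single)
print("4 GPUs:", multi)
```

In this sketch the 4-GPU run reaches an update count of 400 after 100 iterations, so the learning rate decays four times instead of once.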


vrakesh · 2018-10-01T23:23:20Z

@eric-haibin-lin Thank you for reporting the issue.
@mxnet-label-bot [Bug, Gluon]

ragavvenkatesan · 2018-10-02T16:16:12Z

+1

sandeep-krishnamurthy · 2018-10-08T22:48:45Z

Working on this. Will update my findings.

lupesko · 2018-11-05T07:02:34Z

@sandeep-krishnamurthy is this issue fixed by #12786?
If so, please close the issue or comment.

sandeep-krishnamurthy · 2018-11-05T07:04:03Z

> Initializing trainer = gluon.Trainer(update_on_kvstore=True) doesn't work. Inspecting trainer._update_on_kvstore shows that the value is still set to False.

This is fixed.

> When distributed kvstore is used, by default gluon.Trainer doesn't work with mx.optimizer.LRScheduler if a worker has more than 1 GPU. To be more specific, the trainer updates once per GPU, the LRScheduler object is shared across GPUs and get a wrong update count.

This needs to be fixed.

lebeg · 2018-11-12T21:29:12Z

As mentioned in #12786, the fix for the first problem has issues on the v1.3.x release branch.

marcoabreu added Bug Gluon labels Oct 1, 2018

sandeep-krishnamurthy mentioned this issue Oct 10, 2018

Set correct update on kvstore flag in dist_device_sync mode #12786 (Merged)

ptrendx mentioned this issue Mar 9, 2019

Correct update count with Gluon trainer and update_on_kvstore=False #14377 (Merged)

eric-haibin-lin closed this as completed in #14377 Mar 17, 2019
