distributed kvstore bug in MXNet #12713
I'm using a distributed kvstore with the Gluon trainer. I found the following two bugs:

1. Initializing `trainer = gluon.Trainer(update_on_kvstore=True)` doesn't work. Inspecting `trainer._update_on_kvstore` shows that the value is still set to `False`.
2. When a distributed kvstore is used, by default `gluon.Trainer` doesn't work with `mx.optimizer.LRScheduler` if a worker has more than 1 GPU. To be more specific, the trainer updates once per GPU, so the `LRScheduler` object shared across the GPUs gets a wrong update count. Minimal sketches of both bugs follow this list.

This means one cannot train ImageNet classification using ResNet with the Gluon trainer.
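A minimal sketch of the first bug. It assumes the script runs as a worker in a launched distributed job (e.g. via `tools/launch.py`) so that the `dist_sync` kvstore is available; the toy `Dense` network and input shapes are placeholders, not from the original report:

```python
import mxnet as mx
from mxnet import autograd, gluon

# Toy single-layer network; the real report concerns ImageNet-scale training.
net = gluon.nn.Dense(1)
net.initialize()

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1},
                        kvstore='dist_sync',     # requires a launched distributed job
                        update_on_kvstore=True)  # explicitly request updates on the kvstore

x = mx.nd.ones((2, 4))
with autograd.record():
    loss = net(x).sum()
loss.backward()
trainer.step(batch_size=2)  # the kvstore is set up lazily on the first step

# Expected True, since we asked for it; per this report it is still False.
print(trainer._update_on_kvstore)
```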
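And a sketch of the second bug that mimics the multi-GPU case on a single machine. The two CPU contexts standing in for two GPUs, the `FactorScheduler`, and the toy network are illustrative assumptions:

```python
import mxnet as mx
from mxnet import autograd, gluon

# Two CPU contexts stand in for two GPUs on one worker.
ctxs = [mx.cpu(0), mx.cpu(1)]

# Example scheduler: halve the learning rate after every update.
sched = mx.lr_scheduler.FactorScheduler(step=1, factor=0.5)

net = gluon.nn.Dense(1)
net.initialize(ctx=ctxs)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 1.0, 'lr_scheduler': sched},
                        update_on_kvstore=False)  # what the distributed setup forces

data = mx.nd.ones((2, 4))
with autograd.record():
    losses = [net(data.as_in_context(c)).sum() for c in ctxs]
for l in losses:
    l.backward()
trainer.step(batch_size=4)  # one logical step over both devices

# One call to step(), but the shared optimizer counted one update per device:
print(trainer._optimizer.num_update)  # prints 2 with two contexts, not 1
```

The per-device updaters share a single optimizer instance, so `num_update` advances once per device rather than once per batch, which is what throws the scheduler off.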
Comments

@eric-haibin-lin Thank you for reporting the issue.
+1
Working on this. Will update my findings.
@sandeep-krishnamurthy is this issue fixed with #12786?
This is fixed.
This needs to be fixed.
As mentioned in #12786, the fix for the 1st problem has issues on the v1.3.x release branch.