learning rate issue in LRScheduler #750

Closed
ljk628 opened this issue Apr 21, 2019 · 10 comments

ljk628 commented Apr 21, 2019

There seems to be a bug in LRScheduler's initial learning rate assignment.

If we replace the line at https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py#L141 with a simple step-based scheduler

lr_scheduler = LRScheduler(opt.lr_mode, base_lr=opt.lr, target_lr=0,
                           nepochs=opt.num_epochs - opt.warmup_epochs,
                           iters_per_epoch=num_batches,
                           step_epoch=lr_decay_epoch,
                           step_factor=lr_decay, power=2)

and add print(trainer.learning_rate) after https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py#L350, it prints 0.01 rather than the true value of opt.lr.

This is because when the optimizer is initialized in the trainer, it is given the default learning_rate=0.01 even though lr_scheduler already has a base_lr; see https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/optimizer/optimizer.py#L102.

One way to fix it is to pass learning_rate explicitly in https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py#L166, i.e.,

optimizer_params = {'wd': opt.wd, 'momentum': opt.momentum, 'lr_scheduler': lr_scheduler, 'learning_rate': opt.lr}

However, this is not elegant: opt.lr is already provided to the lr_scheduler and should not have to be provided again.
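
A slightly cleaner variant of the same workaround (a sketch, assuming the scheduler keeps the configured rate in a base_lr attribute) is to read the value back from the scheduler, so the rate is defined in one place:

# Sketch: reuse the scheduler's own base_lr instead of repeating opt.lr.
optimizer_params = {'wd': opt.wd, 'momentum': opt.momentum,
                    'lr_scheduler': lr_scheduler,
                    'learning_rate': lr_scheduler.base_lr}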

@zhreshold (Member)

@hetong007

@hetong007 (Member)

@eric-haibin-lin do you see it as a common problem for the lr scheduler in mxnet as well?

@eric-haibin-lin (Member)

@ljk628 sorry for the late response. Could you provide a simple script to reproduce the issue?
The optimizer is supposed to query the lr_scheduler when lr_scheduler is set. See https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/optimizer/optimizer.py#L190-L195

The learning rate looks fine if I try the following example:

>>> import mxnet as mx
>>> import gluoncv
>>> lr_scheduler = gluoncv.utils.LRScheduler('step', baselr=10, targetlr=0, niters=10, nepochs=2)
>>> optim = mx.optimizer.create('sgd', lr_scheduler=lr_scheduler)
>>> optim.learning_rate
10
>>> mx.__version__
'1.4.0'
>>> gluoncv.__version__
'0.3.0'

Thanks!
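
For reference, the resolution in the lines linked above looks roughly like this (a paraphrase, not the verbatim mxnet source):

# Paraphrase of the Optimizer.learning_rate property:
@property
def learning_rate(self):
    if self.lr_scheduler is not None:
        return self.lr_scheduler(self.num_update)  # scheduler wins when present
    return self.lr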

ljk628 (Author) commented May 7, 2019

Hi @eric-haibin-lin, I think the problem comes with gluoncv 0.4.0; here are the commands to reproduce the bug:

>>> import mxnet as mx
>>> import gluoncv
>>> lr_scheduler = gluoncv.utils.LRScheduler('step', base_lr=10, target_lr=0, niters=10, nepochs=2, step_epoch=[1, 2])
>>> optim = mx.optimizer.create('sgd', lr_scheduler=lr_scheduler)
>>> optim.learning_rate
0.01
>>> mx.__version__
'1.4.0'
>>> gluoncv.__version__
'0.4.0'

I am using EC2 Deep Learning AMI 22.0 and the pre-installed MXNet (source activate mxnet_p27 or source activate mxnet_p36).

The script you provided produces an AssertionError with gluoncv 0.4:

File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/gluoncv/utils/lr_scheduler.py", line 92, in __init__
    assert(step_iter is not None or step_epoch is not None)

As described, the issue is that base_lr is already set in the lr_scheduler, and the optimizer is initialized with that lr_scheduler. However, the optimizer has a default learning_rate parameter (0.01), so if we don't explicitly set learning_rate on the optimizer, it uses the default value (0.01).

Ideally, base_lr should be set once when the lr_scheduler is created and not again when the optimizer is initialized. This issue also exists in the current ImageNet finetuning script: https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py#L166
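
The side effect is also visible on the scheduler object itself (a sketch; the printed values assume the gluoncv 0.4 / mxnet 1.4 behavior described above):

>>> lr_scheduler = gluoncv.utils.LRScheduler('step', base_lr=10, target_lr=0, niters=10, nepochs=2, step_epoch=[1, 2])
>>> lr_scheduler.base_lr  # as configured
10
>>> optim = mx.optimizer.create('sgd', lr_scheduler=lr_scheduler)
>>> lr_scheduler.base_lr  # silently overwritten by the optimizer's default
0.01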

@chenliu0831

I can also confirm the issue on MXNet 1.5.0 and GluonCV 0.5.0:

import mxnet as mx 
import gluoncv 
lr_scheduler = gluoncv.utils.LRScheduler('step', base_lr=10, target_lr=0, niters=10, nepochs=2, step_epoch=[1, 2]) 
optim = mx.optimizer.create('sgd', lr_scheduler=lr_scheduler)

print(optim.learning_rate) # 0.01

@zhreshold (Member)

@hetong007 Sounds like a bug in the new LRScheduler?

@hetong007 (Member)

It is from these two lines: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/optimizer/optimizer.py#L106-L107

The optimizer overwrites base_lr in the lr_scheduler without checking anything. The easiest fix would be to work around it in the script. A more systematic way is to avoid that behavior in the optimizer implementation, but that may be an API breaker.
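
The linked lines amount to roughly the following (a paraphrase, not the verbatim mxnet source):

self.lr_scheduler = lr_scheduler
if lr_scheduler is not None:
    # The scheduler's configured base_lr is unconditionally replaced by the
    # optimizer's learning_rate, which defaults to 0.01 when not passed.
    self.lr_scheduler.base_lr = learning_rate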

@zhreshold @eric-haibin-lin any better idea?

@zhreshold (Member)

I think this is awful; at the very least we need a warning for it.

chenliu0831 commented Oct 15, 2019

@hetong007 How about adding something to mx.optimizer that respects the existing learning rate from lr_scheduler and falls back to the learning_rate provided through the optimizer interface? The current override behavior is probably quite surprising to the many users who provide an lr to the scheduler object. I also wonder whether there are actual users relying on the current overwrite, which would make this a breaking change.
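
A sketch of that precedence (a hypothetical helper; the names and default are illustrative, not mxnet code):

import warnings

def resolve_base_lr(learning_rate, lr_scheduler, default_lr=0.01):
    # Pick the effective initial LR for an optimizer + scheduler pair.
    if lr_scheduler is None:
        return default_lr if learning_rate is None else learning_rate
    if learning_rate is None:
        # Caller did not set learning_rate: respect the scheduler's base_lr.
        return lr_scheduler.base_lr
    if learning_rate != lr_scheduler.base_lr:
        warnings.warn('learning_rate=%s differs from lr_scheduler.base_lr=%s; '
                      'using the explicit learning_rate'
                      % (learning_rate, lr_scheduler.base_lr))
    return learning_rate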

@hetong007 (Member)

It has been fixed in mxnet with apache/mxnet#16487.
