`cfg.resume_from` doesn't work as expected #292
Comments
What's the version of mmcv, mmdet and mmocr?
Not sure if the recent hook priority change is related.
```python
# Check PyTorch installation
import torch, torchvision
print(torch.__version__, torch.cuda.is_available())

# Check MMDetection installation
import mmdet
print(mmdet.__version__)

# Check mmcv installation
import mmcv
from mmcv.ops import get_compiling_cuda_version, get_compiler_version
print(mmcv.__version__)
print(get_compiling_cuda_version())
print(get_compiler_version())

# Check mmocr installation
import mmocr
print(mmocr.__version__)
```

Output:
On problem 1 (i.e. the learning rate is not correctly restored), I've looked into it more carefully. However, problem 2 still persists: all checkpoints computed during the resumed training don't have correct values for `epoch` and `iter`.
I've also investigated problem 2 further and found the cause in MMCV's checkpoint saving. In particular, one dict update should be done before the other: in a resumed training setup, the runner's meta still holds the `epoch`/`iter` loaded from the old checkpoint, so merging it after the fresh values overwrites them. I'll open a PR in MMCV ASAP (hopefully tomorrow) to fix this behaviour.
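To make the ordering issue concrete, here is a minimal sketch, with illustrative names rather than MMCV's actual code, of how doing the two `dict.update` calls in the wrong order lets the stale `epoch`/`iter` loaded at resume time clobber the fresh values:

```python
# Minimal, self-contained sketch of the update-order bug (illustrative names,
# not MMCV's actual code). After resuming, `runner_meta` still holds the
# epoch/iter read from the old checkpoint file.

def buggy_meta(runner_meta, epoch, it):
    meta = dict(epoch=epoch + 1, iter=it)  # fresh values written first...
    meta.update(runner_meta)               # ...then clobbered by the stale ones
    return meta

def fixed_meta(runner_meta, epoch, it):
    meta = dict(runner_meta)               # stale resumed meta merged first...
    meta.update(epoch=epoch + 1, iter=it)  # ...then overwritten by fresh values
    return meta

resumed = {'epoch': 50, 'iter': 12500}     # what epoch_50.pth carried
print(buggy_meta(resumed, 50, 12750))      # {'epoch': 50, 'iter': 12500} -> stale
print(fixed_meta(resumed, 50, 12750))      # {'epoch': 51, 'iter': 12750} -> correct
```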
I'm gonna close this issue because it has been solved in open-mmlab/mmcv#1108
**Describe the bug**

With the `tools/train.py` script, I've trained `DBNet_r50dcn` for 50 epochs on my own dataset with SGD (default params here). Below is the main information on the `epoch_50.pth` checkpoint:

After that, I've run `tools/train.py` with `cfg.resume_from = 'epoch_50.pth'`, but here I've noted 2 problems:

1. **The learning rate is not correctly restored.** Instead of `0.00020712311161269083`, the learning rate on epoch 51 is `6.725e-03` (which is actually not even the default value `7e-03` defined here).
2. **The checkpoints created during this resumed training** (e.g. `epoch_51.pth`, `epoch_52.pth`, `epoch_53.pth`) **still have `epoch` and `iter` equal to the values in `epoch_50.pth`.** For example, below is the main information on the `epoch_51.pth` checkpoint: and on the `epoch_52.pth` checkpoint:

Clearly, this turns out to be a problem if one wants to resume the training another time from one of these checkpoints.