fix eval hook save wrong best checkpoint bug #5341

shenmishajing · 2021-06-12T03:22:09Z

Motivation

When we enable save best in eval hook, it will link last ckpt as best ckpt. Related code as follow:
/mmdetection/mmdet/core/evaluation/eval_hooks.py

    def save_best_checkpoint(self, runner, key_score):
        best_score = runner.meta['hook_msgs'].get(
            'best_score', self.init_value_map[self.rule])
        if self.compare_func(key_score, best_score) and 'last_ckpt' in runner.meta['hook_msgs']:
            best_score = key_score
            runner.meta['hook_msgs']['best_score'] = best_score
            last_ckpt = runner.meta['hook_msgs']['last_ckpt']
            runner.meta['hook_msgs']['best_ckpt'] = last_ckpt
            mmcv.symlink(
                last_ckpt,
                osp.join(runner.work_dir, f'best_{self.key_indicator}.pth'))
            time_stamp = runner.epoch + 1 if self.by_epoch else runner.iter + 1
            self.logger.info(f'Now best checkpoint is epoch_{time_stamp}.pth.'
                             f'Best {self.key_indicator} is {best_score:0.4f}')

But, when we add checkpoint hook to runner, we use priority 70. We add eval hook use priority 50. Related code:
/mmcv/runner/base_runner.py

    def register_checkpoint_hook(self, checkpoint_config):
        if checkpoint_config is None:
            return
        if isinstance(checkpoint_config, dict):
            checkpoint_config.setdefault('type', 'CheckpointHook')
            hook = mmcv.build_from_cfg(checkpoint_config, HOOKS)
        else:
            hook = checkpoint_config
        self.register_hook(hook, priority=70)

    def register_hook(self, hook, priority='NORMAL'):
        """Register a hook into the hook list.

        The hook will be inserted into a priority queue, with the specified
        priority (See :class:`Priority` for details of priorities).
        For hooks with the same priority, they will be triggered in the same
        order as they are registered.

        Args:
            hook (:obj:`Hook`): The hook to be registered.
            priority (int or str or :obj:`Priority`): Hook priority.
                Lower value means higher priority.
        """
        assert isinstance(hook, Hook)
        if hasattr(hook, 'priority'):
            raise ValueError('"priority" is a reserved attribute for hooks')
        priority = get_priority(priority)
        hook.priority = priority
        # insert the hook to a sorted list
        inserted = False
        for i in range(len(self._hooks) - 1, -1, -1):
            if priority >= self._hooks[i].priority:
                self._hooks.insert(i + 1, hook)
                inserted = True
                break
        if not inserted:
            self._hooks.insert(0, hook)

class Priority(Enum):
    """Hook priority levels.

    +------------+------------+
    | Level      | Value      |
    +============+============+
    | HIGHEST    | 0          |
    +------------+------------+
    | VERY_HIGH  | 10         |
    +------------+------------+
    | HIGH       | 30         |
    +------------+------------+
    | NORMAL     | 50         |
    +------------+------------+
    | LOW        | 70         |
    +------------+------------+
    | VERY_LOW   | 90         |
    +------------+------------+
    | LOWEST     | 100        |
    +------------+------------+
    """

    HIGHEST = 0
    VERY_HIGH = 10
    HIGH = 30
    NORMAL = 50
    LOW = 70
    VERY_LOW = 90
    LOWEST = 100

/mmdetection/mmdet/api/train.py

        runner.register_hook(eval_hook(val_dataloader, **eval_cfg))

So, eval hook with a higher pirority than checkpoint hook, and it will be called before checkpoint hook. Therefore, it will save the last epoch ckpt instead of cur epoch ckpt when save best ckpt. Example, cur epoch is 9 epoch, and eval hook will test model with 9 epoch's weight but link 8 epoch's ckpt as best ckpt.

Modification

change
/mmdetection/mmdet/api/train.py

        runner.register_hook(eval_hook(val_dataloader, **eval_cfg))

to
/mmdetection/mmdet/api/train.py

        runner.register_hook(eval_hook(val_dataloader, **eval_cfg), priority = 80)

CLAassistant · 2021-06-12T03:22:13Z

All committers have signed the CLA.

hhaAndroid · 2021-06-15T01:05:19Z

Hi @shenmishajing Thanks for the contribution. At present, eval_hookdirectly uses the code in mmcv. Can you consider changing to the latest version and testing it?

mmdet/apis/train.py

hhaAndroid · 2021-06-30T03:26:57Z

Since this problem is fixed by open-mmlab/mmcv#1120, it can be turned off.

fix eval hook save wrong best checkpoint bug

d346b7b

hhaAndroid assigned xvjiarui Jun 15, 2021

ZwwWayne requested a review from xvjiarui June 15, 2021 07:31

ZwwWayne reviewed Jun 22, 2021

View reviewed changes

mmdet/apis/train.py Show resolved Hide resolved

hhaAndroid closed this Jun 30, 2021

shenmishajing deleted the fix_eval_hook_save_wrong_best_ckpt_bug branch August 25, 2021 07:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix eval hook save wrong best checkpoint bug #5341

fix eval hook save wrong best checkpoint bug #5341

shenmishajing commented Jun 12, 2021

CLAassistant commented Jun 12, 2021 •

edited

Loading

hhaAndroid commented Jun 15, 2021

hhaAndroid commented Jun 30, 2021

fix eval hook save wrong best checkpoint bug #5341

fix eval hook save wrong best checkpoint bug #5341

Conversation

shenmishajing commented Jun 12, 2021

Motivation

Modification

CLAassistant commented Jun 12, 2021 • edited Loading

hhaAndroid commented Jun 15, 2021

hhaAndroid commented Jun 30, 2021

CLAassistant commented Jun 12, 2021 •

edited

Loading