
Tracking checkpoint name when train_loss is in the checkpoint name during ddp mode #6775

Closed
liyucheng09 opened this issue Apr 1, 2021 · 4 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)


liyucheng09 commented Apr 1, 2021

🐛 Bug

I wrote a custom callback to handle checkpoint saving, which saves a checkpoint every n train steps but only keeps the top_k checkpoints with the lowest train_loss.

I want to use self.history to track each saved checkpoint path and its corresponding train loss. However, it turns out that the rank_0 process cannot find the checkpoint file when it calls self._del_model(ckpt_path). I found that the checkpoint that actually gets saved might not be the one recorded in rank_0's self.history.

So I'd like to ask how to find the path of the checkpoint that is actually saved. Any reply is highly appreciated.

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post here

import logging
import os

from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only

log = logging.getLogger(__name__)


class SaveCallback(Callback):

    def __init__(self, save_path, save_steps=1000, save_top_k=0):
        super().__init__()
        self.save_path = save_path
        self.save_steps = save_steps
        self.save_top_k = save_top_k
        self.history = []  # entries: [epoch, step, loss, ckpt_path]

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        if pl_module.global_step != 0 and pl_module.global_step % self.save_steps == 0:
            epoch = pl_module.current_epoch
            step = pl_module.global_step
            # callback_metrics are per-process, so this value can differ across DDP ranks
            loss = trainer.callback_metrics["train_loss"].detach().item()

            if self.save_top_k and len(self.history) == self.save_top_k:
                # top-k is full: replace the worst checkpoint only if the new loss is lower
                self.history.sort(key=lambda x: x[2])
                last_max_loss = self.history[-1][2]
                if last_max_loss > loss:
                    path_to_remove = self.history.pop()[-1]  # drop the worst entry from history as well
                    self._del_model(path_to_remove)

                    ckpt_name = f'epoch-{epoch}--step-{step}--train_loss-{loss:.2f}.ckpt'
                    ckpt_path = os.path.join(self.save_path, ckpt_name)
                    trainer.save_checkpoint(ckpt_path)
                    self.history.append([epoch, step, loss, ckpt_path])
            else:
                ckpt_name = f'epoch-{epoch}--step-{step}--train_loss-{loss:.2f}.ckpt'
                ckpt_path = os.path.join(self.save_path, ckpt_name)
                trainer.save_checkpoint(ckpt_path)
                self.history.append([epoch, step, loss, ckpt_path])

    @rank_zero_only
    def _del_model(self, path):
        if os.path.exists(path):
            os.remove(path)
            log.debug(f'removed checkpoint: {path}.')

My Lightning version is 1.2.6.
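
For completeness, a minimal sketch of how the callback above might be attached to a Trainer running ddp (Lightning 1.2-style arguments; the save path, GPU count, and top_k value are placeholders):

from pytorch_lightning import Trainer

# `model` is assumed to be a LightningModule that logs "train_loss"
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    callbacks=[SaveCallback(save_path="checkpoints/", save_steps=1000, save_top_k=3)],
)
trainer.fit(model)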

liyucheng09 added the bug and help wanted labels on Apr 1, 2021
@ananthsub (Contributor) commented

The model checkpoint in Lightning now supports this after #6146.
This is available on master:
https://github.com/PyTorchLightning/pytorch-lightning/blob/13f67ad3132b67d4a6fce429731f4a4cd7eb00cc/pytorch_lightning/callbacks/model_checkpoint.py#L96-L99

You could either see if using the master branch with this setting works, or replicate that logic in your custom callback.
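
For illustration, a hedged sketch of what that built-in setting could look like on master, assuming every_n_train_steps is the option added around #6146 (values are placeholders, and "train_loss" must be logged by the LightningModule):

from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="epoch-{epoch}--step-{step}--train_loss-{train_loss:.2f}",
    monitor="train_loss",
    mode="min",
    save_top_k=3,
    every_n_train_steps=1000,
)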

@liyucheng09 (Author) commented

@ananthsub thanks for your answer and for your contribution of the new ModelCheckpoint class. It really helps. But to better understand distributed training, I couldn't figure out how the new callback achieves the expected behavior when the training mode is ddp.

I mean, if the train_loss is in the checkpoint's path (and train_loss differs across processes), the new callback needs to know the train_loss of the checkpoint that is actually saved (it might be the train_loss of any process).

How does the new callback do this?

@ananthsub (Contributor) commented

Do you see the same bug if train_loss is not in the filename?

@liyucheng09 (Author) commented

@ananthsub sorry for the late reply. The ModelCheckpoint in the master branch works fine when I add {train_loss} to the file_name and enable save_top_k. So I am just curious how rank zero determines which train_loss is saved (in other words, which process's file_name, i.e. rank 1, 2, or 3, is saved by the main process, i.e. rank zero).
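
For context, one common pattern in DDP (a generic torch.distributed sketch, not a claim about Lightning's internals) is to broadcast the value decided on rank 0 so that every process agrees on a single checkpoint path:

import torch.distributed as dist

def broadcast_from_rank0(obj):
    # Assumes the default process group is already initialized, as it is under DDP.
    container = [obj]
    dist.broadcast_object_list(container, src=0)
    return container[0]  # every rank now holds rank 0's value

# e.g. every rank could end up using the checkpoint path built from rank 0's train_loss:
# ckpt_path = broadcast_from_rank0(ckpt_path)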
