
Tracking checkpoint name when train_loss is in the checkpoint name during ddp mode #6775

Closed
liyucheng09 opened this issue Apr 1, 2021 · 4 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)


liyucheng09 commented Apr 1, 2021

🐛 Bug

I wrote a custom callback to handle checkpoint saving, which saves a checkpoint every n train steps but only keeps the top_k checkpoints with the lowest train_loss.

I want to use self.history to track each saved checkpoint path and its corresponding train loss. However, it turns out that the rank_0 process cannot find the checkpoint file when it calls self._del_model(ckpt_path). I found that the checkpoint that actually gets saved might not be the one recorded in rank_0's self.history.

So I'd like to ask how to find the path of the checkpoint that is actually saved. Any reply is highly appreciated.

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post here

import logging
import os

from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only

log = logging.getLogger(__name__)


class SaveCallback(Callback):

    def __init__(self, save_path, save_steps=1000, save_top_k=0):
        super().__init__()
        self.save_path = save_path
        self.save_steps = save_steps
        self.save_top_k = save_top_k
        self.history = []  # entries: [epoch, step, loss, ckpt_path]

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        if pl_module.global_step != 0 and pl_module.global_step % self.save_steps == 0:
            epoch = pl_module.current_epoch
            step = pl_module.global_step
            # callback_metrics are per-process, so this value can differ across DDP ranks
            loss = trainer.callback_metrics["train_loss"].detach().item()

            if self.save_top_k and len(self.history) == self.save_top_k:
                # top-k is full: replace the worst checkpoint only if the new loss is lower
                self.history.sort(key=lambda x: x[2])
                last_max_loss = self.history[-1][2]
                if last_max_loss > loss:
                    path_to_remove = self.history.pop()[-1]  # drop the worst entry from history as well
                    self._del_model(path_to_remove)

                    ckpt_name = f'epoch-{epoch}--step-{step}--train_loss-{loss:.2f}.ckpt'
                    ckpt_path = os.path.join(self.save_path, ckpt_name)
                    trainer.save_checkpoint(ckpt_path)
                    self.history.append([epoch, step, loss, ckpt_path])
            else:
                ckpt_name = f'epoch-{epoch}--step-{step}--train_loss-{loss:.2f}.ckpt'
                ckpt_path = os.path.join(self.save_path, ckpt_name)
                trainer.save_checkpoint(ckpt_path)
                self.history.append([epoch, step, loss, ckpt_path])

    @rank_zero_only
    def _del_model(self, path):
        if os.path.exists(path):
            os.remove(path)
            log.debug(f'removed checkpoint: {path}.')

My Lightning version is 1.2.6.
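
For completeness, a minimal sketch of how the callback above might be attached to a Trainer running ddp (Lightning 1.2-style arguments; the save path, GPU count, and top_k value are placeholders):

from pytorch_lightning import Trainer

# `model` is assumed to be a LightningModule that logs "train_loss"
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    callbacks=[SaveCallback(save_path="checkpoints/", save_steps=1000, save_top_k=3)],
)
trainer.fit(model)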

liyucheng09 added the bug and help wanted labels on Apr 1, 2021
@ananthsub (Contributor) commented

The model checkpoint in Lightning now supports this after #6146.
This is available on master:
https://github.com/PyTorchLightning/pytorch-lightning/blob/13f67ad3132b67d4a6fce429731f4a4cd7eb00cc/pytorch_lightning/callbacks/model_checkpoint.py#L96-L99

You could either see if using the master branch with this setting works, or replicate that logic in your custom callback.
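
For illustration, a hedged sketch of what that built-in setting could look like on master, assuming every_n_train_steps is the option added around #6146 (values are placeholders, and "train_loss" must be logged by the LightningModule):

from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="epoch-{epoch}--step-{step}--train_loss-{train_loss:.2f}",
    monitor="train_loss",
    mode="min",
    save_top_k=3,
    every_n_train_steps=1000,
)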

@liyucheng09 (Author) commented

@ananthsub thanks for your answer and for your contribution of the new ModelCheckpoint class. It really helps. But to better understand distributed training, I couldn't figure out how the new callback achieves the expected behavior when the training mode is ddp.

I mean, if the train_loss is in the checkpoint's path (and train_loss differs across processes), the new callback needs to know the train_loss of the checkpoint that is actually saved (it might be the train_loss of any process).

How does the new callback do this?

@ananthsub (Contributor) commented

Do you see the same bug if train_loss is not in the filename?

@liyucheng09 (Author) commented

@ananthsub sorry for the late reply. The ModelCheckpoint in the master branch works fine when I add {train_loss} to the file_name and enable save_top_k. So I am just curious how rank zero determines which train_loss is saved (in other words, which process's file_name, i.e. rank 1, 2, or 3, is saved by the main process, i.e. rank zero).
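
For context, one common pattern in DDP (a generic torch.distributed sketch, not a claim about Lightning's internals) is to broadcast the value decided on rank 0 so that every process agrees on a single checkpoint path:

import torch.distributed as dist

def broadcast_from_rank0(obj):
    # Assumes the default process group is already initialized, as it is under DDP.
    container = [obj]
    dist.broadcast_object_list(container, src=0)
    return container[0]  # every rank now holds rank 0's value

# e.g. every rank could end up using the checkpoint path built from rank 0's train_loss:
# ckpt_path = broadcast_from_rank0(ckpt_path)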
