I wrote a custom callback for checkpoint saving: it saves every n training steps but keeps only the top_k checkpoints with the lowest train_loss.
I use self.history to track each saved ckpt path and its corresponding train loss. However, the rank_0 process cannot find the ckpt file when it calls self._del_model(ckpt_path). It seems the checkpoint that was actually saved may not be the one recorded in rank_0's self.history.
So I'd like to ask how to find the path of the checkpoint that was actually saved. Any reply is highly appreciated.
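To make the setup concrete, here is a minimal sketch of the bookkeeping described above, written in plain Python rather than against the Lightning API. The class name TopKHistory, its methods, and the broadcast hook are all hypothetical. The assumption is that broadcast returns rank 0's value on every process (torch.distributed.broadcast_object_list could play that role), so every rank builds the same path, and that only rank 0 touches the file system.

```python
import os
import tempfile
from typing import Callable, List, Optional, Tuple


class TopKHistory:
    """Hypothetical helper: tracks (train_loss, ckpt_path) and keeps the k lowest losses."""

    def __init__(self, top_k: int, is_rank_zero: bool = True,
                 broadcast: Callable[[float], float] = lambda value: value) -> None:
        self.top_k = top_k
        self.is_rank_zero = is_rank_zero
        # Assumed hook: returns rank 0's copy of the value on every process.
        # The identity default corresponds to single-process training.
        self.broadcast = broadcast
        self.history: List[Tuple[float, str]] = []

    def next_path(self, train_loss: float, ckpt_dir: str, step: int) -> Tuple[float, str]:
        # Build the path from rank 0's loss so that all ranks agree on it.
        loss = float(self.broadcast(train_loss))
        name = f"step={step}-train_loss={loss:.4f}.ckpt"
        return loss, os.path.join(ckpt_dir, name)

    def update(self, loss: float, path: str) -> Optional[str]:
        # Record the new checkpoint, then evict the worst one if over budget.
        self.history.append((loss, path))
        self.history.sort(key=lambda item: item[0])
        if len(self.history) <= self.top_k:
            return None
        _, evicted = self.history.pop()  # entry with the highest loss
        # Only rank 0 deletes, and only a file that really exists.
        if self.is_rank_zero and os.path.exists(evicted):
            os.remove(evicted)
        return evicted


def demo() -> List[str]:
    """Simulate four saves with top_k=2 and return the files left on disk."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        tracker = TopKHistory(top_k=2)
        for step, raw_loss in enumerate([0.9, 0.5, 0.7, 0.3]):
            loss, path = tracker.next_path(raw_loss, ckpt_dir, step)
            open(path, "w").close()  # stand-in for the real checkpoint save
            tracker.update(loss, path)
        return sorted(os.listdir(ckpt_dir))
```

The point of the design is that the path is derived from a value all ranks share, so the entry in self.history on rank 0 always matches a file that rank 0 itself wrote. The os.path.exists guard is only a safety net; it does not fix a mismatched history on its own.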
@ananthsub thanks for your answer and your contributions to the new ModelCheckpoint class. It really helps. But to better understand distributed training: I couldn't figure out how the new callback gets the expected behavior when the training mode is ddp.
I mean, if train_loss is in the checkpoint's path (train_loss differs across processes), the new callback needs to know the train_loss of the checkpoint that was actually saved (it might be the train_loss of any process).
@ananthsub sorry for the late reply. The ModelCheckpoint in the master branch works fine when I add {train_loss} to the file_name and enable save_top_k. So I am just curious how rank zero determines which train_loss is saved (in other words, which process's file_name, e.g. that of rank 1, 2, or 3, is saved by the main process, i.e. rank zero).
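For anyone following along, a small simulation of the situation being asked about. It is only a sketch and does not call Lightning: format_name, per_rank_names, and synced_name are hypothetical helpers, and the loss values are made up. It reflects my understanding, which this thread does not confirm, that in ddp only global rank 0 writes the file, so the name on disk comes from the metrics rank 0 sees unless the metric is reduced across processes first (for example by logging it with sync_dist=True).

```python
from typing import Dict, List


def format_name(template: str, metrics: Dict[str, float]) -> str:
    """Hypothetical stand-in for filename formatting: fill the template, add .ckpt."""
    return template.format(**metrics) + ".ckpt"


def per_rank_names(template: str, step: int, losses: List[float]) -> List[str]:
    """The name each rank would compute from its own local train_loss."""
    return [format_name(template, {"step": step, "train_loss": loss}) for loss in losses]


def synced_name(template: str, step: int, losses: List[float]) -> str:
    """The name every rank would compute if train_loss were mean-reduced first."""
    mean_loss = sum(losses) / len(losses)
    return format_name(template, {"step": step, "train_loss": mean_loss})


# Made-up local losses for ranks 0..3 at the same step.
local_losses = [0.52, 0.48, 0.61, 0.55]
template = "{step}-{train_loss:.2f}"

names = per_rank_names(template, 100, local_losses)
on_disk = names[0]  # assumption: only rank 0 writes, using its own metrics
shared = synced_name(template, 100, local_losses)
```

Without a reduction, the four ranks compute four different names and only rank 0's matches a real file. With the reduction, every rank computes the same name, so any per-rank bookkeeping stays consistent with what is on disk.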
🐛 Bug
Please reproduce using the BoringModel
To Reproduce
Use following BoringModel and post here
My Lightning version is 1.2.6.