Best model path is None and best model score is always 0 when using save_top_k=-1 #2586

pamparana34 · 2020-07-11T20:16:44Z

I am currently using lightning 0.8.4 and configuring model checkpoint and doing the training as described in the docs. However, the checkpoint best_model_path is always None and the best_model_sccore is 0.

Here is my usage:

checkpoint_callback = ModelCheckpoint(filepath=hparams['chkpt_dir'],
                                          save_top_k=-1)
trainer = Trainer(gpus=1,
                          distributed_backend='dp',
                          logger=logger,
                          early_stop_callback=early_stop_callback,
                          checkpoint_callback=checkpoint_callback,
                          min_epochs=1, max_epochs=50)
model = MyNet(params)
trainer.fit(model)
print('Best path: ', checkpoint_callback.best_model_path)  # None
print('Best score: ', checkpoint_callback.best_model_score)  # 0

I can confirm the model does indeed train. Here is the console output:

Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  2.00s/it]Mean Val loss:  0.9996303915977478
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 692/693 [02:06<00:00,  5.47it/s, loss=0.222, v_num=20]Mean Val loss:  0.21550895273685455█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.55it/s]
Epoch 2: 100%|███████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.39it/s, loss=0.176, v_num=20, validation_loss=0.216, step=0]Mean Val loss:  0.18335537612438202█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.50it/s]
Epoch 3: 100%|███████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.47it/s, loss=0.156, v_num=20, validation_loss=0.183, step=1]Mean Val loss:  0.1659010648727417██████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.67it/s]
Epoch 4: 100%|███████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.46it/s, loss=0.119, v_num=20, validation_loss=0.166, step=2]Mean Val loss:  0.13563179969787598█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.66it/s]
Epoch 5: 100%|███████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.40it/s, loss=0.120, v_num=20, validation_loss=0.136, step=3]Mean Val loss:  0.13059481978416443█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.60it/s]
Epoch 6: 100%|███████████████████████████████████████████████████████| 693/693 [01:05<00:00, 10.51it/s, loss=0.094, v_num=20, validation_loss=0.131, step=4]Mean Val loss:  0.11244599521160126█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.59it/s]
Epoch 7: 100%|███████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.43it/s, loss=0.089, v_num=20, validation_loss=0.112, step=5]Mean Val loss:  0.10584709048271179█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.31it/s]
Epoch 8: 100%|███████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.43it/s, loss=0.081, v_num=20, validation_loss=0.106, step=6]Mean Val loss:  0.09358386695384979█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.61it/s]
Epoch 9: 100%|██████████████████████████████████████████████████████| 693/693 [01:05<00:00, 10.55it/s, loss=0.072, v_num=20, validation_loss=0.0936, step=7]Mean Val loss:  0.09778335690498352█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:02<00:00,  2.73it/s]
Epoch 10: 100%|█████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.46it/s, loss=0.077, v_num=20, validation_loss=0.0978, step=8]Mean Val loss:  0.11586500704288483█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.61it/s]
Epoch 11: 100%|██████████████████████████████████████████████████████| 693/693 [01:05<00:00, 10.52it/s, loss=0.072, v_num=20, validation_loss=0.116, step=9]Mean Val loss:  0.08725202083587646█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:02<00:00,  2.76it/s]
Epoch 12: 100%|████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.49it/s, loss=0.072, v_num=20, validation_loss=0.0873, step=10]Mean Val loss:  0.1264028400182724██████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.68it/s]
Epoch 13: 100%|█████████████████████████████████████████████████████| 693/693 [01:06<00:00, 10.48it/s, loss=0.067, v_num=20, validation_loss=0.126, step=11]Mean Val loss:  0.10011263191699982█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:03<00:00,  2.70it/s]
Epoch 14: 100%|███████████████████████████████████████████████████████| 693/693 [01:05<00:00, 10.60it/s, loss=0.073, v_num=20, validation_loss=0.1, step=12]Mean Val loss:  0.08872538805007935█████████████████████████████████████████████████████████████████████████████▍             | 7/8 [00:02<00:00,  2.81it/s]
Epoch 14: 100%|████████████████████████████████████████████████████| 693/693 [01:05<00:00, 10.59it/s, loss=0.073, v_num=20, validation_loss=0.0887, step=13Epoch 00014: early stopping triggered.
Epoch 14: 100%|████████████████████████████████████████████████████| 693/693 [01:05<00:00, 10.58it/s, loss=0.073, v_num=20, validation_loss=0.0887, step=13]
Best path:
Best score:  0

The text was updated successfully, but these errors were encountered:

rohitgr7 · 2020-07-12T11:15:12Z

Tried on 0.8.4. Working for me. Can you share a colab notebook with code??

pamparana34 · 2020-07-12T20:30:00Z

ok, so this only happens when save_top_k=-1. If I do save_top_k=1, this seems to work. Is that expected? Surely the utility of this method only comes into play when models besides the best one are also saved?

@williamFalcon

rohitgr7 · 2020-07-12T21:00:17Z

Yeah! not working when save_top_k=-1. I think it should work regardless of save_top_k value. The problem is here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1d565e175d98103c2ebd6164e681f76143501da9/pytorch_lightning/callbacks/model_checkpoint.py#L293
This is called only when the value is save_top_k=-1:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1d565e175d98103c2ebd6164e681f76143501da9/pytorch_lightning/callbacks/model_checkpoint.py#L309
And this is updated in self._do_check_save:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1d565e175d98103c2ebd6164e681f76143501da9/pytorch_lightning/callbacks/model_checkpoint.py#L337-L339

pamparana34 · 2020-07-15T12:31:29Z

@williamFalcon Should I file a more detailed bug report about this?

pamparana34 · 2020-07-27T13:46:58Z

bump: Is there any interest in this?

Borda · 2020-09-15T17:45:16Z

@pamparana34 sorry for the delay, mind test actual master, and if it is still there share a complete example to reproduce...
Ideally, an example that can be also used as a test to prevent such behavior in the future...

rohitgr7 · 2020-09-15T18:07:09Z

@Borda this is still an issue.

edenlightning · 2020-09-16T18:36:49Z

@pamparana34 could you send us a failing test case with your example?

pamparana34 added bug Something isn't working help wanted Open to be worked on labels Jul 11, 2020

Borda added the information needed label Sep 15, 2020

Borda added the priority: 0 High priority task label Sep 15, 2020

edenlightning added this to the 0.9.x milestone Sep 16, 2020

edenlightning changed the title ~~Best model path is None and best model score is always 0.~~ Best model path is None and best model score is always 0 when using save_top_k=-1 Sep 16, 2020

edenlightning added the checkpointing Related to checkpointing label Sep 21, 2020

edenlightning assigned Borda Sep 21, 2020

edenlightning added the v1.0 allowed label Sep 22, 2020

rohitgr7 mentioned this issue Sep 30, 2020

Support best model checkpoint path even if save_top_k=-1 #3732

Closed

ananthsub mentioned this issue Sep 30, 2020

Finish Allow on_save_checkpoint... #3688

Merged

awaelchli mentioned this issue Sep 30, 2020

Make ModelCheckpoint(save_top_k=-1) track the best models #3735

Merged

williamFalcon closed this as completed in #3735 Sep 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Best model path is None and best model score is always 0 when using save_top_k=-1 #2586

Best model path is None and best model score is always 0 when using save_top_k=-1 #2586

pamparana34 commented Jul 11, 2020 •

edited

Loading

rohitgr7 commented Jul 12, 2020

pamparana34 commented Jul 12, 2020

rohitgr7 commented Jul 12, 2020

pamparana34 commented Jul 15, 2020

pamparana34 commented Jul 27, 2020

Borda commented Sep 15, 2020

rohitgr7 commented Sep 15, 2020

edenlightning commented Sep 16, 2020

Best model path is None and best model score is always 0 when using save_top_k=-1 #2586

Best model path is None and best model score is always 0 when using save_top_k=-1 #2586

Comments

pamparana34 commented Jul 11, 2020 • edited Loading

rohitgr7 commented Jul 12, 2020

pamparana34 commented Jul 12, 2020

rohitgr7 commented Jul 12, 2020

pamparana34 commented Jul 15, 2020

pamparana34 commented Jul 27, 2020

Borda commented Sep 15, 2020

rohitgr7 commented Sep 15, 2020

edenlightning commented Sep 16, 2020

pamparana34 commented Jul 11, 2020 •

edited

Loading