Best model path is None and best model score is always 0 when using save_top_k=-1 #2586

Closed
pamparana34 opened this issue Jul 11, 2020 · 8 comments · Fixed by #3735
Labels: bug, checkpointing, help wanted, priority: 0

pamparana34 commented Jul 11, 2020

I am using Lightning 0.8.4. I configure ModelCheckpoint and run training as described in the docs. However, after training, the checkpoint callback's best_model_path is always None and best_model_score is always 0.

Here is my usage:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep a checkpoint for every epoch (save_top_k=-1).
checkpoint_callback = ModelCheckpoint(filepath=hparams['chkpt_dir'],
                                      save_top_k=-1)
trainer = Trainer(gpus=1,
                  distributed_backend='dp',
                  logger=logger,
                  early_stop_callback=early_stop_callback,
                  checkpoint_callback=checkpoint_callback,
                  min_epochs=1, max_epochs=50)
model = MyNet(params)
trainer.fit(model)
print('Best path: ', checkpoint_callback.best_model_path)  # None
print('Best score: ', checkpoint_callback.best_model_score)  # 0

I can confirm the model does indeed train. Here is the console output:

Validation sanity check: 100% 2/2 [00:02<00:00,  2.00s/it]
Mean Val loss:  0.9996303915977478
Epoch 1: 100% 692/693 [02:06<00:00,  5.47it/s, loss=0.222, v_num=20]
Mean Val loss:  0.21550895273685455
Epoch 2: 100% 693/693 [01:06<00:00, 10.39it/s, loss=0.176, v_num=20, validation_loss=0.216, step=0]
Mean Val loss:  0.18335537612438202
Epoch 3: 100% 693/693 [01:06<00:00, 10.47it/s, loss=0.156, v_num=20, validation_loss=0.183, step=1]
Mean Val loss:  0.1659010648727417
Epoch 4: 100% 693/693 [01:06<00:00, 10.46it/s, loss=0.119, v_num=20, validation_loss=0.166, step=2]
Mean Val loss:  0.13563179969787598
Epoch 5: 100% 693/693 [01:06<00:00, 10.40it/s, loss=0.120, v_num=20, validation_loss=0.136, step=3]
Mean Val loss:  0.13059481978416443
Epoch 6: 100% 693/693 [01:05<00:00, 10.51it/s, loss=0.094, v_num=20, validation_loss=0.131, step=4]
Mean Val loss:  0.11244599521160126
Epoch 7: 100% 693/693 [01:06<00:00, 10.43it/s, loss=0.089, v_num=20, validation_loss=0.112, step=5]
Mean Val loss:  0.10584709048271179
Epoch 8: 100% 693/693 [01:06<00:00, 10.43it/s, loss=0.081, v_num=20, validation_loss=0.106, step=6]
Mean Val loss:  0.09358386695384979
Epoch 9: 100% 693/693 [01:05<00:00, 10.55it/s, loss=0.072, v_num=20, validation_loss=0.0936, step=7]
Mean Val loss:  0.09778335690498352
Epoch 10: 100% 693/693 [01:06<00:00, 10.46it/s, loss=0.077, v_num=20, validation_loss=0.0978, step=8]
Mean Val loss:  0.11586500704288483
Epoch 11: 100% 693/693 [01:05<00:00, 10.52it/s, loss=0.072, v_num=20, validation_loss=0.116, step=9]
Mean Val loss:  0.08725202083587646
Epoch 12: 100% 693/693 [01:06<00:00, 10.49it/s, loss=0.072, v_num=20, validation_loss=0.0873, step=10]
Mean Val loss:  0.1264028400182724
Epoch 13: 100% 693/693 [01:06<00:00, 10.48it/s, loss=0.067, v_num=20, validation_loss=0.126, step=11]
Mean Val loss:  0.10011263191699982
Epoch 14: 100% 693/693 [01:05<00:00, 10.60it/s, loss=0.073, v_num=20, validation_loss=0.1, step=12]
Mean Val loss:  0.08872538805007935
Epoch 14: 100% 693/693 [01:05<00:00, 10.59it/s, loss=0.073, v_num=20, validation_loss=0.0887, step=13]
Epoch 00014: early stopping triggered.
Epoch 14: 100% 693/693 [01:05<00:00, 10.58it/s, loss=0.073, v_num=20, validation_loss=0.0887, step=13]
Best path:
Best score:  0
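
For reference, the "Mean Val loss" values in the console output imply a best score well above 0. The following is a minimal sketch, assuming the monitored quantity is the printed mean validation loss and that lower is better (the issue does not state the monitor explicitly). The variable names are illustrative only.

```python
# Mean validation losses copied from the console output above,
# one per completed validation run (the sanity check is excluded).
val_losses = [
    0.21550895273685455,
    0.18335537612438202,
    0.1659010648727417,
    0.13563179969787598,
    0.13059481978416443,
    0.11244599521160126,
    0.10584709048271179,
    0.09358386695384979,
    0.09778335690498352,
    0.11586500704288483,
    0.08725202083587646,
    0.1264028400182724,
    0.10011263191699982,
    0.08872538805007935,
]

# Assumption: lower validation loss is better, so the best run is the minimum.
best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
best_score = val_losses[best_index]

print(f"best validation run: {best_index + 1}, best score: {best_score:.4f}")
# prints: best validation run: 11, best score: 0.0873
```

Under these assumptions, a correctly tracked best_model_score would be about 0.0873 rather than 0.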

pamparana34 added the bug and help wanted labels on Jul 11, 2020

@rohitgr7
Contributor

I tried this on 0.8.4 and it works for me. Can you share a Colab notebook with the code?

[Image: Screenshot from 2020-07-12 16-43-04]

@pamparana34
Author

OK, so this only happens when save_top_k=-1. With save_top_k=1, it seems to work. Is that expected? Surely best_model_path and best_model_score are most useful precisely when checkpoints other than the best one are also saved?

@williamFalcon

@pamparana34
Author

@williamFalcon Should I file a more detailed bug report about this?

@pamparana34
Author

bump: Is there any interest in this?

@Borda
Member

Borda commented Sep 15, 2020

@pamparana34 sorry for the delay. Would you mind testing against the current master and, if the problem is still there, sharing a complete example that reproduces it?
Ideally, the example could also be used as a test to prevent such behavior in the future.
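
A regression test of the kind requested here could take roughly the following shape. This is only a sketch: `ToyCheckpoint`, its `on_validation_end(epoch, score)` method, and the `epoch=N.ckpt` file names are hypothetical stand-ins written for illustration, not Lightning's ModelCheckpoint. A real test would construct the actual callback and run a short Trainer.fit instead. The sketch assumes save_top_k is either -1 (keep everything) or a positive integer, and that a lower score is better.

```python
class ToyCheckpoint:
    """Hypothetical stand-in for a checkpoint callback (not Lightning's API)."""

    def __init__(self, save_top_k=1):
        self.save_top_k = save_top_k
        self.best_model_path = ""
        self.best_model_score = None
        self.saved = {}  # path -> score for the checkpoints currently kept

    def on_validation_end(self, epoch, score):
        path = f"epoch={epoch}.ckpt"  # hypothetical file name
        # Track the best score no matter how many checkpoints are kept.
        if self.best_model_score is None or score < self.best_model_score:
            self.best_model_score = score
            self.best_model_path = path
        self.saved[path] = score
        # save_top_k == -1 keeps everything; otherwise drop the worst one.
        if self.save_top_k > 0 and len(self.saved) > self.save_top_k:
            worst = max(self.saved, key=self.saved.get)
            del self.saved[worst]


def check_best_is_tracked(save_top_k):
    """The property under test: best path and score are set for any save_top_k."""
    ckpt = ToyCheckpoint(save_top_k=save_top_k)
    for epoch, score in enumerate([0.216, 0.183, 0.0873, 0.126]):
        ckpt.on_validation_end(epoch, score)
    assert ckpt.best_model_path == "epoch=2.ckpt"
    assert ckpt.best_model_score == 0.0873
    return ckpt


# The reported bug only appeared with save_top_k=-1, so cover it explicitly.
for k in (-1, 1, 2):
    check_best_is_tracked(k)
```

Parametrizing over save_top_k, including -1, is the point of the sketch: the reported behavior only showed up for that value, so a test that covers only the default would not have caught it.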

@rohitgr7
Contributor

@Borda this is still an issue.

Borda added the priority: 0 label on Sep 15, 2020
edenlightning added this to the 0.9.x milestone on Sep 16, 2020

@edenlightning
Contributor

@pamparana34 could you send us a failing test case with your example?


edenlightning changed the title from "Best model path is None and best model score is always 0." to "Best model path is None and best model score is always 0 when using save_top_k=-1" on Sep 16, 2020
edenlightning added the checkpointing label on Sep 21, 2020

4 participants