Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix last checkpoint finding in filtered files with correct extension #17072

Merged
merged 14 commits into from
Nov 21, 2023

Conversation

yassersouri
Copy link
Contributor

@yassersouri yassersouri commented Mar 14, 2023

What does this PR do?

Simple change. The last checkpoint candidate finding code, would look at all the files with "last" in their name.
But we could use a better filtering mechanism to include files that have "last" in their name and have the correct extension.

(since the fix is very simple, I didn't create an issue for it)

I don't think this breaks anything.

What was my issue?

After the training finished, I was saving some other files in the directory with "last" and "best" in their names. These files were tsv, text or other kind of files.
After the save, the testing using ckpt_path="last" would stop working and I noticed in the logs that one of the files that I had created was being loaded instead of the correct "last.ckpt" file.
I have tested locally and with this simple change, this issue for me is resolved.

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Mar 14, 2023
Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a simple test for the added logic and a CHANGELOG entry?

src/lightning/pytorch/callbacks/model_checkpoint.py Outdated Show resolved Hide resolved
@carmocca carmocca added bug Something isn't working callback: model checkpoint labels Mar 14, 2023
@carmocca carmocca added this to the v1.9.x milestone Mar 14, 2023
@carmocca carmocca added the community This PR is from the community label Mar 14, 2023
@yassersouri
Copy link
Contributor Author

yassersouri commented Mar 14, 2023

@carmocca Where are the tests for the model_checkpoint.py? I couldn't find it in tests/tests_pytorch/callbacks/.

@awaelchli awaelchli changed the title Fix last checkpoint finding by only looking at files that have the correct extention Fix last checkpoint finding by only looking at files that have the correct extension Mar 21, 2023
@carmocca
Copy link
Contributor

They are in tests/tests_pytorch/checkpointing/test_model_checkpoint.py

@Borda Borda changed the title Fix last checkpoint finding by only looking at files that have the correct extension Fix last checkpoint finding in filtered files with correct extension Apr 28, 2023
Copy link

codecov bot commented Nov 18, 2023

Codecov Report

Merging #17072 (abed697) into master (4789905) will decrease coverage by 35%.
Report is 7 commits behind head on master.
The diff coverage is 100%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #17072      +/-   ##
==========================================
- Coverage      84%      49%     -35%     
==========================================
  Files         443      435       -8     
  Lines       36154    36007     -147     
==========================================
- Hits        30252    17610   -12642     
- Misses       5902    18397   +12495     

@awaelchli awaelchli requested a review from carmocca November 19, 2023 20:08
@mergify mergify bot added the ready PRs ready to be merged label Nov 19, 2023
@yassersouri
Copy link
Contributor Author

yassersouri commented Nov 20, 2023 via email

@awaelchli awaelchli self-assigned this Nov 20, 2023
@awaelchli awaelchli merged commit 67d3844 into Lightning-AI:master Nov 21, 2023
89 of 90 checks passed
Borda pushed a commit that referenced this pull request Dec 19, 2023
lantiga pushed a commit that referenced this pull request Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working callback: model checkpoint community This PR is from the community pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants