Fix last checkpoint finding in filtered files with correct extension #17072
Can you add a simple test for the added logic and a CHANGELOG entry?
@carmocca Where are the tests for the |
They are in |
Codecov Report
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #17072     +/-   ##
=========================================
- Coverage      84%      49%    -35%
=========================================
  Files         443      435      -8
  Lines       36154    36007    -147
=========================================
- Hits        30252    17610  -12642
- Misses       5902    18397  +12495
Hi. At the moment I don't have the capacity.
…On Mon, Nov 20, 2023 at 7:22 AM Carlos Mocholí ***@***.***> wrote:
***@***.**** commented on this pull request.
In src/lightning/pytorch/callbacks/model_checkpoint.py:
> if self.CHECKPOINT_NAME_LAST in os.path.split(p)[1]
+ and os.path.split(p)[1].endswith(self.FILE_EXTENSION)
@yassersouri are you still interested in finishing this PR?
…17072) Co-authored-by: awaelchli <[email protected]> (cherry picked from commit 67d3844)
What does this PR do?
A simple change. The code that finds last-checkpoint candidates looked at every file with "last" in its name.
A better filter accepts only files that have "last" in their name and also have the correct extension.
(Since the fix is very simple, I didn't create an issue for it.)
I don't think this breaks anything.
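To make the change concrete, here is a minimal standalone sketch of the filtering idea, not the actual Lightning code. The module-level constants stand in for `self.CHECKPOINT_NAME_LAST` and `self.FILE_EXTENSION`, and their values (`"last"`, `".ckpt"`) are assumptions based on the `last.ckpt` file name mentioned in this PR. The helper name `find_last_checkpoint_candidates` and the example paths are hypothetical.

```python
import os

# Assumed stand-ins for ModelCheckpoint.CHECKPOINT_NAME_LAST and
# ModelCheckpoint.FILE_EXTENSION; the values are illustrative assumptions.
CHECKPOINT_NAME_LAST = "last"
FILE_EXTENSION = ".ckpt"


def find_last_checkpoint_candidates(paths):
    """Keep paths whose file name contains "last" AND ends with the checkpoint extension."""
    candidates = set()
    for p in paths:
        name = os.path.split(p)[1]  # file name without the directory part
        if CHECKPOINT_NAME_LAST in name and name.endswith(FILE_EXTENSION):
            candidates.add(p)
    return candidates


# Hypothetical directory contents: two checkpoints and two unrelated files.
paths = [
    "ckpts/last.ckpt",
    "ckpts/last-v1.ckpt",
    "ckpts/last_results.tsv",  # has "last" in its name but is not a checkpoint
    "ckpts/epoch=3.ckpt",      # a checkpoint, but not a "last" one
]
print(sorted(find_last_checkpoint_candidates(paths)))
```

Without the `endswith` check, `last_results.tsv` would also be treated as a candidate, which is the behaviour this PR removes.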
What was my issue?
After training finished, I was saving some other files in the directory with "last" and "best" in their names. These were TSV, text, and other kinds of files.
After saving them, testing with
ckpt_path="last"
would stop working, and I noticed in the logs that one of the files I had created was being loaded instead of the correct "last.ckpt" file. I have tested locally, and with this simple change the issue is resolved for me.