Save ModelCheckpoint's last.ckpt as symlink if possible #18748

Merged on Oct 11, 2023 (16 commits)

Conversation

awaelchli

@awaelchli awaelchli commented Oct 8, 2023

What does this PR do?

Fixes #18670
Fixes #14973
Part of #4335

This PR changes the ModelCheckpoint's behavior when save_last=True:

  • If save_top_k != 0, and save_last=True, then last.ckpt will be a symlink to the latest top-k checkpoint
  • If save_top_k == 0, then last.ckpt remains a regular checkpoint
  • On remote filesystems, last.ckpt is still saved as a full copy because symlinks are not supported there; exploring alternative solutions is left for future work
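
The three rules above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the code merged in this PR: the names `last_ckpt_strategy` and `point_last_at` are hypothetical, and the local/remote distinction is reduced to a boolean flag as an assumption.

```python
import os
import tempfile
from pathlib import Path


def last_ckpt_strategy(save_last: bool, save_top_k: int, is_local: bool) -> str:
    """Hypothetical helper: decide how last.ckpt should be written."""
    if not save_last:
        return "none"      # last.ckpt is not requested at all
    if save_top_k == 0:
        return "regular"   # no top-k checkpoint exists to link to
    if not is_local:
        return "copy"      # remote filesystems: fall back to a full copy
    return "symlink"       # local filesystem: link to the latest top-k file


def point_last_at(latest: Path, last: Path) -> None:
    """Hypothetical helper: make `last` a symlink to `latest` (same directory)."""
    if last.is_symlink() or last.exists():
        last.unlink()  # drop the stale link or file first
    # A relative target keeps the link valid if the directory is moved.
    # Note: creating symlinks may require extra privileges on Windows.
    os.symlink(latest.name, last)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "epoch=1-step=200.ckpt"
        ckpt.write_bytes(b"weights")
        last = Path(tmp) / "last.ckpt"
        if last_ckpt_strategy(True, 3, True) == "symlink":
            point_last_at(ckpt, last)
        print(last.is_symlink(), os.readlink(last))
```

Reading `last.ckpt` follows the link, so code that loads the checkpoint by its fixed name does not need to know which strategy was used.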

This keeps the benefit of a deterministic file name for loading the last checkpoint while avoiding redundant copies. LLM checkpoints can exceed 100 GB, and saving a full copy every time is both time-consuming and a waste of disk space.

cc @Borda @carmocca @awaelchli

@awaelchli awaelchli added the "callback: model checkpoint" and "fun" labels Oct 8, 2023
@github-actions github-actions bot added the "pl" label Oct 8, 2023
@awaelchli awaelchli added the "feature" label and removed the "pl" label Oct 8, 2023
@awaelchli awaelchli changed the title from "Save last.ckpt as a symlink in ModelCheckpoint" to "Save ModelCheckpoint's last.ckpt as symlink" Oct 8, 2023
@awaelchli awaelchli changed the title from "Save ModelCheckpoint's last.ckpt as symlink" to "Save ModelCheckpoint's last.ckpt as symlink if possible" Oct 8, 2023
@github-actions github-actions bot added the "pl" label Oct 8, 2023
@awaelchli awaelchli marked this pull request as ready for review October 8, 2023 18:14

github-actions bot commented Oct 8, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.12, oldest) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.10, 2.1) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.12, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1) success
pl-cpu (windows-2022, lightning, 3.8, 1.12, oldest) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.10, 2.1) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.1) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.1) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.1) success

These checks are required after the changes to src/lightning/pytorch/callbacks/model_checkpoint.py, tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py, tests/tests_pytorch/checkpointing/test_model_checkpoint.py, tests/tests_pytorch/models/test_restore.py, tests/tests_pytorch/plugins/test_checkpoint_io_plugin.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
[pytorch-lightning (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=178891&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success
[pytorch-lightning (GPUs) (testing PyTorch latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=178891&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success

These checks are required after the changes to src/lightning/pytorch/callbacks/model_checkpoint.py, tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py, tests/tests_pytorch/checkpointing/test_model_checkpoint.py, tests/tests_pytorch/models/test_restore.py, tests/tests_pytorch/plugins/test_checkpoint_io_plugin.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/pytorch/callbacks/model_checkpoint.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/callbacks/model_checkpoint.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/pytorch/callbacks/model_checkpoint.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/pytorch/callbacks/model_checkpoint.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.


codecov bot commented Oct 8, 2023

Codecov Report

Merging #18748 (02331ea) into master (7434c47) will decrease coverage by 34%.
The diff coverage is 100%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18748      +/-   ##
==========================================
- Coverage      83%      49%     -34%     
==========================================
  Files         439      431       -8     
  Lines       34469    34324     -145     
==========================================
- Hits        28706    16871   -11835     
- Misses       5763    17453   +11690     


@carmocca carmocca left a comment

This PR shouldn't close #4335, which advocates for splitting the ModelCheckpoint class into smaller pieces with separate functionality, instead of the current mess of all flags interacting with each other in a single class.

src/lightning/pytorch/callbacks/model_checkpoint.py (outdated, resolved)
src/lightning/pytorch/callbacks/model_checkpoint.py (outdated, resolved)
tests/tests_pytorch/models/test_restore.py (resolved)
@mergify mergify bot removed the has conflicts label Oct 10, 2023
@mergify mergify bot added the "ready" label and removed the "has conflicts" and "ready" labels Oct 11, 2023
Labels
  • callback: model checkpoint
  • feature (Is an improvement or enhancement)
  • fun (Staff contributions outside working hours - to differentiate from the "community" label)
  • pl (Generic label for PyTorch Lightning package)
  • ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • save_last: True saves 2 checkpoints every time
  • Store last.ckpt as symlink when appropriate to save space
3 participants