Lightning-AI / pytorch-lightning Public

Stop loading a few properties if checkpoint's `dirpath` has changed #12045

Merged

awaelchli merged 16 commits into Lightning-AI:master from krshrimali:fix/model-checkpoint-resuming

Feb 28, 2022


krshrimali (Contributor) commented Feb 22, 2022 (edited)

What does this PR do?

When training resumes with a new checkpoint `dirpath`, `on_load_checkpoint` currently reloads every property of the callback (`best_model_score`, `kth_best_model_path`, and so on). If the checkpoint path has changed, only `last_model_path` and `best_model_path` should be reloaded; `best_model_score`, `kth_best_model_path`, `kth_value` and `best_k_models` should not. See issue #11379 (Resuming training with ModelCheckpoint can delete checkpoints in other runs) and the comment by @rohitgr7 on that issue.
This PR also adds a warning when the path has changed, telling the user which properties will be reloaded and which will not.
`__resolve_ckpt_dir` has been moved from `on_pretrain_routine_start` to `setup`. See the review discussion for more context.

Fixes #11379

Does your PR introduce any breaking changes? If yes, please list them.

Yes. If the checkpoint `dirpath` has changed when training is resumed, the following properties won't be reloaded:

best_model_score
kth_best_model_path
kth_value
best_k_models

Only `last_model_path` and `best_model_path` will be reloaded.

Additionally, a warning will be emitted if the checkpoint path has changed.
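The loading rule described above can be sketched without Lightning. This is a minimal illustration, not the merged implementation: the class `CheckpointStateSketch` and its `load_state` method are hypothetical, the `dirpath` key in the saved state is an assumption, and comparing paths with `os.path.realpath` is inferred only from the commit message "get realpath now". The property names are the ones listed in this description.

```python
import os
import warnings

# Property names come from the PR description; everything else is hypothetical.
ALWAYS_LOADED = ("best_model_path", "last_model_path")
LOADED_IF_SAME_DIR = (
    "best_model_score",
    "kth_best_model_path",
    "kth_value",
    "best_k_models",
)


class CheckpointStateSketch:
    """Stand-in for a checkpoint callback; not the real ModelCheckpoint."""

    def __init__(self, dirpath):
        self.dirpath = dirpath
        self.best_model_path = ""
        self.last_model_path = ""
        self.best_model_score = None
        self.kth_best_model_path = ""
        self.kth_value = None
        self.best_k_models = {}

    def load_state(self, saved):
        """Restore state from `saved`; return True if the dirpath is unchanged."""
        # Assumption: the saved state records the dirpath it was written from.
        saved_dir = saved.get("dirpath", self.dirpath)
        same_dir = os.path.realpath(saved_dir) == os.path.realpath(self.dirpath)
        if same_dir:
            # Same folder: the top-k bookkeeping still describes files we own.
            for name in LOADED_IF_SAME_DIR:
                setattr(self, name, saved[name])
        else:
            # Different folder: skip the bookkeeping so that checkpoints of the
            # earlier run are never treated as candidates for deletion.
            warnings.warn(
                f"dirpath changed from {saved_dir!r} to {self.dirpath!r}; "
                f"{', '.join(LOADED_IF_SAME_DIR)} will not be reloaded."
            )
        for name in ALWAYS_LOADED:
            setattr(self, name, saved[name])
        return same_dir
```

With a changed `dirpath`, `load_state` warns once and leaves `best_k_models` empty, so a later top-k update has no record of the old run's files and cannot remove them.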

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃



          Handle ModelCheckpoint's on_load_checkpoint behavior for changed path

f5e1c8b

krshrimali requested review from williamFalcon, tchaton, carmocca, Borda and kaushikb11 as code owners

February 22, 2022 15:39

krshrimali marked this pull request as draft

February 22, 2022 15:39


          Update tests and code, get realpath now

12b1c89

carmocca reviewed


krshrimali added 3 commits

February 23, 2022 09:04


          Update test + what is saved and what not, address review

09a5830


          Refactor, from code review

61d5bd2


          Update doc string with a note

8004da7

krshrimali marked this pull request as ready for review

February 23, 2022 05:56

krshrimali requested review from SeanNaren, awaelchli, justusschock and rohitgr7 as code owners

February 23, 2022 05:56


          Add entry to changelog

093b6ec

rohitgr7 reviewed


krshrimali and others added 4 commits

February 23, 2022 18:47


          Apply suggestions from code review

e1add0a

Co-authored-by: Rohit Gupta <[email protected]>


          Remove raises: warning, as per review

10ccfe0


          minor typo fix in the test

ee4cdf4


          Move __resolve_ckpt_dir from on_pretrain_routine_start to setup

ac6a82f

rohitgr7 reviewed


carmocca added this to the 1.6 milestone

carmocca added the callback: model checkpoint and feature (Is an improvement or enhancement) labels

carmocca approved these changes


krshrimali and others added 2 commits

February 24, 2022 19:06


          Update pytorch_lightning/callbacks/model_checkpoint.py

8f0afec

Co-authored-by: Rohit Gupta <[email protected]>


          Add test for dirpath - dynamic

a5130d5


          Whitespace fix

6af13d6

rohitgr7 approved these changes


mergify bot added the ready (PRs ready to be merged) and has conflicts labels


          Merge branch 'master' into fix/model-checkpoint-resuming

262936d

mergify bot removed the has conflicts label

awaelchli approved these changes

awaelchli (Contributor) commented Feb 25, 2022 (edited)

Here is a full example of what I mean in my comments above.

The metric still gets tracked and we keep saving the best checkpoints. This is IMO the expected behavior. All we want is to prevent the checkpoints in the old folder from being deleted, and this PR addresses that very well <3 However, let's change the wording in the messages and comments.
A bug remains where last.ckpt gets removed from the old directory. The assert statements below demonstrate this as well. (I'm not requesting a fix for this bug in this PR, just pointing it out as part of my review.)

import os
import shutil

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("latest_is_best", self.global_step)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    # Start from clean output folders so earlier runs cannot affect the asserts.
    shutil.rmtree("./first", ignore_errors=True)
    shutil.rmtree("./after-reload", ignore_errors=True)

    model = BoringModel()
    # First run: save a checkpoint at every training step into ./first.
    checkpoint = ModelCheckpoint(
        dirpath="./first",
        monitor="latest_is_best",
        mode="max",
        save_top_k=3,
        save_last=True,
        every_n_train_steps=1,
        filename="{epoch}-{step}-{latest_is_best}",
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=5,
        max_epochs=1,
        enable_model_summary=False,
        callbacks=[checkpoint],
    )
    trainer.fit(model, train_dataloaders=train_data)

    # NOTE: last.ckpt exists here, but not later on.
    assert os.path.isfile("./first/last.ckpt")

    # ------------------------
    # RESUME WITH CHANGED PATH
    # Same callback settings, but checkpoints now go to ./after-reload.
    checkpoint = ModelCheckpoint(
        dirpath="./after-reload",
        monitor="latest_is_best",
        mode="max",
        save_top_k=3,
        save_last=True,
        every_n_train_steps=1,
        filename="{epoch}-{step}-{latest_is_best}",
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=5,
        max_epochs=2,  # MORE EPOCHS
        enable_model_summary=False,
        callbacks=[checkpoint],
    )
    # Resume from a checkpoint written by the first run.
    trainer.fit(model, train_dataloaders=train_data, ckpt_path="./first/epoch=0-step=3-latest_is_best=3.0.ckpt")

    # OBSERVE: last.ckpt has disappeared from the first folder
    # BUG?
    assert not os.path.isfile("./first/last.ckpt")


if __name__ == "__main__":
    run()
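The final assert in the script checks a single file. A more general check, sketched here with hypothetical helper names (`snapshot`, `missing_after`) and only the standard library, records which checkpoint files exist in the old folder before resuming and reports any that are gone afterwards. The temporary directory and the `os.remove` call only simulate a resumed run; they are assumptions for the demonstration.

```python
import os
import tempfile


def snapshot(dirpath):
    """Return the set of .ckpt file names currently present in dirpath."""
    if not os.path.isdir(dirpath):
        return set()
    return {name for name in os.listdir(dirpath) if name.endswith(".ckpt")}


def missing_after(before, dirpath):
    """Return the sorted names from `before` that no longer exist in dirpath."""
    return sorted(before - snapshot(dirpath))


# Demonstration with a temporary folder standing in for "./first".
with tempfile.TemporaryDirectory() as old_dir:
    for name in ("last.ckpt", "epoch=0-step=3.ckpt"):
        open(os.path.join(old_dir, name), "w").close()

    before = snapshot(old_dir)
    # Simulate what the resumed run does to the old folder in the script above.
    os.remove(os.path.join(old_dir, "last.ckpt"))
    lost = missing_after(before, old_dir)
```

In a real test, `snapshot("./first")` would be taken after the first `trainer.fit` call and `missing_after` evaluated after the second, with an assertion that the returned list is empty.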


krshrimali commented



          Apply suggestions from code review

a9ed0e6

krshrimali (Contributor, Author) commented Feb 28, 2022

@awaelchli - Thank you so much for taking a look. You are right, the wording was misleading, and it should be fixed now. The bug you mentioned is legit, and I'll create a follow-up PR to fix it. That will also involve rethinking whether last_model_path should be loaded, and modifying the tests.



          move to the note section before, as per review

f491509

awaelchli changed the title ~~Stop tracking a few properties if checkpoint's dirpath has changed~~ Stop loading a few properties if checkpoint's dirpath has changed

awaelchli enabled auto-merge (squash)

February 28, 2022 16:21


codecov bot commented Feb 28, 2022

Codecov Report

Merging #12045 (f491509) into master (61dd5e4) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #12045    +/-   ##
========================================
- Coverage      92%      88%    -4%     
========================================
  Files         205      205            
  Lines       17440    17472    +32     
========================================
- Hits        15980    15336   -644     
- Misses       1460     2136   +676
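The rounded percentages in the report follow from the Hits and Misses rows. The snippet below is a small check of that arithmetic, with a hypothetical helper name `coverage_percent`; the numbers are copied from the table above.

```python
def coverage_percent(hits, misses):
    """Coverage as a percentage of all measured lines (hits + misses)."""
    return 100.0 * hits / (hits + misses)


# Values from the Codecov table: (hits, misses) for each side of the diff.
master_cov = coverage_percent(15980, 1460)  # 17440 lines in total
pr_cov = coverage_percent(15336, 2136)      # 17472 lines in total
drop = master_cov - pr_cov
```

Rounded to whole numbers these give the 92%, 88% and 4% shown in the report. Since the diff itself is reported as 100% covered, the drop reflects 644 fewer covered lines elsewhere in the measurement, not uncovered lines added by this PR.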


awaelchli merged commit 02ccd87 into Lightning-AI:master

rohitgr7 mentioned this pull request

Resuming training with ModelCheckpoint can delete checkpoints in other runs #11379

Closed

krshrimali mentioned this pull request

Prevent last checkpoint being deleted after resumed training with changed dirpath #12225

Merged

12 tasks

krshrimali added the breaking change Includes a breaking change label


Reviewers

awaelchli: approved these changes
rohitgr7: approved these changes
carmocca: approved these changes
williamFalcon: awaiting requested review
tchaton: awaiting requested review (code owner)
Borda: awaiting requested review (code owner)
kaushikb11: awaiting requested review
SeanNaren: awaiting requested review
justusschock: awaiting requested review (code owner)

Assignees

No one assigned

Labels

breaking change (Includes a breaking change)
callback: model checkpoint
feature (Is an improvement or enhancement)
ready (PRs ready to be merged)

Projects

None yet

Milestone

pl:1.6

Development

Successfully merging this pull request may close these issues.

Resuming training with ModelCheckpoint can delete checkpoints in other runs

4 participants
