Stop loading a few properties if checkpoint's dirpath has changed #12045

Merged
merged 16 commits into from
Feb 28, 2022

@krshrimali krshrimali commented Feb 22, 2022

What does this PR do?

Fixes #11379

Does your PR introduce any breaking changes? If yes, please list them.

Yes. If the checkpoint's dirpath has changed when training is resumed, the following properties won't be reloaded:

  1. best_model_score
  2. kth_best_model_path
  3. kth_value
  4. best_k_models

Only last_model_path and best_model_path will still be reloaded.

Additionally, a warning will be issued if the checkpoint's dirpath has changed.
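
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in pytorch_lightning/callbacks/model_checkpoint.py: the class `CheckpointState`, its `load_state` method, and the layout of the `saved` dictionary are hypothetical names chosen for the example, and it assumes the saved state records the dirpath it was written with.

```python
import warnings


class CheckpointState:
    """Toy stand-in for the callback's reloadable state (hypothetical)."""

    def __init__(self, dirpath):
        self.dirpath = dirpath
        self.best_model_score = None
        self.kth_best_model_path = ""
        self.kth_value = None
        self.best_k_models = {}
        self.best_model_path = ""
        self.last_model_path = ""

    def load_state(self, saved):
        # Assumption: the saved state carries the dirpath used when it was written.
        saved_dirpath = saved.get("dirpath", self.dirpath)
        if saved_dirpath == self.dirpath:
            # Same directory: restore the full top-k bookkeeping.
            self.best_model_score = saved.get("best_model_score")
            self.kth_best_model_path = saved.get("kth_best_model_path", "")
            self.kth_value = saved.get("kth_value")
            self.best_k_models = dict(saved.get("best_k_models", {}))
        else:
            # Different directory: skip the top-k bookkeeping and tell the user.
            warnings.warn(
                f"The dirpath has changed from {saved_dirpath!r} to {self.dirpath!r}; "
                "best_model_score, kth_best_model_path, kth_value and best_k_models "
                "won't be reloaded."
            )
        # These two are restored in both cases, as the description states.
        self.best_model_path = saved.get("best_model_path", "")
        self.last_model_path = saved.get("last_model_path", "")
```

Skipping `best_k_models` is what keeps the resumed run from treating files in the old directory as its own top-k candidates, so it has no reason to delete them.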

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@krshrimali krshrimali marked this pull request as ready for review February 23, 2022 05:56
@carmocca carmocca added this to the 1.6 milestone Feb 24, 2022
@carmocca carmocca added callback: model checkpoint feature Is an improvement or enhancement labels Feb 24, 2022
@mergify mergify bot added ready PRs ready to be merged has conflicts labels Feb 24, 2022
@mergify mergify bot removed the has conflicts label Feb 25, 2022

awaelchli commented Feb 25, 2022

Here is a full example of what I mean in my comments above.

  1. The metric still gets tracked and we keep saving the best checkpoints. This is IMO the expected behavior. All we want is to prevent the checkpoints in the old folder from being deleted, and this PR addresses that very well <3 However, let's change the wording in the messages and comments.

  2. A bug remains where last.ckpt gets removed from the old directory. This is demonstrated below as well with the assert statements. (I'm not requesting that this bug be solved in this PR, just pointing it out as part of my review.)

import os
import shutil

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("latest_is_best", self.global_step)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    shutil.rmtree("./first", ignore_errors=True)
    shutil.rmtree("./after-reload", ignore_errors=True)

    model = BoringModel()
    checkpoint = ModelCheckpoint(
        dirpath="./first", monitor="latest_is_best", mode="max", save_top_k=3, save_last=True,
        every_n_train_steps=1,
        filename='{epoch}-{step}-{latest_is_best}',
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=5,
        max_epochs=1,
        enable_model_summary=False,
        callbacks=[checkpoint],
    )
    trainer.fit(model, train_dataloaders=train_data)

    # NOTE: last exists here, but not later on.
    assert os.path.isfile("./first/last.ckpt")


    # ------------------------
    # RESUME WITH CHANGED PATH
    checkpoint = ModelCheckpoint(
        dirpath="./after-reload", monitor="latest_is_best", mode="max", save_top_k=3, save_last=True,
        every_n_train_steps=1,
        filename='{epoch}-{step}-{latest_is_best}',
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=5,
        max_epochs=2,   # MORE EPOCHS
        enable_model_summary=False,
        callbacks=[checkpoint],
    )
    trainer.fit(model, train_dataloaders=train_data, ckpt_path="./first/epoch=0-step=3-latest_is_best=3.0.ckpt")

    # OBSERVE: last.ckpt has disappeared from the first folder
    # BUG?
    assert not os.path.isfile("./first/last.ckpt")


if __name__ == "__main__":
    run()

@krshrimali

@awaelchli - Thank you so much for taking a look. You are right, the wording was misleading, and it should be fixed now. Regarding the bug you mentioned, that's legit, and I'll create a follow-up PR to fix it. This also involves rethinking whether last_model_path should be loaded, and modifying the tests.
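
For readers following along, here is one plausible way a reloaded last_model_path could lead to the deletion reported above. It is a toy model only, not Lightning code: `LastCheckpointSaver`, `save_last`, and `demo` are hypothetical names, and the sketch assumes the callback removes the previously tracked "last" file whenever it writes a new one to a different path.

```python
import os
import tempfile


class LastCheckpointSaver:
    """Toy 'keep only one last.ckpt' bookkeeping (hypothetical, not Lightning code)."""

    def __init__(self, dirpath):
        self.dirpath = dirpath
        self.last_model_path = ""

    def save_last(self):
        os.makedirs(self.dirpath, exist_ok=True)
        new_path = os.path.join(self.dirpath, "last.ckpt")
        with open(new_path, "w") as f:
            f.write("weights")
        previous, self.last_model_path = self.last_model_path, new_path
        # Assumption: the previously tracked "last" file is removed after a new save.
        if previous and previous != new_path and os.path.isfile(previous):
            os.remove(previous)


def demo(root):
    # First run writes ./first/last.ckpt.
    first = LastCheckpointSaver(os.path.join(root, "first"))
    first.save_last()
    old_last = first.last_model_path

    # Resumed run uses a new dirpath but reloads the old last_model_path.
    resumed = LastCheckpointSaver(os.path.join(root, "after-reload"))
    resumed.last_model_path = old_last
    resumed.save_last()
    return os.path.isfile(old_last), os.path.isfile(resumed.last_model_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(demo(root))  # prints (False, True)
```

Under that assumption, not restoring last_model_path when the dirpath differs would leave the old last.ckpt in place, which is why the follow-up needs to revisit whether it should be loaded.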

@awaelchli awaelchli changed the title Stop tracking a few properties if checkpoint's dirpath has changed Stop loading a few properties if checkpoint's dirpath has changed Feb 28, 2022
@awaelchli awaelchli enabled auto-merge (squash) February 28, 2022 16:21
codecov bot commented Feb 28, 2022

Codecov Report

Merging #12045 (f491509) into master (61dd5e4) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #12045    +/-   ##
========================================
- Coverage      92%      88%    -4%     
========================================
  Files         205      205            
  Lines       17440    17472    +32     
========================================
- Hits        15980    15336   -644     
- Misses       1460     2136   +676     
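
The rounded percentages in the report can be reproduced from the Hits and Lines rows. The snippet below is a small arithmetic check that assumes coverage is computed as hits divided by lines; `coverage_percent` is a helper name introduced for this example.

```python
def coverage_percent(hits, lines):
    """Return coverage as a percentage, assuming coverage = hits / lines."""
    return 100 * hits / lines


before = coverage_percent(15980, 17440)  # master
after = coverage_percent(15336, 17472)   # this PR

# Rounded values match the report: 92, 88, and a change of -4.
print(round(before), round(after), round(after) - round(before))
```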

Labels

  • breaking change: Includes a breaking change
  • callback: model checkpoint
  • feature: Is an improvement or enhancement
  • ready: PRs ready to be merged
Development

Successfully merging this pull request may close these issues.

Resuming training with ModelCheckpoint can delete checkpoints in other runs
4 participants