resume_from_checkpoint broken when fault-tolerant feature enabled #8835

Closed
awaelchli opened this issue Aug 10, 2021 · 1 comment · Fixed by #9371
Assignees: awaelchli
Labels: bug (Something isn't working), priority: 1 (Medium priority task)
Milestone: v1.5

awaelchli (Contributor) commented Aug 10, 2021

🐛 Bug

A Trainer configured with the resume_from_checkpoint option does not continue training on the second fit call: it stops immediately, even though max_epochs was increased.

To Reproduce

import os
import unittest.mock

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        print(batch.sum())
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def on_train_epoch_end(self):
        print("epoch ended")

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def on_load_checkpoint(self, checkpoint):
        pass


@unittest.mock.patch.dict(os.environ, {"PL_FAULT_TOLERANT_TRAINING": "1"})
def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

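    # First fit: train for a single epoch, then save a checkpoint manually.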
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=3,
        limit_val_batches=0,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
    )
    trainer.fit(model, train_dataloader=train_data)

    trainer.save_checkpoint("lightning_logs/auto.pt")

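    # Second fit: resume from the saved checkpoint with max_epochs raised to 3.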
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=3,
        limit_val_batches=0,
        num_sanity_val_steps=0,
        max_epochs=3,
        weights_summary=None,
        resume_from_checkpoint="lightning_logs/auto.pt",
    )
    trainer.fit(model, train_dataloader=train_data)


if __name__ == "__main__":
    run()

Output:

Epoch 0:   0%|          | 0/3 [00:00<?, ?it/s] tensor(6.2218)
Epoch 0:  33%|███▎      | 1/3 [00:00<00:00, 83.40it/s, loss=-1.23, v_num=82]tensor(-9.5560)
Epoch 0:  67%|██████▋   | 2/3 [00:00<00:00, 120.73it/s, loss=-1.71, v_num=82]tensor(14.5846)
Epoch 0: 100%|██████████| 3/3 [00:00<00:00, 140.56it/s, loss=-0.136, v_num=82]epoch ended
Epoch 0: 100%|██████████| 3/3 [00:00<00:00, 98.17it/s, loss=-0.136, v_num=82] 

Restoring states from the checkpoint path at lightning_logs/auto.pt
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 22712.84it/s]epoch ended
Epoch 2: 100%|██████████| 3/3 [00:00<00:00, 20004.63it/s]epoch ended
Epoch 2: 100%|██████████| 3/3 [00:00<00:00, 3038.62it/s] 

Here, training_step is not invoked at all during the second fit call; instead, each epoch ends immediately.

This ONLY happens when PL_FAULT_TOLERANT_TRAINING=1 is set; fault-tolerant training is an experimental feature that is off by default.
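Since the regression is gated on that flag, a temporary workaround (my assumption, not something verified in this report) is to leave the flag unset, or set it to "0", for the resumed run:

import os

# Hedged workaround sketch: the failure is only reported with the fault-tolerant
# flag enabled, so clearing it (or setting it to "0") before calling trainer.fit
# on the resumed run should restore the normal resume behaviour.
os.environ.pop("PL_FAULT_TOLERANT_TRAINING", None)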

Expected behavior

Training continues for two more epochs, because max_epochs was increased from 1 to 3 on the resumed run.
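A minimal sketch of how this could be asserted (CountingModel is a hypothetical helper built on the BoringModel from the reproduction above, not part of the original report):

class CountingModel(BoringModel):
    # Hypothetical helper: counts how many times training_step runs.
    def __init__(self):
        super().__init__()
        self.train_step_calls = 0

    def training_step(self, batch, batch_idx):
        self.train_step_calls += 1
        return super().training_step(batch, batch_idx)


# With limit_train_batches=3 and the resumed fit covering epochs 1 and 2,
# the second trainer.fit call should perform 2 * 3 = 6 additional training
# steps, so a counter reset to zero right before resuming should report 6.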

@awaelchli awaelchli added the bug (Something isn't working) and help wanted (Open to be worked on) labels Aug 10, 2021
@awaelchli awaelchli self-assigned this Aug 10, 2021
@awaelchli awaelchli added the priority: 0 (High priority task) label Aug 10, 2021
@awaelchli awaelchli removed the priority: 0 (High priority task), bug (Something isn't working), and help wanted (Open to be worked on) labels Aug 10, 2021
@awaelchli awaelchli reopened this Aug 10, 2021
@awaelchli awaelchli added the bug (Something isn't working) label Aug 10, 2021
@awaelchli awaelchli changed the title from "resume_from_checkpoint broken on in 1.4 - does not advance loop" to "resume_from_checkpoint broken when fault-tolerant feature enabled" Aug 10, 2021
@awaelchli awaelchli added the priority: 1 (Medium priority task) label Aug 10, 2021
@awaelchli awaelchli added this to the v1.5 milestone Aug 10, 2021