
ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch #11979

Closed
ShaneTian opened this issue Feb 18, 2022 · 8 comments
Labels: bug (Something isn't working), callback: model checkpoint

ShaneTian commented Feb 18, 2022

🐛 Bug

ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch.
PyTorch Lightning does NOT run the validation loop if val_check_interval is greater than the number of training steps in an epoch.

To Reproduce

In my experiment, the number of training steps in an epoch is about 110.

  • If I set every_n_train_steps and val_check_interval to 100, ModelCheckpoint and the validation loop work as expected.
log_steps = 100
valid_steps = 100
save_steps = 100

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=100
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps
)
  • If I set every_n_train_steps and val_check_interval to 120, both fail: ModelCheckpoint does not save anything and PyTorch Lightning never runs the validation loop.
log_steps = 120
valid_steps = 120
save_steps = 120

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=120
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps
)

Expected behavior

Is this expected? Can't I set these two parameters to values larger than the number of training steps in an epoch?

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.5.10
  • PyTorch Version (e.g., 1.10): 1.10.2+cu113
  • Python version (e.g., 3.9): 3.7.10
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: CUDA V11.0.221
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip

cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

ShaneTian added the bug (Something isn't working) label Feb 18, 2022
carmocca (Contributor) commented:

@ShaneTian Is your expectation that ModelCheckpoint would run in the 10th batch of the second epoch given your example?

These values are always assumed to be constrained within the number of batches. We could raise a warning if that's not the case.
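
For illustration, such a warning could amount to a simple comparison between the configured interval and the number of training batches per epoch. This is only a minimal sketch with placeholder names (warn_if_interval_never_triggers, interval, num_training_batches), not actual Trainer code:

import warnings

def warn_if_interval_never_triggers(interval: int, num_training_batches: int) -> None:
    # Sketch only: a per-epoch interval larger than the epoch length can never fire.
    if interval > num_training_batches:
        warnings.warn(
            f"Interval {interval} is larger than the number of training batches "
            f"per epoch ({num_training_batches}); it will never trigger within an epoch."
        )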

rohitgr7 (Contributor) commented:

val_check_interval is constrained to the number of batches within an epoch, but I think every_n_train_steps of ModelCheckpoint isn't.
https://github.com/PyTorchLightning/pytorch-lightning/blob/9b0942d7315390ba01ee25ff90587e40435b5a8d/pytorch_lightning/callbacks/model_checkpoint.py#L267-L279

I'd suggest we think about extending val_check_interval to work with the global step instead, after v1.6, since this issue has come up a lot in the past.
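
For context, the every_n_train_steps check in the linked lines essentially reduces to a modulo test against trainer.global_step. A simplified paraphrase (the function name is just for illustration, not the exact library source):

def should_skip_step_checkpoint(global_step: int, every_n_train_steps: int) -> bool:
    # Simplified paraphrase of the linked ModelCheckpoint logic: skip saving unless
    # step-based checkpointing is enabled and the current global step is an exact
    # multiple of the configured interval.
    return every_n_train_steps < 1 or global_step % every_n_train_steps != 0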

ShaneTian (Author) commented:

@ShaneTian Is your expectation that ModelCheckpoint would run in the 10th batch of the second epoch given your example?

These values are always assumed to be constrained within the number of batches. We could raise a warning if that's not the case.

Yes, intuitively, it should be a global training step.

carmocca (Contributor) commented:

I'd suggest we think about extending val_check_interval to work with the global step instead, after v1.6, since this issue has come up a lot in the past.

Yes. The proposal is being tracked in #8135 (comment) (just linked this issue too)

carmocca added this to the 1.7 milestone Feb 22, 2022
rohitgr7 (Contributor) commented:

someone's already working on it 🚀
#11993

ShaneTian (Author) commented:

Any updates on this issue? This feature is really useful for scenarios in which the training set is small or validation is expensive.

carmocca (Contributor) commented:

Support has been added with #11993
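
If I read the PR correctly, on versions that include #11993 the per-epoch constraint can be lifted by passing check_val_every_n_epoch=None, so that an integer val_check_interval is counted across epochs. A sketch of that usage (verify against the docs for your installed version):

import pytorch_lightning as pl

# Sketch of the post-#11993 usage (double-check against your version's docs):
# with check_val_every_n_epoch=None, an integer val_check_interval is interpreted
# as a number of training batches counted across epochs, not within one epoch.
trainer = pl.Trainer(
    max_epochs=200,
    val_check_interval=120,
    check_val_every_n_epoch=None,
)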

sivannavis commented:

Hi! I'm still facing this problem. I'm training on a really small dataset, so I only have 5 iterations per epoch, and I'm not running validation at all (I set limit_val_batches to 0).

I just want to save the trained model from the latest epoch, but it seems like nothing is saved unless I specifically set every_n_train_steps to be smaller than the number of iterations per epoch. Here's my callback:

ModelCheckpoint(
    dirpath=ckpt_path,
    filename="{epoch}--{train/loss_epoch:.2f}--{global_step:.0f}",
    monitor="global_step",
    mode="max",
    auto_insert_metric_name=True,
    save_top_k=3,
    save_last=True,
)
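
If the goal is just one checkpoint per training epoch, a step-based trigger may not be needed at all. A possible alternative configuration, shown only as a sketch (ckpt_path is assumed to be defined elsewhere in the script):

from pytorch_lightning.callbacks import ModelCheckpoint

# Sketch: save at the end of every training epoch instead of every N steps.
ckpt_callback = ModelCheckpoint(
    dirpath=ckpt_path,              # assumed to be defined elsewhere
    filename="{epoch}--{train/loss_epoch:.2f}",
    every_n_epochs=1,               # trigger once per epoch
    save_on_train_epoch_end=True,   # save on train epoch end, no validation required
    save_top_k=-1,                  # keep a checkpoint for every epoch
    save_last=True,
)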
