
ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch #11979

Closed
ShaneTian opened this issue Feb 18, 2022 · 8 comments
Labels: bug (Something isn't working), callback: model checkpoint

ShaneTian commented Feb 18, 2022

🐛 Bug

ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch.
PyTorch Lightning does NOT run the validation loop if val_check_interval is greater than the number of training steps in an epoch.

To Reproduce

In my experiment, the number of training steps in an epoch is about 110.

  • If I set every_n_train_steps and val_check_interval to 100, ModelCheckpoint and the validation loop work as expected.
log_steps = 100
valid_steps = 100
save_steps = 100

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=100
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps
)
  • If I set every_n_train_steps and val_check_interval to 120, both fail: ModelCheckpoint does not save anything and PyTorch Lightning never runs the validation loop.
log_steps = 120
valid_steps = 120
save_steps = 120

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=120
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps
)

Expected behavior

Is this expected? Can't I set these two parameters to values larger than the number of training steps in an epoch?

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.5.10
  • PyTorch Version (e.g., 1.10): 1.10.2+cu113
  • Python version (e.g., 3.9): 3.7.10
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: CUDA V11.0.221
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip

cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

ShaneTian added the bug (Something isn't working) label Feb 18, 2022
carmocca (Contributor) commented:

@ShaneTian Is your expectation that ModelCheckpoint would run in the 10th batch of the second epoch given your example?

These values are always assumed to be constrained within the number of batches. We could raise a warning if that's not the case.
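
For illustration, such a warning could amount to a simple comparison between the configured interval and the number of training batches per epoch. This is only a minimal sketch with placeholder names (warn_if_interval_never_triggers, interval, num_training_batches), not actual Trainer code:

import warnings

def warn_if_interval_never_triggers(interval: int, num_training_batches: int) -> None:
    # Sketch only: a per-epoch interval larger than the epoch length can never fire.
    if interval > num_training_batches:
        warnings.warn(
            f"Interval {interval} is larger than the number of training batches "
            f"per epoch ({num_training_batches}); it will never trigger within an epoch."
        )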

rohitgr7 (Contributor) commented:

val_check_interval is constrained to the number of batches within an epoch, but I think every_n_train_steps of ModelCheckpoint isn't.
https://github.com/PyTorchLightning/pytorch-lightning/blob/9b0942d7315390ba01ee25ff90587e40435b5a8d/pytorch_lightning/callbacks/model_checkpoint.py#L267-L279

I'd suggest we think about extending val_check_interval to work with the global step instead, after v1.6, since this issue has come up a lot in the past.
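
For context, the every_n_train_steps check in the linked lines essentially reduces to a modulo test against trainer.global_step. A simplified paraphrase (the function name is just for illustration, not the exact library source):

def should_skip_step_checkpoint(global_step: int, every_n_train_steps: int) -> bool:
    # Simplified paraphrase of the linked ModelCheckpoint logic: skip saving unless
    # step-based checkpointing is enabled and the current global step is an exact
    # multiple of the configured interval.
    return every_n_train_steps < 1 or global_step % every_n_train_steps != 0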

ShaneTian (Author) commented:

@ShaneTian Is your expectation that ModelCheckpoint would run in the 10th batch of the second epoch given your example?

These values are always assumed to be constrained within the number of batches. We could raise a warning if that's not the case.

Yes, intuitively, it should be a global training step.

carmocca (Contributor) commented:

I'd suggest we think about extending val_check_interval to work with the global step instead, after v1.6, since this issue has come up a lot in the past.

Yes. The proposal is being tracked in #8135 (comment) (just linked this issue too)

carmocca added this to the 1.7 milestone Feb 22, 2022
rohitgr7 (Contributor) commented:

someone's already working on it 🚀
#11993

ShaneTian (Author) commented:

Any updates on this issue? This feature is really useful for scenarios in which the training set is small or validation is expensive.

carmocca (Contributor) commented:

Support has been added with #11993
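
If I read the PR correctly, on versions that include #11993 the per-epoch constraint can be lifted by passing check_val_every_n_epoch=None, so that an integer val_check_interval is counted across epochs. A sketch of that usage (verify against the docs for your installed version):

import pytorch_lightning as pl

# Sketch of the post-#11993 usage (double-check against your version's docs):
# with check_val_every_n_epoch=None, an integer val_check_interval is interpreted
# as a number of training batches counted across epochs, not within one epoch.
trainer = pl.Trainer(
    max_epochs=200,
    val_check_interval=120,
    check_val_every_n_epoch=None,
)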

sivannavis commented:

Hi! I'm still facing this problem. I'm training on a really small dataset, so I only have 5 iterations per epoch, and I'm not running validation at all (I set limit_val_batches to 0).

I just want to save the trained model from the latest epoch, but it seems like nothing is saved unless I specifically set every_n_train_steps to be smaller than the number of iterations per epoch. Here's my callback:

ModelCheckpoint(
    dirpath=ckpt_path,
    filename="{epoch}--{train/loss_epoch:.2f}--{global_step:.0f}",
    monitor="global_step",
    mode="max",
    auto_insert_metric_name=True,
    save_top_k=3,
    save_last=True,
)
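
If the goal is just one checkpoint per training epoch, a step-based trigger may not be needed at all. A possible alternative configuration, shown only as a sketch (ckpt_path is assumed to be defined elsewhere in the script):

from pytorch_lightning.callbacks import ModelCheckpoint

# Sketch: save at the end of every training epoch instead of every N steps.
ckpt_callback = ModelCheckpoint(
    dirpath=ckpt_path,              # assumed to be defined elsewhere
    filename="{epoch}--{train/loss_epoch:.2f}",
    every_n_epochs=1,               # trigger once per epoch
    save_on_train_epoch_end=True,   # save on train epoch end, no validation required
    save_top_k=-1,                  # keep a checkpoint for every epoch
    save_last=True,
)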
