Using reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 can lead to the validation loop being skipped #7208

Closed · ananthsub opened this issue Apr 26, 2021 · 0 comments · Fixed by #7207
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments


ananthsub commented Apr 26, 2021

🐛 Bug

Inside the training loop, we incorrectly skip running evaluation when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0. With these settings, we defer setting the validation dataloader on the trainer until the evaluation loop is run from inside the training loop. However, this is too late: the training loop needs the validation dataloader settings to already be populated in order to determine whether to run the evaluation loop at all.

This means it's possible to have these states set inside of the training loop when determining whether to run the evaluation loop:

is_last_batch=True
should_check_val=True
num_val_batches=[]
should_skip_eval=True
disable_validation=False
should_train_only=True

should_skip_eval=True occurs when self.trainer.num_val_batches has not been populated yet: at this point trainer.num_val_batches == []. See:
https://github.com/PyTorchLightning/pytorch-lightning/blob/44d775fccfb825561937f6fa03fe258af25c2b83/pytorch_lightning/trainer/training_loop.py#L551

This shows that should_check_val and should_train_only are not consistent with each other :(
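
For illustration, here is a rough sketch of how these flags can end up contradicting each other. The names mirror the flags listed above, but the function and its structure are hypothetical, not the actual Lightning source:

```python
# Hypothetical sketch of the end-of-batch decision inside the training loop.
# The names mirror the flags listed above; this is NOT the actual Lightning code.

def should_run_evaluation(trainer, is_last_batch: bool) -> bool:
    # num_val_batches is only populated once the val dataloaders are loaded.
    # With reload_dataloaders_every_epoch=True and num_sanity_val_steps=0,
    # loading is deferred until run_evaluation itself, so at this point
    # the list is still empty.
    num_val_batches = trainer.num_val_batches                    # == []

    should_check_val = is_last_batch                              # True: end of epoch
    should_skip_eval = sum(num_val_batches) == 0                  # True: [] sums to 0
    disable_validation = False                                    # validation is configured
    should_train_only = disable_validation or should_skip_eval    # True

    # should_check_val says "run validation", should_train_only says "don't".
    # The training loop trusts should_train_only and skips run_evaluation.
    return should_check_val and not should_train_only             # False
```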

#6075 changed the order in which we call run_evaluation inside the training loop. Before that change, this bug was masked by the old ordering; since the swap, validation can be skipped entirely.

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1z9ln3gYBK-VGidNPdUE2UgE0ISAgjLpu?usp=sharing

To Reproduce

Use the following BoringModel and post here.
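
To make the Colab easier to follow, here is a minimal sketch of a reproduction. The RandomDataset and BoringModel below are inline stand-ins for the notebook's versions, not the exact Colab code:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    """Stand-in dataset returning random tensors."""

    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    """Minimal LightningModule mirroring the BoringModel used in the Colab."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return {"loss": self(batch).sum()}

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def val_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)


if __name__ == "__main__":
    model = BoringModel()
    trainer = Trainer(
        max_epochs=1,
        # The two settings from the title: reload dataloaders every epoch and
        # skip the sanity check, so the val dataloader is never loaded before
        # the training loop decides whether to run evaluation.
        reload_dataloaders_every_epoch=True,
        num_sanity_val_steps=0,
    )
    trainer.fit(model)
    # Expected: validation_step runs at the end of the epoch.
    # Observed with this bug: the evaluation loop is skipped entirely.
```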

Expected behavior

The evaluation loop should run at the end of the epoch as configured, so checkpointing should still work as expected.

Environment

Note: Bugs with code are solved faster! Colab notebooks should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

ananthsub added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 26, 2021