🐛 Bug
Inside the training loop, we incorrectly skip running evaluation when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0. With these settings, we defer setting the validation dataloader on the trainer until the evaluation loop is run from inside the training loop. However, this is too late: the training loop depends on the validation dataloader settings already being set in order to even determine whether to run the evaluation loop at all.
This means it's possible to have these states set inside the training loop when determining whether to run the evaluation loop:
should_skip_eval=True when self.trainer.num_val_batches isn't set; in this instance trainer.num_val_batches=[]: https://github.com/PyTorchLightning/pytorch-lightning/blob/44d775fccfb825561937f6fa03fe258af25c2b83/pytorch_lightning/trainer/training_loop.py#L551
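A hypothetical sketch of that state (this is not the actual Lightning source; the variable names just mirror the trainer flags mentioned in this issue, and the check on num_val_batches is simplified for illustration):

```python
# Hypothetical illustration of the inconsistent flags described above.
num_val_batches = []        # val dataloader reload was deferred, so nothing is set yet
disable_validation = False  # the user did not disable validation


def should_skip_evaluation(num_val_batches):
    # An empty list sums to 0, which looks identical to "there are no val batches".
    return sum(num_val_batches) == 0


should_skip_eval = should_skip_evaluation(num_val_batches)   # True
should_train_only = disable_validation or should_skip_eval   # True
should_check_val = True  # e.g. the validation check interval says a check is due now

# should_train_only wins, so run_evaluation() is never reached even though
# should_check_val says a validation check is due: the two flags disagree.
print(should_skip_eval, should_train_only, should_check_val)
```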
This points out that should_check_val and should_train_only were not consistent with each other :(
#6075 changed the order in which we call run_evaluation inside the training loop. Before, this was covered up by luck because of the ordering; after the swap there, it has been broken.
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1z9ln3gYBK-VGidNPdUE2UgE0ISAgjLpu?usp=sharing
To Reproduce
Use the following BoringModel and post here.
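A minimal repro sketch along these lines; the module below is a stripped-down BoringModel-style LightningModule written for illustration (layer sizes, dataset, and max_epochs are arbitrary), assuming a Lightning version from around the time of this issue where reload_dataloaders_every_epoch is still a Trainer argument:

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def validation_step(self, batch, batch_idx):
        # This metric is never produced if the evaluation loop is skipped.
        self.log("val_loss", self(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def val_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)


if __name__ == "__main__":
    # With these two flags the validation dataloader is never loaded before the
    # "should we run evaluation?" check, so validation is silently skipped.
    trainer = pl.Trainer(
        max_epochs=2,
        reload_dataloaders_every_epoch=True,
        num_sanity_val_steps=0,
    )
    trainer.fit(BoringModel())
```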
Expected behavior
Checkpointing should still work as expected, because the evaluation loop is run when it is supposed to be.
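For context on why this matters for checkpointing, here is a rough example setup (the monitor key and callback arguments are only illustrative, and "val_loss" is assumed to be logged from validation_step): a checkpoint callback that monitors a validation metric never gets a value to act on if the evaluation loop is skipped, so no "best" checkpoint is saved.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Monitors a metric that only exists when the evaluation loop actually runs.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=1)

trainer = pl.Trainer(
    max_epochs=2,
    reload_dataloaders_every_epoch=True,
    num_sanity_val_steps=0,
    callbacks=[checkpoint_cb],
)
```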
Environment
Note:
Bugs with code are solved faster!
Colab Notebook should be made public!
IDE: Please, use our python bug_report_model.py template.
Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
PyTorch Version (e.g., 1.0):
OS (e.g., Linux):
How you installed PyTorch (conda, pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:
Additional context