
Dataloader is reloaded twice after resuming from checkpoint #9502

Closed
ninginthecloud opened this issue Sep 13, 2021 · 1 comment · Fixed by #9671
Labels: bug (Something isn't working) · checkpointing (Related to checkpointing) · data handling (Generic data-related topic) · help wanted (Open to be worked on) · let's do it! (approved to implement)

Comments

ninginthecloud (Contributor) commented Sep 13, 2021

🚀 Feature

Motivation

PyTorch Lightning reloads the dataloaders twice after resuming from a checkpoint:
the first reload happens before the train loop starts [link];
the second reload is triggered by the default value of reload_dataloaders_every_n_epochs [link].
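
Below is a minimal sketch to observe the extra reload. It assumes the PyTorch Lightning 1.4-era API (where resume_from_checkpoint is still a Trainer argument); the CountingModel and RandomDataset names are invented purely for illustration, so exact arguments may need adjusting for other versions:

```python
# Sketch: count how often train_dataloader is called when resuming from a
# checkpoint (assumes PyTorch Lightning ~1.4; API details may differ).
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 32

    def __getitem__(self, idx):
        return torch.randn(4)


class CountingModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.reload_count = 0

    def train_dataloader(self):
        # Each call corresponds to one dataloader reload.
        self.reload_count += 1
        print(f"train_dataloader called {self.reload_count} time(s)")
        return DataLoader(RandomDataset(), batch_size=8)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# First run: produces a checkpoint.
model = CountingModel()
trainer = pl.Trainer(max_epochs=1, reload_dataloaders_every_n_epochs=1,
                     default_root_dir="lightning_logs")
trainer.fit(model)
ckpt = trainer.checkpoint_callback.best_model_path

# Resumed run: the dataloader is reloaded twice before training continues.
model = CountingModel()
trainer = pl.Trainer(max_epochs=2, reload_dataloaders_every_n_epochs=1,
                     resume_from_checkpoint=ckpt)
trainer.fit(model)
```
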
This behavior could be problematic:

  • It can confuse users: reload_dataloaders_every_n_epochs = 1 is expected to reload the dataloaders once every n epochs, yet they sometimes reload multiple times.
  • The reload behavior is inconsistent between resuming from a checkpoint and a fresh training start. This can break the internal state of the dataloaders and forces users to track global state themselves.
  • It may prevent fault-tolerant training from correctly capturing the state of the dataloader.

Pitch

Let's remove self.reset_train_val_dataloaders(model) in _run_train() [link].
Since reset_*_dataloader() is already called in the fit loop, there is no need to call it again in the trainer.
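
For reference, a rough sketch of what the pitch would look like in pytorch_lightning/trainer/trainer.py; the surrounding code of _run_train() is elided and the exact context may differ between versions:

```python
# Sketch only: surrounding _run_train() code elided, exact context may differ.
def _run_train(self) -> None:
    ...
    # self.reset_train_val_dataloaders(model)  # <- proposed removal: the fit
    #                                          #    loop already calls
    #                                          #    reset_*_dataloader(), so this
    #                                          #    extra call triggers the
    #                                          #    second reload on resume
    ...
```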

Alternatives

N/A

Additional context

cc: @awaelchli @ananthsub



ninginthecloud added the bug, help wanted, data handling, and checkpointing labels on Sep 13, 2021
tchaton (Contributor) commented Sep 14, 2021

Hey @ninginthecloud,

Nice catch. I believe this logic could definitely benefit from a code-health refactor. Feel free to add it to the sprint if you plan to work on it.

Best,
T.C
