
Dataloader is reloaded twice after resuming from checkpoint #9502

Closed
ninginthecloud opened this issue Sep 13, 2021 · 1 comment · Fixed by #9671
Labels: bug (Something isn't working) · checkpointing (Related to checkpointing) · data handling (Generic data-related topic) · help wanted (Open to be worked on) · let's do it! (approved to implement)

Comments

ninginthecloud (Contributor) commented Sep 13, 2021

🚀 Feature

Motivation

PyTorch Lightning reloads the dataloaders twice after resuming from a checkpoint:
the first reload happens before the train loop starts [link];
the second reload is triggered by the default value of reload_dataloaders_every_n_epochs [link].
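
Below is a minimal sketch to observe the extra reload. It assumes the PyTorch Lightning 1.4-era API (where resume_from_checkpoint is still a Trainer argument); the CountingModel and RandomDataset names are invented purely for illustration, so exact arguments may need adjusting for other versions:

```python
# Sketch: count how often train_dataloader is called when resuming from a
# checkpoint (assumes PyTorch Lightning ~1.4; API details may differ).
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 32

    def __getitem__(self, idx):
        return torch.randn(4)


class CountingModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.reload_count = 0

    def train_dataloader(self):
        # Each call corresponds to one dataloader reload.
        self.reload_count += 1
        print(f"train_dataloader called {self.reload_count} time(s)")
        return DataLoader(RandomDataset(), batch_size=8)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# First run: produces a checkpoint.
model = CountingModel()
trainer = pl.Trainer(max_epochs=1, reload_dataloaders_every_n_epochs=1,
                     default_root_dir="lightning_logs")
trainer.fit(model)
ckpt = trainer.checkpoint_callback.best_model_path

# Resumed run: the dataloader is reloaded twice before training continues.
model = CountingModel()
trainer = pl.Trainer(max_epochs=2, reload_dataloaders_every_n_epochs=1,
                     resume_from_checkpoint=ckpt)
trainer.fit(model)
```
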
This behavior could be problematic:

  • It can confuse users: reload_dataloaders_every_n_epochs = 1 is expected to reload the dataloaders once every n epochs, yet they sometimes reload multiple times.
  • The reload behavior is inconsistent between resuming from a checkpoint and a fresh training start. This can break the internal state of the dataloaders and forces users to track global state themselves.
  • It may prevent fault-tolerant training from correctly capturing the state of the dataloader.

Pitch

Let's remove self.reset_train_val_dataloaders(model) in _run_train() [link].
Since reset_*_dataloader() is already called in the fit loop, there is no need to call it again in the trainer.
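
For reference, a rough sketch of what the pitch would look like in pytorch_lightning/trainer/trainer.py; the surrounding code of _run_train() is elided and the exact context may differ between versions:

```python
# Sketch only: surrounding _run_train() code elided, exact context may differ.
def _run_train(self) -> None:
    ...
    # self.reset_train_val_dataloaders(model)  # <- proposed removal: the fit
    #                                          #    loop already calls
    #                                          #    reset_*_dataloader(), so this
    #                                          #    extra call triggers the
    #                                          #    second reload on resume
    ...
```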

Alternatives

N/A

Additional context

cc: @awaelchli @ananthsub



ninginthecloud added the bug, help wanted, data handling, and checkpointing labels on Sep 13, 2021
tchaton (Contributor) commented Sep 14, 2021

Hey @ninginthecloud,

Nice catch. I believe this logic could definitely benefit from a code-health refactor. Feel free to add it to the sprint if you plan to work on it.

Best,
T.C
