
remove reset_train_val_dataloaders from Trainer and move data reloading logic to loop #9671

Merged

Conversation

ninginthecloud (Contributor)

What does this PR do?

Fixes #9502

This PR fixes the bug described in issue #9502: during the fit loop, train_dataloader is loaded twice when resuming from a checkpoint, which does not match users' expectations.
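For context, a minimal sketch of the reported behaviour (the dataset, model, and call counter below are hypothetical and only illustrate the issue, they are not taken from the original report):

    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl

    class RandomDataset(Dataset):
        def __len__(self):
            return 8

        def __getitem__(self, idx):
            return torch.randn(4)

    class CountingModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(4, 1)
            self.dataloader_calls = 0

        def training_step(self, batch, batch_idx):
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

        def train_dataloader(self):
            # before this PR, a run resumed from a checkpoint incremented this twice
            self.dataloader_calls += 1
            return DataLoader(RandomDataset(), batch_size=2)

    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(CountingModel())
    # resuming from a saved checkpoint previously triggered a second reload of the train dataloader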
As discussed in the closed PR #9614, we follow the second option to fix this issue:

  • Option 2: remove self.reset_train_val_dataloaders(model) from Trainer._run_train() and move the reload logic into the corresponding loops. Additionally,
  1. self.num_training_batches is now initialized to 0 in Trainer._setup_on_init(); previously it was initialized as self.num_training_batches = float('inf'),
  2. fit_loop.done and fit_loop.skip have been updated as shown below.
@property
def done(self) -> bool:
    """Evaluates when to leave the loop.

    Returns True if trainer.should_stop was set (e.g. by early stopping) or if the maximum number of steps or
    epochs is reached.
    """
    # `stop_steps`, `should_stop`, and `stop_epochs` are computed earlier in this property from
    # `max_steps`, `trainer.should_stop`, and `max_epochs` (omitted here for brevity); the new
    # final condition also leaves the loop when no training batches are available.
    return stop_steps or should_stop or stop_epochs or self.trainer.num_training_batches == 0

@property
def skip(self) -> bool:
    """Whether we should skip the training and immediately return from the call to :meth:`run`."""
    return self.done or self.trainer.limit_train_batches == 0
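For illustration only, here is a rough sketch of where the reload could live after this change; the hook choice and the exact call below are assumptions, not the actual diff in this PR:

    from pytorch_lightning.loops.base import Loop

    class FitLoop(Loop):  # excerpt; the real FitLoop implements the full Loop interface
        def on_run_start(self) -> None:
            # hypothetical placement: reset the train/val dataloaders from the loop itself
            # rather than from Trainer._run_train(), so a checkpoint resume only loads them once
            self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)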

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, check the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ninginthecloud ninginthecloud changed the title from "remove reset_train_val_dataloaders(model) from Trainer and move data reloading logic to loop" to "remove reset_train_val_dataloaders from Trainer and move data reloading logic to loop" Sep 23, 2021
@ninginthecloud ninginthecloud marked this pull request as ready for review September 23, 2021 18:16
codecov bot commented Sep 23, 2021

Codecov Report

Merging #9671 (36f5fe4) into master (83ce1bf) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #9671    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         179     180     +1     
  Lines       15810   15869    +59     
=======================================
- Hits        14635   14071   -564     
- Misses       1175    1798   +623     

carmocca (Contributor)

Marking this as a draft as a merge seems to have gone wrong, mark it as ready when it's ready again!

@carmocca carmocca marked this pull request as draft October 14, 2021 20:43
@ninginthecloud ninginthecloud force-pushed the fix/avoid_reload_dl_9502_2 branch from ffd2fdb to f65bd5e Compare October 14, 2021 22:35
@ninginthecloud ninginthecloud marked this pull request as ready for review October 14, 2021 23:54
ninginthecloud (Contributor, Author)

> Marking this as a draft as a merge seems to have gone wrong, mark it as ready when it's ready again!

I've rebased and brought this PR back to a clean state. Looking for another round of review, thank you~ 😃 cc: @carmocca, @tchaton

@awaelchli awaelchli added the "bug" (Something isn't working) and "data handling" (Generic data-related topic) labels Oct 16, 2021
@awaelchli awaelchli added this to the v1.5 milestone Oct 16, 2021
@mergify mergify bot removed the has conflicts label Oct 18, 2021
@mergify mergify bot added the "ready" (PRs ready to be merged) and "has conflicts" labels Oct 18, 2021
ninginthecloud (Contributor, Author)

Hi @carmocca, the value of self.trainer.num_training_batches is only set once train_dataloader has been loaded; before that it just holds its default value. Since we moved reset_train_val_dataloaders into the fit loop, skip can no longer be evaluated reliably from num_training_batches. I therefore replaced that check with limit_train_batches == 0, following the logic defined in data_loading.py (a short usage sketch follows).
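As a small illustration (this usage example is an assumption of mine, not taken from the PR), the new skip condition can be exercised directly from the Trainer arguments:

    import pytorch_lightning as pl

    # with limit_train_batches=0, fit_loop.skip evaluates to True and training returns immediately,
    # without relying on num_training_batches, which is only known after the dataloader is loaded
    trainer = pl.Trainer(limit_train_batches=0, max_epochs=1)
    # trainer.fit(model)  # the fit loop is skipped right away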

@mergify mergify bot removed the has conflicts label Oct 19, 2021
@tchaton tchaton (Contributor) left a comment


LGTM !


ninginthecloud commented Oct 19, 2021

There are a lot of test failures due to commit 720288e; let me update them.

@carmocca carmocca merged commit 0b68f2a into Lightning-AI:master Oct 19, 2021
Labels
bug (Something isn't working) · data handling (Generic data-related topic) · ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataloader is reloaded twice after resuming from checkpoint
5 participants