
remove reset_train_val_dataloaders from Trainer and move data reloading logic to loop #9671

Merged

Conversation

ninginthecloud (Contributor)

What does this PR do?

Fixes #9502

This PR fixes the bug described in issue #9502: during the fit loop, train_dataloader is loaded twice when resuming from a checkpoint, which does not match users' expectations.
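For context, a minimal sketch of the reported behaviour (the dataset, model, and call counter below are hypothetical and only illustrate the issue, they are not taken from the original report):

    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl

    class RandomDataset(Dataset):
        def __len__(self):
            return 8

        def __getitem__(self, idx):
            return torch.randn(4)

    class CountingModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(4, 1)
            self.dataloader_calls = 0

        def training_step(self, batch, batch_idx):
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

        def train_dataloader(self):
            # before this PR, a run resumed from a checkpoint incremented this twice
            self.dataloader_calls += 1
            return DataLoader(RandomDataset(), batch_size=2)

    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(CountingModel())
    # resuming from a saved checkpoint previously triggered a second reload of the train dataloader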
As discussed in the closed PR #9614, we follow the second option to fix this issue:

  • Option 2: remove self.reset_train_val_dataloaders(model) from Trainer._run_train() and move the reload logic into the corresponding loops. Additionally,
  1. self.num_training_batches is now initialized to 0 in Trainer._setup_on_init(); previously it was initialized as self.num_training_batches = float('inf'),
  2. fit_loop.done and fit_loop.skip have been updated as shown below.
@property
def done(self) -> bool:
    """Evaluates when to leave the loop.

    Returns True if trainer.should_stop was set (e.g. by early stopping) or if the maximum number of steps or
    epochs is reached.
    """
    # `stop_steps`, `should_stop`, and `stop_epochs` are computed earlier in this property from
    # `max_steps`, `trainer.should_stop`, and `max_epochs` (omitted here for brevity); the new
    # final condition also leaves the loop when no training batches are available.
    return stop_steps or should_stop or stop_epochs or self.trainer.num_training_batches == 0

@property
def skip(self) -> bool:
    """Whether we should skip the training and immediately return from the call to :meth:`run`."""
    return self.done or self.trainer.limit_train_batches == 0
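For illustration only, here is a rough sketch of where the reload could live after this change; the hook choice and the exact call below are assumptions, not the actual diff in this PR:

    from pytorch_lightning.loops.base import Loop

    class FitLoop(Loop):  # excerpt; the real FitLoop implements the full Loop interface
        def on_run_start(self) -> None:
            # hypothetical placement: reset the train/val dataloaders from the loop itself
            # rather than from Trainer._run_train(), so a checkpoint resume only loads them once
            self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)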

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, check the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ninginthecloud ninginthecloud changed the title from "remove reset_train_val_dataloaders(model) from Trainer and move data reloading logic to loop" to "remove reset_train_val_dataloaders from Trainer and move data reloading logic to loop" Sep 23, 2021
@ninginthecloud ninginthecloud marked this pull request as ready for review September 23, 2021 18:16
codecov bot commented Sep 23, 2021

Codecov Report

Merging #9671 (36f5fe4) into master (83ce1bf) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #9671    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         179     180     +1     
  Lines       15810   15869    +59     
=======================================
- Hits        14635   14071   -564     
- Misses       1175    1798   +623     

carmocca (Contributor)

Marking this as a draft as a merge seems to have gone wrong, mark it as ready when it's ready again!

@carmocca carmocca marked this pull request as draft October 14, 2021 20:43
@ninginthecloud ninginthecloud force-pushed the fix/avoid_reload_dl_9502_2 branch from ffd2fdb to f65bd5e Compare October 14, 2021 22:35
@ninginthecloud ninginthecloud marked this pull request as ready for review October 14, 2021 23:54
ninginthecloud (Contributor, Author)

> Marking this as a draft as a merge seems to have gone wrong, mark it as ready when it's ready again!

I've rebased and brought this PR back to a clean state. Looking for another round of review, thank you~ 😃 cc: @carmocca, @tchaton

@awaelchli awaelchli added the "bug" (Something isn't working) and "data handling" (Generic data-related topic) labels Oct 16, 2021
@awaelchli awaelchli added this to the v1.5 milestone Oct 16, 2021
@mergify mergify bot removed the has conflicts label Oct 18, 2021
@mergify mergify bot added the "ready" (PRs ready to be merged) and "has conflicts" labels Oct 18, 2021
ninginthecloud (Contributor, Author)

Hi @carmocca, the value of self.trainer.num_training_batches is only set once train_dataloader has been loaded; before that it just holds its default value. Since we moved reset_train_val_dataloaders into the fit loop, skip can no longer be evaluated reliably from num_training_batches. I therefore replaced that check with limit_train_batches == 0, following the logic defined in data_loading.py (a short usage sketch follows).
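As a small illustration (this usage example is an assumption of mine, not taken from the PR), the new skip condition can be exercised directly from the Trainer arguments:

    import pytorch_lightning as pl

    # with limit_train_batches=0, fit_loop.skip evaluates to True and training returns immediately,
    # without relying on num_training_batches, which is only known after the dataloader is loaded
    trainer = pl.Trainer(limit_train_batches=0, max_epochs=1)
    # trainer.fit(model)  # the fit loop is skipped right away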

@mergify mergify bot removed the has conflicts label Oct 19, 2021
@tchaton tchaton (Contributor) left a comment


LGTM !


ninginthecloud commented Oct 19, 2021

There are a lot of test failures due to commit 720288e; let me update them.

@carmocca carmocca merged commit 0b68f2a into Lightning-AI:master Oct 19, 2021
Labels
bug (Something isn't working) · data handling (Generic data-related topic) · ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataloader is reloaded twice after resuming from checkpoint
5 participants