[bugfix] Fix dataloading for iterable datasets and limit_train_batches #7306
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #7306     +/-  ##
=========================================
+ Coverage      86%      87%      +1%
=========================================
  Files         200      200
  Lines       12865    12875      +10
=========================================
+ Hits        11119    11242     +123
+ Misses       1746     1633     -113
LGTM, just minor comments for improvement.
# val_check_batch is inf for iterable datasets with no length defined
# TODO: let training/eval loop handle logic around limit_*_batches and val_check_batch
is_val_check_batch = False
if isinstance(self.trainer.limit_train_batches, int) and self.trainer.val_check_batch == float('inf'):
Love this refactor!
@kaushikb11 thanks! It still feels complicated to me. Part of that comes from limit_train_batches / val_check_interval having different types and possible meanings, depending on both the user input and the dataloader specified.
I'm wondering what's a better way to split "when to stop training mid-epoch" from "when to run validation", or whether a split is needed at all.
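For readers following the thread, here is a rough sketch (not code from this PR) of how the meaning of these two Trainer arguments shifts with their type:

```python
from pytorch_lightning import Trainer

# Floats are treated as fractions: use 50% of the training batches per epoch
# and run validation 4 times per training epoch.
trainer = Trainer(limit_train_batches=0.5, val_check_interval=0.25)

# Ints are treated as absolute batch counts: cap each epoch at 100 training
# batches and run validation every 50 training batches.
trainer = Trainer(limit_train_batches=100, val_check_interval=50)
```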
@ananthsub There's a failing test other than the unrelated DeepSpeed test failure.
@kaushikb11 for some reason I cannot reproduce the test failure locally. But it's pointing out an issue from #6671: the train loop force-calling checkpointing on train end is bad, because we cannot guarantee the monitor value is present. @awaelchli @carmocca any suggestions on how to debug in this case, or why the failure isn't consistent across versions?
Hello @ananthsub! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-05-03 15:28:50 UTC
@@ -114,7 +114,7 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
     def _move_optimizer_state(self) -> None:
         """ Moves the state of the optimizers to the GPU if needed. """
         for opt in self.optimizers:
-            state = defaultdict(dict)
+            state: DefaultDict = defaultdict(dict)
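For context, a minimal sketch of what the annotated line relies on (the imports are shown here for completeness and are not part of the diff above):

```python
from collections import defaultdict
from typing import DefaultDict

# Some type checkers cannot infer a useful type for a bare defaultdict(dict);
# the explicit annotation avoids that error without changing runtime behavior.
state: DefaultDict = defaultdict(dict)
```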
Sent this out as a different PR here: #7318, but wanted it here to ensure tests were passing.
Awesome! LGTM 😃
LGTM!
[bugfix] Fix dataloading for iterable datasets and limit_train_batches (Lightning-AI#7306)
* bugfix-dataloading
* rm-logs
* Update CHANGELOG.md
* Update test_dataloaders.py
* Update test_dataloaders.py
* Update training_loop.py
* Update test_dataloaders.py
* Update CHANGELOG.md
* Update CHANGELOG.md
* Update test_dataloaders.py
* Update training_loop.py
* Update training_loop.py
* comments
* address comments
* more tests
* Update progress.py
* Update test_dataloaders.py
* Update test_dataloaders.py
* Update training_loop.py
* Update training_loop.py
* test ckpt fix?
* update again
What does this PR do?
Fixes #7303
Fixes #6332
The check in the training loop for whether to run validation was not accounting for val_check_batch being inf, which occurs for iterable-style datasets that don't have len defined: https://github.com/PyTorchLightning/pytorch-lightning/blob/490cc57809ebeba19003b4101393a8a058217c31/pytorch_lightning/trainer/data_loading.py#L288-L290
I think the dataloading logic is too complicated here. At the very least we need to consolidate logic across the dataloading mixin, debug connector, and training loop.
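A rough reproduction sketch of the scenario being fixed (the dataset and variable names are illustrative, not taken from the linked issues): an iterable-style dataset with no __len__ leaves the dataloader length unknown, so val_check_batch is set to float('inf'), and an integer limit_train_batches previously never stopped the epoch.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset
from pytorch_lightning import Trainer


class RandomIterableDataset(IterableDataset):
    """Iterable-style dataset that intentionally defines no __len__."""

    def __iter__(self):
        for _ in range(1000):
            yield torch.randn(32)


train_loader = DataLoader(RandomIterableDataset(), batch_size=4)

# The dataloader length is unknown, so val_check_batch becomes inf; with this
# fix, training still stops after limit_train_batches batches.
trainer = Trainer(max_epochs=1, limit_train_batches=10)
# trainer.fit(model, train_loader)  # `model` is any LightningModule, defined elsewhere
```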
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃