Fix support for dataloader with None batches #7342
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #7342     +/-   ##
=========================================
- Coverage      91%     86%       -5%
=========================================
  Files         200     200
  Lines       12916   13210     +294
=========================================
- Hits        11764   11317     -447
- Misses       1152    1893     +741
LGTM!
self.warning_cache.warn("train_dataloader yielded None. If this was on purpose, ignore this warning...")
return AttributeDict(
    signal=0,
    grad_norm_dic=grad_norm_dic,
    training_step_output_for_epoch_end=batch_outputs,
)
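For context, a batch typically ends up as None when, for example, a custom collate_fn drops every sample in the batch. A minimal, hypothetical sketch (not code from this PR):

import torch
from torch.utils.data import DataLoader, Dataset

class MaybeEmptyDataset(Dataset):
    # toy dataset where some samples are unusable (hypothetical example)
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return None if idx < 2 else torch.tensor([float(idx)])

def skip_none_collate(samples):
    # drop unusable samples; if nothing is left, the whole batch becomes None
    samples = [s for s in samples if s is not None]
    return torch.stack(samples) if samples else None

# the first batch (indices 0 and 1) collates to None and should be skipped
loader = DataLoader(MaybeEmptyDataset(), batch_size=2, collate_fn=skip_none_collate)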
what happens in distributed training? won't we go out of sync across ranks if the batch is None on some ranks and non-None on others?
I'm not sure. It will be closer to working than it was before, but this use case may still need some work for distributed. The new behaviour (that is, after this fix) should be the same as when returning None from a step (rough sketch below).
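For reference, "returning None from a step" looks roughly like this on the user side. A minimal sketch under that assumption, not code from this PR:

import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        if batch is None:
            # skip this batch: no loss, no backward, no optimizer step
            return None
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)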
There are two options:
- either the trainer runs an all-reduce at every step to determine whether all ranks should skip. This is a general solution, but it wastes performance for users who never skip batches
- or the user runs the all-reduce in their LightningModule and decides when to skip. That way the cost is limited to the modules that actually need to skip batches or steps for any reason.
I prefer the latter approach because I don't want to slow down the trainer for everyone just to handle a few corner cases (rough sketch below).
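A rough sketch of the second option, as hypothetical user code (compute_loss is a placeholder, not part of this PR):

import torch
import torch.distributed as dist
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # each rank decides locally whether it wants to skip this batch
        skip = torch.tensor(float(batch is None), device=self.device)
        if dist.is_available() and dist.is_initialized():
            # if any rank wants to skip, all ranks skip, so they stay in sync
            dist.all_reduce(skip, op=dist.ReduceOp.MAX)
        if skip > 0:
            return None
        return self.compute_loss(batch)  # placeholder for the real loss computation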
What do you think @tchaton @SeanNaren @awaelchli @carmocca
Yeah, I think it's fine to make the user responsible for making all ranks skip in sync if they want distributed skipping
Returning None in any kind of distributed setting or with AMP is currently unsupported. We have this (very old) PR to add support, but it's currently frozen. You can have a look at the proposal, though.
What does this PR do?
Fixes #6753
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃