Support IterableDatasets for validation and test, not just train set [blocked by #953] #948
Comments
Hey, thanks for your contribution! Great first issue! |
@Darktex this looks straightforward! I can’t think of any gotchas right now. The only issue would be if you don’t have the length of a dataset up front, but I think we’re refactoring to clear that up right now. Want to do a PR? @ethanwharris @jeffling thoughts? |
It seems there's an opportunity to clean stuff up a bit here. Really the only check we need is to see if |
maybe step 1 is to refactor the code to minimize the len(dataloader) calls? we likely only need them to:
|
Agreed. Then it would be easier to see where the |
Ok, #953 is blocking this issue at the moment. |
@ethanwharris @Darktex i think 0.7.1 fixed this problem. Mind checking now? |
@williamFalcon Not quite, it still tries to call len on val / test dataloaders - will PR in a bit |
Is the easier thing to try/catch the len exception and set the length to inf if caught? Then, when the epoch ends, set the length once we know it? |
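A minimal sketch of the fallback suggested above, not Lightning's actual implementation. `StreamLoader`, `num_batches_or_inf`, and `run_epoch` are hypothetical names used only for illustration. It assumes the loader raises `TypeError` (what Python's built-in `len()` raises for an object without `__len__`) or `NotImplementedError` when its length is unknown.

```python
import math


class StreamLoader:
    """Hypothetical stand-in for a dataloader over an iterable-style dataset.

    It can be iterated, but it deliberately defines no __len__.
    """

    def __init__(self, n_batches):
        self._n_batches = n_batches

    def __iter__(self):
        return iter(range(self._n_batches))


def num_batches_or_inf(loader):
    """Return len(loader), or math.inf when the length is not available."""
    try:
        return len(loader)
    except (TypeError, NotImplementedError):
        # Assumption: an unknown length surfaces as one of these exceptions.
        return math.inf


def run_epoch(loader):
    """Iterate once; return (length assumed up front, batches actually seen)."""
    expected = num_batches_or_inf(loader)
    seen = 0
    for _batch in loader:
        seen += 1
    # After the epoch, `seen` is the true length and could replace `expected`.
    return expected, seen
```

With this shape, any code that compares a batch index against the number of batches keeps working, since every finite index is smaller than `math.inf`. Code that does arithmetic on the length (for example, computing a percentage) would still need its own guard.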
Yeah, that's the plan - currently have the |
Not sure about setting the length once we know it - maybe in a separate PR? |
🚀 Feature
Currently Lightning supports IterableDatasets only in the training set (see code). This makes them second-class citizens compared to the map-style datasets, and supporting them seems a low hanging fruit.
Motivation
This enables larger test sets that may not fit into a machine's memory (they could be very large in production settings, or of modest size but running on a student's cheap laptop). Moreover, datasets are usually generated together (e.g. train, val, and test can come from the same process). That process very likely has the same signature for every split, so you may end up with IterableDatasets even when their size does not strictly require it.
Pitch
Changing a few lines of code by bringing in the checks we are doing for training should be enough unless I'm missing something.
Additional context
Are there any gotchas that make this harder than it looks?