Support IterableDatasets for validation and test, not just train set [blocked by #953] #948

Darktex · 2020-02-26T03:28:15Z

🚀 Feature

Currently Lightning supports IterableDatasets only in the training set (see code). This makes them second-class citizens compared to map-style datasets, and supporting them for validation and test seems like low-hanging fruit.

Motivation

This enables larger test sets that may not fit into a machine's memory (they could be very large in production settings, or of modest size but running on a student's cheap laptop). Moreover, datasets are usually generated together (e.g. train, val, and test can come from the same process). That process will very likely produce all of them with the same signature, so you may end up with IterableDatasets even when their size does not strictly require it.

Pitch

Changing a few lines of code by bringing in the checks we are doing for training should be enough unless I'm missing something.

Additional context

Are there any gotchas that make this harder than it looks?

github-actions · 2020-02-26T03:28:53Z

Hey, thanks for your contribution! Great first issue!

williamFalcon · 2020-02-26T11:05:18Z

@Darktex this looks straightforward! I can’t think of any gotchas right now. The only thing would be if you don’t have the length of a dataset up front, but I think we’re refactoring to clear that up right now.

want to do a PR?

@ethanwharris @jeffling thoughts?

fyi @srush @luiscape

ethanwharris · 2020-02-26T12:13:25Z

It seems there's an opportunity to clean stuff up a bit here. Really the only check we need is to see if len(dataloader) raises an error. If it does, then check if number of steps to run is set elsewhere and throw a warning if not (i.e. if not set elsewhere this will just run forever). That way you could get rid of the check for whether IterableDataset exists and the dependence on DataLoader.dataset, solving several issues.
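The check described above can be sketched roughly as follows. This is a minimal illustration, not Lightning's actual code: `has_len`, `resolve_num_batches`, `max_steps`, and `StreamLoader` are hypothetical names, and it assumes that a loader without a known length raises `TypeError` (or `NotImplementedError`) when `len()` is called on it.

```python
import warnings


def has_len(loader):
    """Return True if len(loader) works, False if it raises.

    Hypothetical helper: it does not inspect loader.dataset or check
    for IterableDataset, it only probes len().
    """
    try:
        len(loader)
        return True
    except (TypeError, NotImplementedError):
        return False


def resolve_num_batches(loader, max_steps=None):
    """Decide how many batches to run (max_steps is an assumed setting)."""
    if has_len(loader):
        n = len(loader)
        return n if max_steps is None else min(n, max_steps)
    if max_steps is None:
        # Nothing bounds the loop, so it only ends if the loader ends.
        warnings.warn(
            "Loader has no length and no step limit is set; "
            "the loop may run forever."
        )
        return float("inf")
    return max_steps


class StreamLoader:
    """Stand-in for a loader over an iterable-style dataset (no __len__)."""

    def __iter__(self):
        return iter(range(3))


print(resolve_num_batches([1, 2, 3]))                  # sized loader
print(resolve_num_batches(StreamLoader(), max_steps=5))  # unsized, bounded
```

Probing `len()` directly keeps the check independent of which dataset class sits behind the loader, which is the cleanup being suggested.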

williamFalcon · 2020-02-26T12:16:30Z

maybe step 1 is to refactor the code to minimize the len(dataloader) calls? we likely only need them to:

- figure out when to do validation checks (percent into epoch)
- set the tqdm bar length
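To illustrate why the "percent into epoch" case depends on the length, here is a small sketch. The helper name `val_check_batch` is hypothetical, and the convention that a float means a fraction of the epoch while an int means "every N batches" is an assumption made for this example.

```python
def val_check_batch(num_train_batches, val_check_interval):
    """Return the number of training batches between validation runs.

    Assumed convention for this sketch:
      - int   -> validate every `val_check_interval` batches
      - float -> validate after that fraction of an epoch
    """
    if isinstance(val_check_interval, int):
        # An absolute interval works even when the length is unknown.
        return val_check_interval
    if num_train_batches == float("inf"):
        # A fraction of an unknown length cannot be computed.
        raise ValueError(
            "A fractional interval needs a loader with a known length; "
            "use an integer number of batches instead."
        )
    return max(1, int(num_train_batches * val_check_interval))


print(val_check_batch(100, 0.25))          # fraction of a known length
print(val_check_batch(float("inf"), 50))   # absolute interval, unknown length
```

Only the fractional form needs `len(dataloader)`; an absolute interval sidesteps the problem for loaders without a length.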

ethanwharris · 2020-02-26T12:18:39Z

Agreed. Then it would be easier to see where the IterableDataset stuff will fall over, and just do something different when len is not available.

williamFalcon · 2020-02-26T12:21:59Z

Ok, #953 is blocking this issue at the moment.

williamFalcon · 2020-03-07T00:00:12Z

@ethanwharris @Darktex i think 0.7.1 fixed this problem. Mind checking now?

ethanwharris · 2020-03-09T13:46:12Z

@williamFalcon Not quite, it still tries to call len on val / test dataloaders - will PR in a bit

williamFalcon · 2020-03-09T13:51:17Z

is the easier thing to wrap the len call in a try/except and set the length to inf if it raises?

then when the epoch ends, set the length when we know it?
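A rough sketch of that idea: start with an infinite length when `len()` fails, then record the real count once a full pass has finished. `EpochLengthTracker` and `StreamLoader` are hypothetical names used only for illustration, and the sketch assumes the loader yields the same number of batches on every pass.

```python
class EpochLengthTracker:
    """Track the number of batches per epoch, learning it if unknown."""

    def __init__(self, loader):
        self.loader = loader
        try:
            self.num_batches = len(loader)
        except TypeError:
            # Length not available up front: treat it as infinite for now.
            self.num_batches = float("inf")

    def run_epoch(self):
        seen = 0
        for _batch in self.loader:
            seen += 1  # a real loop would run the step on _batch here
        # The epoch ended, so the length is now known.
        self.num_batches = seen
        return seen


class StreamLoader:
    """Stand-in for a loader without __len__ that yields 4 batches."""

    def __iter__(self):
        return iter(["b0", "b1", "b2", "b3"])


tracker = EpochLengthTracker(StreamLoader())
print(tracker.num_batches)  # unknown before the first epoch
tracker.run_epoch()
print(tracker.num_batches)  # counted during the first epoch
```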

ethanwharris · 2020-03-09T13:53:34Z

Yeah, that's the plan - currently have the is_infinite_dataloader method which tries to call len and catches the exception, just need to get the TQDM stuff to not do total=float('inf') as that raises an error
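One way to avoid passing an infinite total to the progress bar is to map an unknown length to `None`, which tqdm accepts for its `total` argument (the bar then shows only a running count). The helper name `progress_bar_total` is hypothetical, and the sketch deliberately stops short of importing tqdm.

```python
import math


def progress_bar_total(num_batches):
    """Convert a batch count into a value suitable for tqdm's `total`.

    Returns None when the length is unknown or infinite, otherwise
    an int.
    """
    if num_batches is None or math.isinf(num_batches):
        return None
    return int(num_batches)


# Intended use (not executed here): tqdm(loader, total=progress_bar_total(n))
print(progress_bar_total(float("inf")))
print(progress_bar_total(10))
```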

ethanwharris · 2020-03-09T13:54:10Z

Not sure about setting the length once we know it - maybe in a separate PR?

Darktex added the "feature" (Is an improvement or enhancement) and "help wanted" (Open to be worked on) labels Feb 26, 2020

williamFalcon mentioned this issue Feb 26, 2020

refactor len(datasets) call. #953

Closed

williamFalcon changed the title ~~Support IterableDatasets for validation and test, not just train set~~ Support IterableDatasets for validation and test, not just train set [blocked by #953] Feb 26, 2020

ethanwharris self-assigned this Mar 9, 2020

ethanwharris mentioned this issue Mar 9, 2020

Add support for IterableDatasets everywhere #1104

Merged

williamFalcon closed this as completed in #1104 Mar 12, 2020
