Init'ing Dataloader calls get_train_dataloader #922

Closed
srush opened this issue Feb 23, 2020 · 5 comments · Fixed by #926
Labels: bug, help wanted

Comments

srush (Contributor) commented Feb 23, 2020

It seems like the dataloader initialization code calls get_train_dataloader directly:

https://github.com/PyTorchLightning/pytorch-lightning/blob/c00a8a10dd32fa43a659a09b53e2dea3739c6d4e/pytorch_lightning/trainer/data_loading.py#L69

This means that all the effort of wrapping get_dataloader to sync through barriers for multi-GPU / TPU is not applied on this first call (and results in a crash).
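(For context, a minimal sketch of the kind of user-side wrapping that gets defeated; this is illustrative code, not taken from the issue. If the trainer invokes train_dataloader() during its own init, before the process group exists, the guarded barrier below is silently skipped, and without the guard the call simply crashes.)

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset


class MyModule:
    def train_dataloader(self):
        # Intended behaviour: only rank 0 does the expensive one-time
        # setup; the other ranks wait for it at the barrier.
        if dist.is_available() and dist.is_initialized():
            if dist.get_rank() == 0:
                self.prepare_data()  # hypothetical one-time download/preprocessing
            dist.barrier()
        dataset = TensorDataset(torch.randn(64, 8))
        return DataLoader(dataset, batch_size=8)

    def prepare_data(self):
        pass  # placeholder for a real download / preprocessing step
```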

srush added the bug and help wanted labels on Feb 23, 2020
williamFalcon (Contributor) commented:
Ummm, good point. I think we need to simplify the get_XXX_dataloader() calls. The lazy-loading decorator does make it harder for people to debug their data-loading issues.

@neggert do you remember the original reason we added the decorator? Maybe it's time to remove it and simplify this logic?
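(For reference, a lazy-loading decorator of the kind being discussed typically looks something like the sketch below; the name and details are illustrative, not the actual pytorch-lightning implementation.)

```python
import functools


def data_loader_cached(fn):
    """Hypothetical lazy-loading decorator: build the dataloader on first
    access and cache it on the instance afterwards."""
    attr_name = '_cached_' + fn.__name__

    @functools.wraps(fn)
    def wrapper(self):
        if not hasattr(self, attr_name):
            # Any error in the user's dataloader code surfaces here, at
            # whatever point the first access happens and with the wrapper
            # in the traceback, which is what makes debugging harder.
            setattr(self, attr_name, fn(self))
        return getattr(self, attr_name)

    return wrapper
```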

williamFalcon (Contributor) commented:
@ethanwharris @jakubczakon @MattPainter01
any thoughts?

ethanwharris (Member) commented Feb 23, 2020

I always assumed the decorator was there to stop multiple instantiation. There was some old bug where data-loading threads would hang around after each epoch because new dataloaders were created and the old threads just carried on; having said that, I can't find the issue anywhere.

The IterableDataset handling is a bit fragile at the moment (mostly hard-coded type checks); there might be better ways to deal with it that would simplify the above.

If there's some way we can remove the decorator but still only create the dataloader once then that would be a big usability improvement :)
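(One possible shape for that, sketched with a hypothetical reset_train_dataloader method rather than the current API: the trainer builds and caches the loader explicitly at a known point instead of relying on a decorator.)

```python
class Trainer:
    def __init__(self):
        self.train_dataloader = None

    def reset_train_dataloader(self, model):
        # Build the dataloader exactly once, at a well-defined point
        # (e.g. right before the first training epoch), instead of lazily
        # whenever a decorated getter happens to be touched first.
        if self.train_dataloader is None:
            self.train_dataloader = model.train_dataloader()
```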

williamFalcon (Contributor) commented:
Agreed, that was the original reason. Basically, we could refactor to make sure we only call it when the epoch begins.

I think we also needed it before to determine the length, among some other reasons.
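(If the decorator was mainly there so the length is known up front, that could also be handled at the call site; a sketch with a hypothetical helper that tolerates IterableDataset-backed loaders without hard-coded type checks.)

```python
def num_training_batches(dataloader):
    # Hypothetical helper: infer the batch count without hard-coded type
    # checks. Loaders backed by an IterableDataset have no __len__, so
    # len() raises TypeError and the length is treated as unknown.
    try:
        return len(dataloader)
    except TypeError:
        return float('inf')
```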

srush (Contributor, Author) commented Feb 23, 2020

I'm actually stuck on a couple of issues here that I can't unwind.

Main issue: I don't really understand the semantics of train_dataloader in DDP / TPU training. Is it supposed to be called (a) only by rank 0, or (b) by all ranks? I would prefer (a), but I need to know so that I don't call barriers internally. If it is (b) then I do need the barriers, but the semantics are clearer. My assumption had been (b), which works for me with DDP (with TPU I'm still stuck).

Side issue: it's very difficult to determine ordering. I had been calling training_data() from configure_optimizer, but doing that seems to preempt everything and leads to strange behavior.

A related issue is that loading the dataset 8x on TPU blows up the limited amount of RAM that Colab allows. It would be nice to avoid this.
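(For what it's worth, under interpretation (b) the usual DDP pattern is that every rank calls train_dataloader and builds its own loader over a shard of the data, roughly as in the sketch below; this is generic PyTorch code, not Lightning-specific.)

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


class MyModule:
    def train_dataloader(self):
        # Interpretation (b): every rank calls this and builds its own
        # loader; a DistributedSampler restricts each rank to its shard.
        dataset = TensorDataset(torch.randn(1024, 16))
        sampler = None
        if dist.is_available() and dist.is_initialized():
            sampler = DistributedSampler(dataset)
        return DataLoader(dataset, batch_size=32, sampler=sampler,
                          shuffle=(sampler is None))
```

(Note that this still keeps a full in-memory copy of the underlying dataset in each process, which is exactly the 8x RAM problem on TPU; avoiding that would need the dataset itself to read lazily from disk rather than hold everything in memory.)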
