Init'ing Dataloader calls get_train_dataloader #922

Closed
srush opened this issue Feb 23, 2020 · 5 comments · Fixed by #926
Labels: bug, help wanted

Comments

srush (Contributor) commented Feb 23, 2020

It seems like the dataloader initialization code calls get_train_dataloader directly:

https://github.com/PyTorchLightning/pytorch-lightning/blob/c00a8a10dd32fa43a659a09b53e2dea3739c6d4e/pytorch_lightning/trainer/data_loading.py#L69

This means that all the effort of wrapping get_dataloader to sync through barriers for multi-GPU / TPU is not applied on this first call (and results in a crash).
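(For context, a minimal sketch of the kind of user-side wrapping that gets defeated; this is illustrative code, not taken from the issue. If the trainer invokes train_dataloader() during its own init, before the process group exists, the guarded barrier below is silently skipped, and without the guard the call simply crashes.)

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset


class MyModule:
    def train_dataloader(self):
        # Intended behaviour: only rank 0 does the expensive one-time
        # setup; the other ranks wait for it at the barrier.
        if dist.is_available() and dist.is_initialized():
            if dist.get_rank() == 0:
                self.prepare_data()  # hypothetical one-time download/preprocessing
            dist.barrier()
        dataset = TensorDataset(torch.randn(64, 8))
        return DataLoader(dataset, batch_size=8)

    def prepare_data(self):
        pass  # placeholder for a real download / preprocessing step
```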

srush added the bug and help wanted labels on Feb 23, 2020
williamFalcon (Contributor) commented:
Ummm, good point. I think we need to simplify the get_XXX_dataloader() calls. The lazy-loading decorator does make it harder for people to debug their data-loading issues.

@neggert do you remember the original reason we added the decorator? Maybe it's time to remove it and simplify this logic?
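(For reference, a lazy-loading decorator of the kind being discussed typically looks something like the sketch below; the name and details are illustrative, not the actual pytorch-lightning implementation.)

```python
import functools


def data_loader_cached(fn):
    """Hypothetical lazy-loading decorator: build the dataloader on first
    access and cache it on the instance afterwards."""
    attr_name = '_cached_' + fn.__name__

    @functools.wraps(fn)
    def wrapper(self):
        if not hasattr(self, attr_name):
            # Any error in the user's dataloader code surfaces here, at
            # whatever point the first access happens and with the wrapper
            # in the traceback, which is what makes debugging harder.
            setattr(self, attr_name, fn(self))
        return getattr(self, attr_name)

    return wrapper
```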

williamFalcon (Contributor) commented:
@ethanwharris @jakubczakon @MattPainter01
any thoughts?

ethanwharris (Member) commented Feb 23, 2020

I always assumed the decorator was there to stop multiple instantiation. There was some old bug where data-loading threads would hang around after each epoch because new dataloaders were created and the old threads just carried on; having said that, I can't find the issue anywhere.

The IterableDataset handling is a bit fragile at the moment (mostly hard-coded type checks); there might be better ways to deal with it that would simplify the above.

If there's some way we can remove the decorator but still only create the dataloader once then that would be a big usability improvement :)
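(One possible shape for that, sketched with a hypothetical reset_train_dataloader method rather than the current API: the trainer builds and caches the loader explicitly at a known point instead of relying on a decorator.)

```python
class Trainer:
    def __init__(self):
        self.train_dataloader = None

    def reset_train_dataloader(self, model):
        # Build the dataloader exactly once, at a well-defined point
        # (e.g. right before the first training epoch), instead of lazily
        # whenever a decorated getter happens to be touched first.
        if self.train_dataloader is None:
            self.train_dataloader = model.train_dataloader()
```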

williamFalcon (Contributor) commented:
Agreed, that was the original reason. Basically, we could refactor to make sure we only call it when the epoch begins.

I think we also needed it before to determine the length, among some other reasons.
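(If the decorator was mainly there so the length is known up front, that could also be handled at the call site; a sketch with a hypothetical helper that tolerates IterableDataset-backed loaders without hard-coded type checks.)

```python
def num_training_batches(dataloader):
    # Hypothetical helper: infer the batch count without hard-coded type
    # checks. Loaders backed by an IterableDataset have no __len__, so
    # len() raises TypeError and the length is treated as unknown.
    try:
        return len(dataloader)
    except TypeError:
        return float('inf')
```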

srush (Contributor, Author) commented Feb 23, 2020

I'm actually stuck on a couple of issues here that I can't unwind.

Main issue: I don't really understand the semantics of train_dataloader in DDP / TPU training. Is it supposed to be called (a) only by rank 0, or (b) by all ranks? I would prefer (a), but I need to know so that I don't call barriers internally. If it is (b) then I do need the barriers, but the semantics are clearer. My assumption had been (b), which works for me with DDP (with TPU I'm still stuck).

Side issue: it's very difficult to determine ordering. I had been calling training_data() from configure_optimizer, but doing that seems to preempt everything and leads to strange behavior.

A related issue is that loading the dataset 8x on TPU blows up the limited amount of RAM that Colab allows. It would be nice to avoid this.
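(For what it's worth, under interpretation (b) the usual DDP pattern is that every rank calls train_dataloader and builds its own loader over a shard of the data, roughly as in the sketch below; this is generic PyTorch code, not Lightning-specific.)

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


class MyModule:
    def train_dataloader(self):
        # Interpretation (b): every rank calls this and builds its own
        # loader; a DistributedSampler restricts each rank to its shard.
        dataset = TensorDataset(torch.randn(1024, 16))
        sampler = None
        if dist.is_available() and dist.is_initialized():
            sampler = DistributedSampler(dataset)
        return DataLoader(dataset, batch_size=32, sampler=sampler,
                          shuffle=(sampler is None))
```

(Note that this still keeps a full in-memory copy of the underlying dataset in each process, which is exactly the 8x RAM problem on TPU; avoiding that would need the dataset itself to read lazily from disk rather than hold everything in memory.)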
