TPU Training Issues and Feature Requests #902
Comments

@srush: Yeah, I think that `.where` is likely always better.

williamFalcon added a commit that referenced this issue on Feb 25, 2020 (merged): added get dataloaders directly using a getter; deleted decorator; added prepare_data hook; refactored dataloader init; added dataloader reset flag and main loop; fixed bad loaders; fixed error in .fit with loaders; fixes #909; Fixes #902.
tullie pushed a commit to tullie/pytorch-lightning that referenced this issue on Apr 3, 2020, with the same commit message (cross-repo references Lightning-AI#909 and Lightning-AI#902).
@srush: TPU training is really cool. A couple of issues came up in training.
There is no longer a call to `optimizer_step`; this is needed for setting learning rates that change at each step. I'm not sure how to fix this except maybe by adding another "post_optimizer" call, which might be a better place for this style of stepping (a sketch follows below).
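For illustration, a minimal sketch of the per-step LR control this hook enables, assuming the override signature from Lightning's docs of roughly this era (the signature has changed across versions, and the warmup numbers here are made up):

```python
import pytorch_lightning as pl


class WarmupModel(pl.LightningModule):
    # Sketch: override optimizer_step to apply linear LR warmup on every step.
    # The hook signature below follows the 0.x-era API and differs in later
    # Lightning versions.
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       second_order_closure=None):
        warmup_steps = 500   # illustrative value
        base_lr = 1e-3       # illustrative value
        lr_scale = min(1.0, (self.trainer.global_step + 1) / warmup_steps)
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr_scale * base_lr
        optimizer.step()
        optimizer.zero_grad()
```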
According to the XLA troubleshooting guide, https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md, `clip_grad_norm_` is very slow in XLA. They give a suggestion for a replacement function call; it would be nice if Lightning just called that instead (a sketch is below). Nice to have: point to that troubleshooting guide in the docs so people don't get stuck.
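For reference, a sketch of the `torch.where`-based idea being discussed: the stock `clip_grad_norm_` branches in Python on the computed norm, which forces a host-device sync under XLA, whereas a pure-tensor formulation keeps the graph lazy. This illustrates the approach; it is not necessarily the exact function the XLA guide recommends:

```python
import torch


def clip_grad_norm_lazy(parameters, max_norm, eps=1e-6):
    """XLA-friendly clipping sketch: no .item() call, no Python branch."""
    grads = [p.grad for p in parameters if p.grad is not None]
    # Total 2-norm over all gradients, kept as a device tensor throughout.
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    clip_coef = max_norm / (total_norm + eps)
    # torch.where replaces `if clip_coef < 1:` so nothing syncs to the host.
    clip_coef = torch.where(clip_coef < 1.0, clip_coef,
                            torch.ones_like(clip_coef))
    for g in grads:
        g.mul_(clip_coef)
    return total_norm
```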
When using DDP, we have been using `proc_rank` to decide when to print debug statements. It would be nice to have a value that works with both DDP and TPU, to abstract away this backend-specific term (see the sketch below).
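A hypothetical sketch of such an abstraction; the helper names and the RANK environment-variable fallback are assumptions, not Lightning's actual API (`xm.get_ordinal()` is the torch_xla call for the process index):

```python
import os


def global_rank():
    # Hypothetical backend-agnostic rank lookup covering DDP and TPU.
    try:
        import torch_xla.core.xla_model as xm  # present only on TPU setups
        return xm.get_ordinal()
    except ImportError:
        return int(os.environ.get("RANK", 0))  # DDP launchers export RANK


def debug_print(*args, **kwargs):
    # Print exactly once across all processes, whatever the backend.
    if global_rank() == 0:
        print(*args, **kwargs)
```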