TPU Training Issues and Feature Requests #902

Closed · srush opened this issue Feb 20, 2020 · 2 comments · Fixed by #926 or #932

srush (Contributor) commented Feb 20, 2020

TPU training is really cool. A couple of issues came up during training:

  • There is no longer a call to optimizer_step; this is needed for setting learning rates that change at each step (see the sketch after this list). Not sure how to fix this, except maybe by having another "post_optimizer" call, which might be a better place for this style of stepping.

  • According to the troubleshooting guide at https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md, clip_grad_norm_ is very slow in XLA. They give a suggestion for a replacement function call (a rough sketch appears later in this thread); it would be nice if Lightning just called this instead.

  • Nice to have: Point to that speed tutorial in the docs so people don't get stuck.

  • When using DDP, we have been using proc_rank to decide when to print debug statements. It might be nice to have a value that works with both DDP and TPU, to abstract away this term (a possible helper is sketched after this list).
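
On the optimizer_step point, here is a minimal sketch of the kind of per-step learning-rate warmup that hook is used for, written as a plain PyTorch loop; the 500-step linear warmup, the base LR, and the toy model are illustrative assumptions, not anything from Lightning itself.

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
base_lr, warmup_steps = 1e-3, 500  # assumed schedule, purely illustrative

for step in range(1000):
    loss = model(torch.randn(8, 10)).sum()
    loss.backward()
    # The per-step LR adjustment that an optimizer_step-style hook gives you
    # a place to do: scale the LR linearly until warmup finishes.
    lr_scale = min(1.0, (step + 1) / warmup_steps)
    for pg in optimizer.param_groups:
        pg["lr"] = lr_scale * base_lr
    optimizer.step()
    optimizer.zero_grad()
```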
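
On the proc_rank point, a sketch of a backend-agnostic rank check that would cover both DDP and TPU; the helper name global_rank is hypothetical, not an existing Lightning utility.

```python
import os

import torch.distributed as dist


def global_rank() -> int:
    """Hypothetical helper: the rank of this process under DDP or TPU."""
    # DDP on GPUs: torch.distributed is initialized and knows the rank.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    # TPU: torch_xla exposes the ordinal of the current XLA process.
    try:
        import torch_xla.core.xla_model as xm
        return xm.get_ordinal()
    except ImportError:
        pass
    # Fallback for launchers that export RANK in the environment.
    return int(os.environ.get("RANK", 0))


if global_rank() == 0:
    print("debug output from the rank-0 process only")
```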

srush added the bug and help wanted labels on Feb 20, 2020
Borda added this to the 0.6.1 milestone on Feb 20, 2020
williamFalcon added a commit that referenced this issue Feb 25, 2020
williamFalcon (Contributor) commented

@srush can the .where also work when not on TPUs? The docs suggest that it's always a faster way of clipping grads.

williamFalcon added a commit that referenced this issue Feb 25, 2020

* added get dataloaders directly using a getter

* deleted decorator

* added prepare_data hook

* refactored dataloader init

* added dataloader reset flag and main loop

* made changes

* fixed bad loaders

* fixed error in .fit with loaders

* fixes #909

* bug fix

* Fixes #902
srush (Contributor, Author) commented Feb 25, 2020

Yeah, think that .where is likely always better.
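
For reference, here is a rough sketch of the torch.where-based clipping that the XLA troubleshooting guide points at; the norm accumulation and the 1e-6 epsilon are approximations of that suggestion, not a copy of it. Nothing in it is TPU-specific, which is why it should also behave sensibly on GPUs.

```python
import torch


def clip_grad_norm_where(parameters, max_norm, norm_type=2.0):
    # Same effect as torch.nn.utils.clip_grad_norm_, but the "is the norm
    # too large?" decision stays on-device via torch.where, so XLA never
    # has to pull the norm back to the host in the middle of the graph.
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.zeros((), device=grads[0].device)
    for g in grads:
        total_norm += g.detach().norm(norm_type) ** norm_type
    total_norm = total_norm ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef = torch.where(clip_coef < 1.0, clip_coef, torch.ones_like(clip_coef))
    for g in grads:
        g.detach().mul_(clip_coef)
    return total_norm
```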

tullie pushed a commit to tullie/pytorch-lightning that referenced this issue Apr 3, 2020