Getting started with xla #2041
Comments
Hi @lezwon, contributions are definitely welcome! Unfortunately, we currently don't have good wikis/write-ups of our software stack and how its pieces are connected. It's on our todo list, but we haven't had the bandwidth to do it. If you don't mind reading code directly, we'd love to answer your questions and maybe produce some documentation along the way.
Hey @ailzhang, I will try to do that. What I'm currently attempting is to train a separate model in parallel on each TPU device. Like this. I just want to be sure it's actually using the TPU core I assigned it. I assumed I would get that through
It is not possible to run different graphs on different cores.
@dlibenzi I am training a model via the KFold method, with a different fold on every core. Similar to this: https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel
@lezwon, the example you shared executes the same model on all the cores, providing different slices of the data based on the fold.
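The fold-per-core idea above can be sketched in plain Python. This is an assumption-laden stand-in, not the Kaggle kernel's actual code: in a real torch_xla program, `ordinal` would come from `xm.get_ordinal()` inside the function passed to `xmp.spawn`, and the index lists would feed a sampler or dataset subset.

```python
# Minimal sketch: each core selects its own validation fold by ordinal.
# Assumptions: 8 cores, contiguous KFold-style splits, `train_val_split`
# and `fold_indices` are hypothetical helper names.
NUM_CORES = 8

def fold_indices(n_samples, n_folds=NUM_CORES):
    """Split sample indices into n_folds contiguous folds (KFold-style)."""
    base, rem = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < rem else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_val_split(ordinal, n_samples):
    """Core `ordinal` holds out fold `ordinal` for validation, trains on the rest."""
    folds = fold_indices(n_samples)
    val = folds[ordinal]
    train = [i for f, fold in enumerate(folds) if f != ordinal for i in fold]
    return train, val
```

Note that every core still runs the same training function (the same graph); only the data slice differs, which is what makes this pattern compatible with XLA's restriction mentioned below.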
@ultrons Yes, it does. I'm working on implementing this functionality in PyTorch Lightning. The reason I raised this issue is that training on MNIST is really slow when the batch_size is 64, and I wasn't sure whether it was actually using all the cores. I added
@lezwon local_ordinal is the ordinal of the device on a single host; it ranges from 0 to 7 (there are 8 devices per host). ordinal is the same as local_ordinal if there is only one host, but in multi-host training it ranks all the devices across all hosts, so it ranges from 0 to N*8-1.
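The relationship described above can be written down as simple arithmetic. A minimal sketch, assuming 8 devices per host as stated in the comment; `global_ordinal` stands in for the value of `xm.get_ordinal()` and `local_ordinal` for `xm.get_local_ordinal()` (the function names here are illustrative, not the library's internals):

```python
# Sketch of global ordinal vs. local ordinal on an N-host TPU setup.
DEVICES_PER_HOST = 8  # assumption from the comment above

def local_ordinal(global_ordinal):
    """Device index within its own host: always in 0..7."""
    return global_ordinal % DEVICES_PER_HOST

def host_index(global_ordinal):
    """Which host the device lives on: 0..N-1."""
    return global_ordinal // DEVICES_PER_HOST
```

On a single host (N = 1), `local_ordinal(g) == g` for every device, which is exactly why the two functions return the same value there.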
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
❓ Questions and Help
Hi there,
I'd like to learn how xla works and why some functions exist in this repo. For example:

- What is the difference between `get_local_ordinal` and `get_ordinal`? Why is it used?
- Why can nprocs only be 1 or 8 in `xla_multiprocessing`?
- Is there a doc that explains the xla_env_vars?

I basically want to start contributing but am unable to find enough resources to help me understand how this repo works. If anyone could point me to any docs that could help me get started, that would be great. Thanks :)