
Getting started with xla #2041

Closed

lezwon opened this issue May 7, 2020 · 8 comments
Labels: kaggle, stale (Has not had recent activity)

Comments

@lezwon
Contributor

lezwon commented May 7, 2020

❓ Questions and Help

Hi there,
I'd like to learn how xla works and why some functions exist in this repo. For example:

- What is the difference between get_local_ordinal and get_ordinal, and when is each used?
- Why can nprocs only be 1 or 8 in xla_multiprocessing?
- Is there a doc that explains the xla_env_vars?

I basically want to start contributing but am unable to find enough resources to help me understand how this repo works. If anyone could point me to any docs that could help me get started, it would be great. Thanks :)

@ailzhang
Contributor

ailzhang commented May 7, 2020

Hi @lezwon, contributions are definitely welcome! Unfortunately we currently don't have good wikis/write-ups of our software stacks and how they are connected. It's on our todo list, but we haven't had the bandwidth to do it. If you don't mind reading code directly, we'd love to answer your questions and maybe produce some documentation along the way.
What I'd personally recommend, instead of reading the whole codebase at once, is to pick an issue to start with and expand your knowledge from one area to others. Thanks!

@lezwon
Contributor Author

lezwon commented May 14, 2020

Hey @ailzhang, I will try to do that. What I'm currently attempting is to train a separate model in parallel on each TPU device. Like this. I just want to be sure it's actually using the TPU core I assigned to it. I assumed I would get that through xm.get_local_ordinal() or xm.get_ordinal(), but it always shows 0. I'm not sure how these variables are set or what their purpose is. Here are my doubts:

1. Does xm.xla_device() set any TPU core globally? I see it calls torch_xla._XLAC._xla_set_default_device(device) here.

2. What do local_ordinal and ordinal represent?
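A minimal sketch of what I'm checking (the explicit device index and the print calls are just illustrative, not from my notebook):

```python
import torch_xla
import torch_xla.core.xla_model as xm

# Request a specific core explicitly (index 1 here, purely as an example).
device = xm.xla_device(1)
print(device)  # e.g. "xla:1"

# xm.xla_device() also sets the process-wide default device (via
# torch_xla._XLAC._xla_set_default_device); this reads it back.
print(torch_xla._XLAC._xla_get_default_device())
```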

@dlibenzi
Collaborator

It is not possible to run different graphs on different cores.

@lezwon
Contributor Author

lezwon commented May 14, 2020

@dlibenzi I am training a model via the KFold method, with a different fold on every core. Similar to this: https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel
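Roughly what I have in mind, as a sketch (the train_one_fold helper is just a placeholder, not from the kernel):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def train_one_fold(fold, device):
    # placeholder for the real per-fold training loop
    print(f'training fold {fold} on {device}')

def _mp_fn(index):
    device = xm.xla_device()
    fold = xm.get_ordinal()        # core 0 gets fold 0, core 1 gets fold 1, ...
    train_one_fold(fold, device)   # same model on every core, different data slice

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```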

@ultrons
Contributor

ultrons commented May 15, 2020

@lezwon, the example you shared executes the same model on all the cores, providing different slices of the data based on the fold.

@lezwon
Contributor Author

lezwon commented May 15, 2020

@ultrons yes, it does. I'm working on implementing this functionality in PyTorch Lightning. The reason I raised this issue was that training is really slow on MNIST when batch_size is 64, and I wasn't sure whether it was actually using all the cores. I added torch_xla._XLAC._xla_get_default_device() during the training process to check the TPU core, and it seems to be working correctly. Increasing the batch size from 64 to 1024 sped up training a great deal, but the model accuracy deteriorated. Is there something I'm doing wrong? Why is it so slow with a batch_size of 64?
Notebook: https://www.kaggle.com/lezwon/pytorch-lightning-parallel-tpu-tranining/

@taylanbil
Collaborator

@lezwon local_ordinal is the ordinal corresponding to the device on one host. It ranges from 0 to 7 (because there are 8 devices per host).

The ordinal is the same as local_ordinal if there is only one host. But if you are doing multi-host training, it ranks all the devices across all hosts, so it ranges from 0 to N*8 - 1, where N is the number of hosts.
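A minimal illustration (assuming the usual xmp.spawn entry point with nprocs=8 per host):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # On a single host the two ordinals match (0-7). With two hosts, the
    # second host's cores still report local ordinals 0-7 but global
    # ordinals 8-15.
    print('local ordinal:', xm.get_local_ordinal(),
          'global ordinal:', xm.get_ordinal(),
          'world size:', xm.xrt_world_size())

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```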

@stale

stale bot commented Jun 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label on Jun 14, 2020
@lezwon lezwon closed this as completed Jun 15, 2020
@zcain117 zcain117 added the kaggle label Jul 1, 2020