
Getting started with xla #2041

Closed

lezwon opened this issue May 7, 2020 · 8 comments
Labels: kaggle, stale (Has not had recent activity)

Comments

@lezwon
Contributor

lezwon commented May 7, 2020

❓ Questions and Help

Hi there,
I'd like to learn how xla works and why some functions exist in this repo. For example:

- What is the difference between get_local_ordinal and get_ordinal, and when is each used?
- Why can nprocs only be 1 or 8 in xla_multiprocessing?
- Is there a doc that explains the xla_env_vars?

I basically want to start contributing but am unable to find enough resources to help me understand how this repo works. If anyone could point me to any docs that could help me get started, it would be great. Thanks :)

@ailzhang
Contributor

ailzhang commented May 7, 2020

Hi @lezwon, contributions are definitely welcome! Unfortunately we currently don't have good wikis/write-ups of our software stacks and how they are connected. It's on our todo list, but we haven't had the bandwidth to do it. If you don't mind reading code directly, we'd love to answer your questions and maybe produce some documentation along the way.
What I'd personally recommend, instead of reading the whole codebase at once, is to pick an issue to start with and expand your knowledge from one area to others. Thanks!

@lezwon
Contributor Author

lezwon commented May 14, 2020

Hey @ailzhang, I will try to do that. What I'm currently attempting is to train a separate model in parallel on each TPU device. Like this. I just want to be sure it's actually using the TPU core I assigned to it. I assumed I would get that through xm.get_local_ordinal() or xm.get_ordinal(), but it always shows 0. I'm not sure how these variables are set or what their purpose is. Here are my doubts:

1. Does xm.xla_device() set any TPU core globally? I see it calls torch_xla._XLAC._xla_set_default_device(device) here.

2. What do local_ordinal and ordinal represent?
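A minimal sketch of what I'm checking (the explicit device index and the print calls are just illustrative, not from my notebook):

```python
import torch_xla
import torch_xla.core.xla_model as xm

# Request a specific core explicitly (index 1 here, purely as an example).
device = xm.xla_device(1)
print(device)  # e.g. "xla:1"

# xm.xla_device() also sets the process-wide default device (via
# torch_xla._XLAC._xla_set_default_device); this reads it back.
print(torch_xla._XLAC._xla_get_default_device())
```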

@dlibenzi
Collaborator

It is not possible to run different graphs on different cores.

@lezwon
Contributor Author

lezwon commented May 14, 2020

@dlibenzi I am training a model via the KFold method, with a different fold on every core. Similar to this: https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel
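Roughly what I have in mind, as a sketch (the train_one_fold helper is just a placeholder, not from the kernel):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def train_one_fold(fold, device):
    # placeholder for the real per-fold training loop
    print(f'training fold {fold} on {device}')

def _mp_fn(index):
    device = xm.xla_device()
    fold = xm.get_ordinal()        # core 0 gets fold 0, core 1 gets fold 1, ...
    train_one_fold(fold, device)   # same model on every core, different data slice

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```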

@ultrons
Contributor

ultrons commented May 15, 2020

@lezwon, the example you shared executes the same model on all the cores, providing different slices of the data based on the fold.

@lezwon
Contributor Author

lezwon commented May 15, 2020

@ultrons yes, it does. I'm working on implementing this functionality in PyTorch Lightning. The reason I raised this issue was that training is really slow on MNIST when batch_size is 64, and I wasn't sure whether it was actually using all the cores. I added torch_xla._XLAC._xla_get_default_device() during the training process to check the TPU core, and it seems to be working correctly. Increasing the batch size from 64 to 1024 sped up training a great deal, but the model accuracy deteriorated. Is there something I'm doing wrong? Why is it so slow with a batch_size of 64?
Notebook: https://www.kaggle.com/lezwon/pytorch-lightning-parallel-tpu-tranining/

@taylanbil
Collaborator

@lezwon local_ordinal is the ordinal corresponding to the device on one host. It ranges from 0 to 7 (because there are 8 devices per host).

The ordinal is the same as local_ordinal if there is only one host. But if you are doing multi-host training, it ranks all the devices across all hosts, so it ranges from 0 to N*8 - 1, where N is the number of hosts.
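A minimal illustration (assuming the usual xmp.spawn entry point with nprocs=8 per host):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # On a single host the two ordinals match (0-7). With two hosts, the
    # second host's cores still report local ordinals 0-7 but global
    # ordinals 8-15.
    print('local ordinal:', xm.get_local_ordinal(),
          'global ordinal:', xm.get_ordinal(),
          'world size:', xm.xrt_world_size())

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```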

@stale

stale bot commented Jun 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label on Jun 14, 2020
@lezwon lezwon closed this as completed Jun 15, 2020
@zcain117 zcain117 added the kaggle label Jul 1, 2020