From b1a7b7e9bf6f6b1bac34c4e2be687cff4c362663 Mon Sep 17 00:00:00 2001
From: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Date: Wed, 26 May 2021 18:11:00 +0530
Subject: [PATCH] Add `tpuvm` section in TPU docs (#7714)

---
 docs/source/advanced/tpu.rst | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/docs/source/advanced/tpu.rst b/docs/source/advanced/tpu.rst
index 09a614f31c854..33bd630e5af30 100644
--- a/docs/source/advanced/tpu.rst
+++ b/docs/source/advanced/tpu.rst
@@ -163,6 +163,26 @@ TPUs work in DDP mode by default (distributing over each core)
 
 ----------------
 
+TPU VM
+------
+Lightning supports training on the new Cloud TPU VMs.
+Previously, a separate user VM was needed to connect to the TPU machines, but
+Cloud TPU VMs run on the TPU host machines themselves, so users can SSH into
+them directly. This architecture upgrade makes working with TPUs cheaper and
+brings significantly better performance and usability.
+
+TPU VMs come pre-installed with the latest versions of PyTorch and PyTorch XLA.
+After connecting to the VM, and before running your Lightning code, you need to
+set the XRT TPU device configuration.
+
+.. code-block:: bash
+
+    $ export XRT_TPU_CONFIG="localservice;0;localhost:51011"
+
+You can learn more about the Cloud TPU VM architecture `here `_
+
+----------------
+
 TPU Pod
 -------
 To train on more than 8 cores, your code actually doesn't change!
@@ -173,7 +193,7 @@ All you need to do is submit the following command:
     $ python -m torch_xla.distributed.xla_dist
     --tpu=$TPU_POD_NAME
     --conda-env=torch-xla-nightly
-    -- python /usr/share/torch-xla-0.5/pytorch/xla/test/test_train_imagenet.py --fake_data
+    -- python /usr/share/torch-xla-1.8.1/pytorch/xla/test/test_train_imagenet.py --fake_data
 
 See `this guide `_
 on how to set up the instance groups and VMs needed to run TPU Pods.
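
For reference, a minimal sketch of a Lightning training script that could be launched on the
TPU VM once ``XRT_TPU_CONFIG`` has been exported as shown in the patch above. The dataset and
model definitions here are illustrative placeholders, not part of the patch; the relevant part
is that passing ``tpu_cores=8`` to the ``Trainer`` trains across all eight TPU cores.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl


    class RandomDataset(Dataset):
        """Tiny synthetic dataset, only to keep the example self-contained."""

        def __init__(self, size: int = 64, length: int = 256):
            self.data = torch.randn(length, size)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, index):
            return self.data[index]


    class BoringModel(pl.LightningModule):
        """Minimal LightningModule; substitute your own model here."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(64, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            # A throwaway "loss" so the example runs end to end.
            return self(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        model = BoringModel()
        train_loader = DataLoader(RandomDataset(), batch_size=32)
        # tpu_cores=8 distributes training over all 8 cores of the TPU VM.
        trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
        trainer.fit(model, train_loader)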