Add tpuvm section in TPU docs (#7714)
kaushikb11 authored May 26, 2021
1 parent 311d9fe commit b1a7b7e
Showing 1 changed file with 21 additions and 1 deletion.
22 changes: 21 additions & 1 deletion docs/source/advanced/tpu.rst
@@ -163,6 +163,26 @@ TPUs work in DDP mode by default (distributing over each core)

----------------

TPU VM
------
Lightning supports training on the new Cloud TPU VMs.
Previously, a separate user VM was needed to connect to the TPU machines, but
Cloud TPU VMs run on the TPU host machines themselves, giving users direct
SSH access. This architecture upgrade makes TPUs cheaper to use and
significantly improves performance and usability.

TPU VMs come pre-installed with the latest versions of PyTorch and PyTorch/XLA.
After connecting to the VM, and before running your Lightning code, you need
to set the XRT TPU device configuration.

.. code-block:: bash

    $ export XRT_TPU_CONFIG="localservice;0;localhost:51011"

You can learn more about the Cloud TPU VM architecture `here <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_vms_3>`_.
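
Once the configuration is set, a quick way to confirm the TPU is reachable is
to allocate a tensor on the XLA device. Below is a minimal sketch, assuming
``torch`` and ``torch_xla`` are importable on the VM (both come pre-installed
on TPU VMs):

.. code-block:: python

    import torch
    import torch_xla.core.xla_model as xm

    # Acquire the XLA device backed by a TPU core
    device = xm.xla_device()

    # Run a tiny computation on the TPU to verify connectivity
    tensor = torch.ones((2, 2), device=device)
    print(tensor.device)  # e.g. xla:1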

----------------

TPU Pod
-------
To train on more than 8 cores, your code actually doesn't change!
@@ -173,7 +193,7 @@ All you need to do is submit the following command:
$ python -m torch_xla.distributed.xla_dist
--tpu=$TPU_POD_NAME
--conda-env=torch-xla-nightly
-   -- python /usr/share/torch-xla-0.5/pytorch/xla/test/test_train_imagenet.py --fake_data
+   -- python /usr/share/torch-xla-1.8.1/pytorch/xla/test/test_train_imagenet.py --fake_data
See `this guide <https://cloud.google.com/tpu/docs/tutorials/pytorch-pod>`_
on how to set up the instance groups and VMs needed to run TPU Pods.
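
Inside the script submitted through ``xla_dist``, the Lightning ``Trainer``
is configured the same way as on a single TPU host. A minimal sketch, where
``MyModel`` is a placeholder for your own ``LightningModule``:

.. code-block:: python

    import pytorch_lightning as pl

    # MyModel is a placeholder for your own LightningModule
    model = MyModel()

    # Each pod host runs this script on its 8 local cores;
    # xla_dist replicates the process across all hosts in the pod.
    trainer = pl.Trainer(tpu_cores=8)
    trainer.fit(model)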
