checkpointing docs (#495)

Summary:
Pull Request resolved: #495

Added OSS checkpointing docs

Reviewed By: daniellepintz

Differential Revision: D46036738

fbshipit-source-id: 6d40a854f09c597c6a7503229f48423df532f060
pytorch · Aug 9, 2023 · a9ad674
1 parent 5150591
commit a9ad674

Showing 2 changed files with 65 additions and 1 deletion.
diff --git a/docs/source/checkpointing.rst b/docs/source/checkpointing.rst
@@ -0,0 +1,63 @@
+Checkpointing
+================================
+
+TorchTNT offers checkpointing via the :class:`~torchtnt.framework.callbacks.TorchSnapshotSaver` callback, which uses `TorchSnapshot <https://github.com/pytorch/torchsnapshot>`_ under the hood.
+
+.. code-block:: python
+
+    module = nn.Linear(input_dim, 1)
+    unit = MyAutoUnit(module=module)
+    tss = TorchSnapshotSaver(
+        dirpath=your_dirpath_here,
+        save_every_n_train_steps=100,
+        save_every_n_epochs=2,
+    )
+    # loads latest checkpoint, if it exists
+    if latest_checkpoint_dir:
+        tss.restore_from_latest(your_dirpath_here, unit, train_dataloader=dataloader)
+    train(
+        unit,
+        dataloader,
+        callbacks=[tss]
+    )
+
+There is built-in support for saving and loading distributed models (DDP, FSDP).
+
+The state dict type used to checkpoint FSDP modules can be specified through the ``state_dict_type`` argument of :class:`~torchtnt.utils.prepare_module.FSDPStrategy`:
+
+.. code-block:: python
+
+    module = nn.Linear(input_dim, 1)
+    fsdp_strategy = FSDPStrategy(
+        # sets state dict type of FSDP module
+        state_dict_type=StateDictType.SHARDED_STATE_DICT
+    )
+    unit = MyAutoUnit(module=module, strategy=fsdp_strategy)
+    tss = TorchSnapshotSaver(
+        dirpath=your_dirpath_here,
+        save_every_n_epochs=2,
+    )
+    train(
+        unit,
+        dataloader,
+        # checkpointer callback will use state dict type specified in FSDPStrategy
+        callbacks=[tss]
+    )
+
+Alternatively, set it manually with `FSDP.set_state_dict_type <https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.set_state_dict_type>`_:
+
+.. code-block:: python
+
+    module = nn.Linear(input_dim, 1)
+    fsdp_strategy = FSDPStrategy()
+    unit = MyAutoUnit(module=module, strategy=fsdp_strategy)
+    FSDP.set_state_dict_type(unit.module, StateDictType.SHARDED_STATE_DICT)
+    tss = TorchSnapshotSaver(
+        dirpath=your_dirpath_here,
+        save_every_n_epochs=2,
+    )
+    train(
+        unit,
+        dataloader,
+        callbacks=[tss]
+    )
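
Note on the first example in checkpointing.rst: it passes both `save_every_n_train_steps=100` and `save_every_n_epochs=2`, and the page does not spell out how the two interact. The sketch below is one way to picture the cadence. It does not call TorchTNT. `checkpoint_events` and its arguments are hypothetical names, and it assumes the two triggers act independently (a save whenever the global step count is a multiple of N, plus a save at the end of every M-th epoch). That reading follows the parameter names but is not confirmed by this diff.

```python
from typing import List, Optional, Tuple


def checkpoint_events(
    steps_per_epoch: int,
    num_epochs: int,
    save_every_n_train_steps: Optional[int] = None,
    save_every_n_epochs: Optional[int] = None,
) -> List[Tuple[str, int, int]]:
    """Return (trigger, epoch, global_step) tuples for a simulated run.

    Hypothetical helper for illustration only; it is not part of TorchTNT.
    Epochs are counted from 1 and global_step counts completed train steps.
    """
    events: List[Tuple[str, int, int]] = []
    global_step = 0
    for epoch in range(1, num_epochs + 1):
        for _ in range(steps_per_epoch):
            global_step += 1
            # Assumed step-based trigger: every N completed train steps.
            if save_every_n_train_steps and global_step % save_every_n_train_steps == 0:
                events.append(("step", epoch, global_step))
        # Assumed epoch-based trigger: at the end of every M-th epoch.
        if save_every_n_epochs and epoch % save_every_n_epochs == 0:
            events.append(("epoch", epoch, global_step))
    return events


if __name__ == "__main__":
    # Two epochs of 120 steps with the settings from the docs example.
    for event in checkpoint_events(120, 2, save_every_n_train_steps=100, save_every_n_epochs=2):
        print(event)
```

Under these assumptions a run of two 120-step epochs saves at steps 100 and 200 and again at the end of epoch 2. If a step-based and an epoch-based save land on the same step, this sketch lists both; whether the callback deduplicates them is not covered by the page.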
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -59,9 +59,10 @@ Documentation
 
 .. toctree::
    :maxdepth: 1
-   :caption: Distributed
+   :caption: Core Concepts
 
    distributed
+   checkpointing
 
 .. toctree::
    :maxdepth: 1
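
Note on "loads latest checkpoint, if it exists" in checkpointing.rst: the page does not describe how the latest checkpoint is located. As a rough mental model only, the sketch below picks the newest subdirectory under a parent directory. It is not TorchTNT's implementation. `find_latest_checkpoint_dir` is a hypothetical helper, and the `epoch_<E>_step_<S>` directory naming is an assumption made for illustration that this diff does not state.

```python
import os
import re
import tempfile
from typing import Optional, Tuple

# Assumed naming scheme for checkpoint subdirectories (illustrative only).
_CHECKPOINT_NAME = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def find_latest_checkpoint_dir(dirpath: str) -> Optional[str]:
    """Return the subdirectory with the highest (epoch, step), or None.

    Hypothetical helper; not part of TorchTNT or TorchSnapshot.
    """
    if not os.path.isdir(dirpath):
        return None
    best_key: Optional[Tuple[int, int]] = None
    best_name: Optional[str] = None
    for name in os.listdir(dirpath):
        match = _CHECKPOINT_NAME.match(name)
        if match is None or not os.path.isdir(os.path.join(dirpath, name)):
            continue  # skip files and unrelated directories
        # Compare numerically so that epoch_10 ranks above epoch_2.
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return None if best_name is None else os.path.join(dirpath, best_name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as parent:
        for name in ("epoch_2_step_240", "epoch_10_step_1200", "notes"):
            os.makedirs(os.path.join(parent, name))
        print(find_latest_checkpoint_dir(parent))
```

Returning `None` when nothing matches mirrors the "if it exists" wording: a fresh run with an empty directory simply starts training from scratch.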