
Commit

Add figures to distributed tutorial (#8917)
Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: Kinga Gajdamowicz <[email protected]>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
6 people authored Feb 15, 2024
1 parent 4c32243 commit fbd23bb
Showing 6 changed files with 21 additions and 4 deletions.
Binary file added docs/source/_figures/dist_part.png
Binary file added docs/source/_figures/dist_sampling.png
Binary file added docs/source/_figures/intel_kumo.png
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -61,6 +61,8 @@
'_static/thumbnails/explain.png',
'tutorial/shallow_node_embeddings':
'_static/thumbnails/shallow_node_embeddings.png',
'tutorial/distributed_pyg':
'_static/thumbnails/distributed_pyg.png',
'tutorial/multi_gpu_vanilla':
'_static/thumbnails/multi_gpu_vanilla.png',
'tutorial/multi_node_multi_gpu_vanilla':
23 changes: 19 additions & 4 deletions docs/source/tutorial/distributed_pyg.rst
@@ -1,6 +1,9 @@
Distributed Training in PyG
===========================

.. figure:: ../_figures/intel_kumo.png
:width: 400px

.. note::
We are thrilled to announce the first **in-house distributed training solution** for :pyg:`PyG` via :class:`torch_geometric.distributed`, available from version 2.5 onwards.
Developers and researchers can now take full advantage of distributed training on large-scale datasets which cannot be fully loaded in memory of one machine at the same time.
@@ -15,11 +18,11 @@ Key Advantages
--------------

#. **Balanced graph partitioning** via METIS ensures minimal communication overhead when sampling subgraphs across compute nodes.
#. Utilizing **DDP for model training in conjunction with RPC for remote sampling and feature fetching routines** (with TCP/IP protocol and `gloo <https://github.com/facebookincubator/gloo>`_ communication backend) allows for data parallelism with distinct data partitions at each node.
#. Utilizing **DDP for model training** in conjunction with **RPC for remote sampling and feature fetching routines** (with TCP/IP protocol and `gloo <https://github.com/facebookincubator/gloo>`_ communication backend) allows for data parallelism with distinct data partitions at each node.
#. The implementation via custom :class:`~torch_geometric.data.GraphStore` and :class:`~torch_geometric.data.FeatureStore` APIs provides a flexible and tailored interface for distributing large graph structure information and feature storage.
#. Distributed neighbor sampling is capable of sampling in both local and remote partitions through RPC communication channels.
All advanced functionality of single-node sampling are also applicable for distributed training, *e.g.*, heterogeneous sampling, link-level sampling, temporal sampling, *etc*..
#. Distributed data loaders offer a high-level abstraction for managing sampler processes, ensuring simplicity and seamless integration with standard :pyg:`PyG` data loaders..
#. **Distributed neighbor sampling** is capable of sampling in both local and remote partitions through RPC communication channels.
All advanced functionality of single-node sampling are also applicable for distributed training, *e.g.*, heterogeneous sampling, link-level sampling, temporal sampling, *etc*.
#. **Distributed data loaders** offer a high-level abstraction for managing sampler processes, ensuring simplicity and seamless integration with standard :pyg:`PyG` data loaders.
#. Incorporating the Python `asyncio <https://docs.python.org/3/library/asyncio.html>`_ library for asynchronous processing on top of :pytorch:`PyTorch`-based RPCs further enhances the system's responsiveness and overall performance.
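
A minimal sketch of the DDP-plus-RPC combination described in the list above, using plain :pytorch:`PyTorch` primitives (the addresses, ports, and the ``run`` helper are placeholders for illustration, not part of the tutorial or of :class:`torch_geometric.distributed`):

.. code-block:: python

    import os

    import torch.distributed as dist
    import torch.distributed.rpc as rpc


    def run(rank: int, world_size: int):
        # RPC agent that serves remote sampling and feature fetching requests:
        os.environ.setdefault('MASTER_ADDR', 'localhost')
        os.environ.setdefault('MASTER_PORT', '29500')
        rpc.init_rpc(name=f'worker{rank}', rank=rank, world_size=world_size)

        # Separate process group that synchronizes model gradients via DDP
        # over TCP/IP with the `gloo` backend:
        dist.init_process_group(
            backend='gloo',
            init_method='tcp://localhost:29501',
            rank=rank,
            world_size=world_size,
        )

        # ... wrap the GNN in `DistributedDataParallel` and run training ...

        dist.destroy_process_group()
        rpc.shutdown()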

Architecture Components
@@ -57,6 +60,12 @@ This ensures that the resulting partitions provide maximal local access of neigh
Through this partitioning approach, every edge receives a distinct assignment, while "halo nodes" (1-hop neighbors that fall into a different partition) are replicated.
Halo nodes ensure that neighbor sampling for a single node in a single layer stays purely local.

.. figure:: ../_figures/dist_part.png
:align: center
:width: 100%

Graph partitioning with halo nodes.

In our distributed training example, we prepared the `partition_graph.py <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/distributed/pyg/partition_graph.py>`_ script to demonstrate how to apply partitioning on a selected subset of both homogeneous and heterogeneous graphs.
The :class:`~torch_geometric.distributed.Partitioner` can also preserve node features, edge features, and any temporal attributes at the level of nodes and edges.
Later on, each node in the cluster then owns a single partition of this graph.
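
A minimal sketch of this partitioning step (the dataset and output directory below are placeholders; the linked script covers both the homogeneous and heterogeneous cases):

.. code-block:: python

    from torch_geometric.datasets import FakeDataset
    from torch_geometric.distributed import Partitioner

    data = FakeDataset()[0]  # Placeholder graph; substitute your own dataset.

    # Split the graph into two METIS partitions and persist them to disk,
    # together with their node/edge features and the partition mapping:
    partitioner = Partitioner(data, num_parts=2, root='./partitions')
    partitioner.generate_partition()
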
@@ -174,6 +183,12 @@ A batch of seed nodes follows three main steps before it is made available for t
#. **Data conversion:** Based on the sampler output and the acquired node (or edge) features, a :pyg:`PyG` :class:`~torch_geometric.data.Data` or :class:`~torch_geometric.data.HeteroData` object is created.
This object forms a batch used in subsequent computational operations of the model.

.. figure:: ../_figures/dist_sampling.png
:align: center
:width: 450px

Local and remote neighbor sampling.
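
Putting these steps together, each node typically constructs a distributed loader over its local partition. The following is a rough sketch modeled after the distributed example scripts; the partition directory, cluster sizes, and keyword arguments are assumptions and may need adjusting:

.. code-block:: python

    import torch

    from torch_geometric.distributed import (
        DistContext,
        DistNeighborLoader,
        LocalFeatureStore,
        LocalGraphStore,
    )

    node_rank, num_nodes = 0, 2    # Placeholder: this machine's rank and cluster size.
    train_idx = torch.arange(128)  # Placeholder seed nodes owned by this partition.

    # Load this node's partition as produced by the partitioning step:
    graph = LocalGraphStore.from_partition('./partitions', node_rank)
    feature = LocalFeatureStore.from_partition('./partitions', node_rank)

    # Describe this node's position within the training cluster:
    ctx = DistContext(
        rank=node_rank,
        global_rank=node_rank,
        world_size=num_nodes,
        global_world_size=num_nodes,
        group_name='distributed-training',
    )

    train_loader = DistNeighborLoader(
        data=(feature, graph),
        input_nodes=train_idx,
        num_neighbors=[15, 10, 5],  # Per-layer fan-out (placeholder values).
        batch_size=1024,
        current_ctx=ctx,
        master_addr='localhost',    # Placeholder RPC address and port.
        master_port=11111,
    )

    for batch in train_loader:
        ...  # `batch` is a regular PyG `Data`/`HeteroData` object.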

Distributed Data Loading
~~~~~~~~~~~~~~~~~~~~~~~~

