Merge pull request #146 from SMILELab-FL/docs

Docs
SMILELab-FL · Sep 25, 2021 · 0937b84
2 parents c8792c0 + 9054776
commit 0937b84
Showing 12 changed files with 311 additions and 19 deletions.
diff --git a/docs/imgs/data-partition/cifar10_balance_dir_alpha_0.3_100clients.png b/docs/imgs/data-partition/cifar10_balance_dir_alpha_0.3_100clients.png
diff --git a/docs/imgs/data-partition/cifar10_balance_iid_100clients.png b/docs/imgs/data-partition/cifar10_balance_iid_100clients.png
diff --git a/docs/imgs/data-partition/cifar10_hetero_dir_0.3_100clients.png b/docs/imgs/data-partition/cifar10_hetero_dir_0.3_100clients.png
diff --git a/docs/imgs/data-partition/cifar10_hetero_dir_0.3_100clients_dist.png b/docs/imgs/data-partition/cifar10_hetero_dir_0.3_100clients_dist.png
diff --git a/docs/imgs/data-partition/cifar10_shards_200_100clients.png b/docs/imgs/data-partition/cifar10_shards_200_100clients.png
diff --git a/...data-partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients.png b/...data-partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients.png
diff --git a/...partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients_dist.png b/...partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients_dist.png
diff --git a/docs/imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients.png b/docs/imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients.png
diff --git a/...imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients_dist.png b/...imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients_dist.png
diff --git a/docs/source/refs.bib b/docs/source/refs.bib
@@ -27,3 +27,26 @@ @article{lin2017deep
   journal={arXiv preprint arXiv:1712.01887},
   year={2017}
 }
+
+@inproceedings{acar2020federated,
+  title={Federated learning based on dynamic regularization},
+  author={Acar, Durmus Alp Emre and Zhao, Yue and Matas, Ramon and Mattina, Matthew and Whatmough, Paul and Saligrama, Venkatesh},
+  booktitle={International Conference on Learning Representations},
+  year={2020}
+}
+
+@inproceedings{yurochkin2019bayesian,
+  title={Bayesian nonparametric federated learning of neural networks},
+  author={Yurochkin, Mikhail and Agarwal, Mayank and Ghosh, Soumya and Greenewald, Kristjan and Hoang, Nghia and Khazaeni, Yasaman},
+  booktitle={International Conference on Machine Learning},
+  pages={7252--7261},
+  year={2019},
+  organization={PMLR}
+}
+
+@article{wang2020federated,
+  title={Federated learning with matched averaging},
+  author={Wang, Hongyi and Yurochkin, Mikhail and Sun, Yuekai and Papailiopoulos, Dimitris and Khazaeni, Yasaman},
+  journal={arXiv preprint arXiv:2002.06440},
+  year={2020}
+}
diff --git a/docs/source/tutorials/data_partition.rst b/docs/source/tutorials/data_partition.rst
@@ -1,41 +1,310 @@
 .. _data-partition:
 
-****************
-Data Partitioner
-****************
+***************
+DataPartitioner
+***************
 
-This chapter introduces the dataset partitioner ``DataPartitioner`` and how the client process uses the corresponding dataset. **FedLab** provides various methods to deal with different partition strategy corresponding with different dataset situations.
+In real-world settings, FL needs to handle many kinds of data distribution, including
+both iid and non-iid scenarios. Although partition schemes already exist for some published benchmarks,
+it can still be messy and difficult for researchers to partition datasets according to their specific
+research problems and to keep the partition results consistent during simulation. FedLab provides :class:`fedlab.utils.dataset.partition.DataPartitioner`, which lets you use pre-partitioned datasets as well as your own data. :class:`DataPartitioner` stores the sample indices for each client under a given data partition scheme.
+
+FedLab also provides a number of pre-defined partition schemes for some datasets (such as CIFAR10). These subclass :class:`fedlab.utils.dataset.partition.DataPartitioner` and implement functions specific to a particular partition scheme. They can be used to prototype and benchmark your FL algorithms.
 
-For classification datasets, FedLab provides noniid and random partition method.
 
 CIFAR10Partitioner
 ==================
 
+For CIFAR10, we provide 6 pre-defined partition schemes. We partition CIFAR10 with the following parameters:
+
+- ``targets`` is the list of labels of the dataset to partition
+- ``num_clients`` specifies the number of clients in the partition scheme
+- ``balance`` controls whether all clients hold the same number of samples
+- ``partition`` specifies the partition scheme name
+- ``unbalance_sgm`` is the parameter for the unbalanced partition
+- ``num_shards`` is the parameter for the non-iid partition using shards
+- ``dir_alpha`` is the parameter of the Dirichlet distribution used in the partition
+- ``verbose`` controls whether to print intermediate information
+- ``seed`` sets the random seed
+
+Each partition scheme is selected on CIFAR10 through a different combination of parameters:
+
+- ``balance=None``: do not fix the number of samples for each client in advance
+
+  - ``partition="dirichlet"``: non-iid partition used in
+    :cite:t:`yurochkin2019bayesian` and :cite:t:`wang2020federated`. ``dir_alpha`` needs to be specified in this partition scheme.
+
+  - ``partition="shards"``: non-iid method used in FedAvg :cite:p:`mcmahan2017communication`. Refer to :func:`fedlab.utils.dataset.functional.shards_partition` for more information. ``num_shards`` needs to be specified here.
+
+- ``balance=True``: "Balance" refers to the FL scenario in which all clients have the same number of samples. Refer to :func:`fedlab.utils.dataset.functional.balance_partition` for more information. This partition scheme is from :cite:t:`acar2020federated`.
+
+  - ``partition="iid"``: Randomly select samples from the complete dataset, given the sample number for each client.
+
+  - ``partition="dirichlet"``: Refer to :func:`fedlab.utils.dataset.functional.client_inner_dirichlet_partition` for more information. ``dir_alpha`` needs to be specified in this partition scheme.
+
+- ``balance=False``: "Unbalance" refers to the FL scenario in which clients have different numbers of samples. For the unbalanced method, the sample number for each client is drawn from a Log-Normal distribution with variance ``unbalance_sgm``. When ``unbalance_sgm=0``, the partition is balanced. This partition scheme is from :cite:t:`acar2020federated`.
+
+  - ``partition="iid"``: Randomly select samples from the complete dataset, given the sample number for each client.
+
+  - ``partition="dirichlet"``: Given the sample number of each client, use a Dirichlet distribution for each client's class distribution. ``dir_alpha`` needs to be specified in this partition scheme.
+
+In summary, the 6 pre-defined partition schemes are:
+
+- Hetero Dirichlet (non-iid)
+- Shards (non-iid)
+- Balanced IID (iid)
+- Unbalanced IID (iid)
+- Balanced Dirichlet (non-iid)
+- Unbalanced Dirichlet (non-iid)
+
+Now, we introduce how to use these pre-defined partitions on CIFAR10 in an FL setting with 100 clients, and provide statistical plots for each scheme.
+
+First, import the related packages and define the basic settings:
+
+.. code-block:: python
+
+    import torch
+    import torchvision
+
+    import numpy as np
+    import pandas as pd
+    import matplotlib.pyplot as plt
+    import seaborn as sns
+    import sys
+
+    from fedlab.utils.dataset.partition import CIFAR10Partitioner
+    from fedlab.utils.functional import partition_report, save_dict
+
+    num_clients = 100
+    num_classes = 10
+    seed = 2021
+    hist_color = '#4169E1'
+
+Second, we need to load the CIFAR10 dataset from ``torchvision``:
+
+.. code-block:: python
+
+    trainset = torchvision.datasets.CIFAR10(root="../../../../data/CIFAR10/",
+                                            train=True, download=True)
+
+
+Hetero Dirichlet
+^^^^^^^^^^^^^^^^
+
+Perform partition:
+
+.. code-block:: python
+
+    hetero_dir_part = CIFAR10Partitioner(trainset.targets,
+                                         num_clients,
+                                         balance=None,
+                                         partition="dirichlet",
+                                         dir_alpha=0.3,
+                                         seed=seed)
+
+
+``hetero_dir_part.client_dict`` is a dictionary like this:
+
+.. code-block:: python
+
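+    # hetero_dir_part.client_dict maps each client id to its sample indices.
+    # With num_clients = 100, the keys run from 0 to 99:
+    # { 0: indices of dataset,
+    #   1: indices of dataset,
+    #   ...
+    #   99: indices of dataset }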
+
+
+To visualize and check the partition result, we generate a partition report for the current partition and save it to a csv file:
+
+.. code-block:: python
+
+    csv_file = "./partition-reports/cifar10_hetero_dir_0.3_100clients.csv"
+    partition_report(trainset.targets, hetero_dir_part.client_dict,
+                     class_num=num_classes,
+                     verbose=False, file=csv_file)
+
+The generated report looks like this:
+
+.. code-block::
+
+    Class frequencies:
+    client,class0,class1,class2,class3,class4,class5,class6,class7,class8,class9,Amount
+    Client   0,0.170,0.00,0.104,0.00,0.145,0.004,0.340,0.041,0.075,0.120,241
+    Client   1,0.002,0.015,0.083,0.003,0.082,0.109,0.009,0.00,0.695,0.00,863
+    Client   2,0.120,0.759,0.122,0.00,0.00,0.00,0.00,0.00,0.00,0.00,526
+    ...
+
+which can be easily parsed by :func:`csv.reader` or :func:`pandas.read_csv`:
+
+.. code-block:: python
+
+    hetero_dir_part_df = pd.read_csv(csv_file, header=1)  # row 0 is the "Class frequencies:" title
+    hetero_dir_part_df = hetero_dir_part_df.set_index('client')
+    col_names = [f"class{i}" for i in range(num_classes)]
+    for col in col_names:
+        hetero_dir_part_df[col] = (hetero_dir_part_df[col] * hetero_dir_part_df['Amount']).astype(int)
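The frequency-to-count conversion above can be checked by hand on the report excerpt. The following is a minimal sketch that uses only the standard library :mod:`csv` module and the first two client rows copied from the report shown earlier; the helper name ``class_counts`` is hypothetical and not part of FedLab. It also illustrates one caveat worth knowing: because the frequencies are rounded to three decimals, truncating with ``int`` can undercount, while rounding to the nearest integer tends to recover the reported ``Amount``.

```python
import csv

# First rows of the report shown above (values copied from the tutorial).
report = """Class frequencies:
client,class0,class1,class2,class3,class4,class5,class6,class7,class8,class9,Amount
Client   0,0.170,0.00,0.104,0.00,0.145,0.004,0.340,0.041,0.075,0.120,241
Client   1,0.002,0.015,0.083,0.003,0.082,0.109,0.009,0.00,0.695,0.00,863
"""

# Skip the "Class frequencies:" title line, as header=1 does in pandas.
rows = list(csv.DictReader(report.splitlines()[1:]))
class_cols = [f"class{i}" for i in range(10)]


def class_counts(row, to_int):
    """Turn per-class frequencies into per-class sample counts (hypothetical helper)."""
    amount = int(row["Amount"])
    return [to_int(float(row[col]) * amount) for col in class_cols]


truncated = class_counts(rows[0], int)   # truncation, like .astype(int)
rounded = class_counts(rows[0], round)   # nearest integer instead

print("truncated total:", sum(truncated))
print("rounded total:  ", sum(rounded))
print("reported Amount:", rows[0]["Amount"])
```

For client 0, the truncated counts sum to 235 while the rounded counts sum to the reported 241, so rounding may be the safer choice when exact counts matter for a plot.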
+
+Now, select the first 10 clients for class distribution bar plot:
+
+.. code-block:: python
+
+    hetero_dir_part_df[col_names].iloc[:10].plot.barh(stacked=True)
+    plt.tight_layout()
+    plt.xlabel('sample num')
+    plt.savefig("./imgs/cifar10_hetero_dir_0.3_100clients.png", dpi=400)
+
+.. image:: ../../imgs/data-partition/cifar10_hetero_dir_0.3_100clients.png
+   :align: center
+   :width: 400
+
+We can also check the distribution of sample numbers across all clients:
+
+.. code-block:: python
+
+    clt_sample_num_df = hetero_dir_part.client_sample_count
+    sns.histplot(data=clt_sample_num_df,
+                 x="num_samples",
+                 edgecolor='none',
+                 alpha=0.7,
+                 shrink=0.95,
+                 color=hist_color)
+    plt.savefig("./imgs/cifar10_hetero_dir_0.3_100clients_dist.png", dpi=400, bbox_inches='tight')
+
+.. image:: ../../imgs/data-partition/cifar10_hetero_dir_0.3_100clients_dist.png
+   :align: center
+   :width: 300
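To make the idea behind the Hetero Dirichlet scheme concrete, here is a minimal standard-library sketch of label-wise Dirichlet partitioning: for every class, a proportion vector over clients is drawn from a symmetric Dirichlet distribution and that class's samples are split accordingly. This is an illustration under stated assumptions, not FedLab's implementation; the function names ``dirichlet_proportions`` and ``dirichlet_label_partition`` are hypothetical, and the toy labels stand in for ``trainset.targets``.

```python
import random


def dirichlet_proportions(rng, alpha, n):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalised gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow; fall back to uniform
        return [1.0 / n] * n
    return [d / total for d in draws]


def dirichlet_label_partition(targets, num_clients, alpha, seed=0):
    """Split sample indices class by class using Dirichlet proportions (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(targets):
        by_class.setdefault(label, []).append(idx)

    client_dict = {cid: [] for cid in range(num_clients)}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        props = dirichlet_proportions(rng, alpha, num_clients)
        start, cumulative = 0, 0.0
        for cid in range(num_clients):
            cumulative += props[cid]
            if cid == num_clients - 1:
                end = len(indices)  # last client takes whatever remains
            else:
                end = min(len(indices), int(cumulative * len(indices)))
            client_dict[cid].extend(indices[start:end])
            start = end
    return client_dict


# Toy stand-in for trainset.targets: 10 classes with 500 samples each.
toy_targets = [i % 10 for i in range(5000)]
toy_parts = dirichlet_label_partition(toy_targets, num_clients=100, alpha=0.3, seed=2021)
sizes = [len(v) for v in toy_parts.values()]
print("smallest client:", min(sizes), "largest client:", max(sizes))
```

A small ``alpha`` concentrates each class on a few clients, which is why both the class mix and the client sizes vary so much in the plots above; larger values move the split toward iid.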
+
+Shards
+^^^^^^
+
+Perform partition:
+
+.. code-block:: python
+
+    num_shards = 200
+    shards_part = CIFAR10Partitioner(trainset.targets,
+                                     num_clients,
+                                     balance=None,
+                                     partition="shards",
+                                     num_shards=num_shards,
+                                     seed=seed)
+
+Class distribution bar plot:
+
+.. image:: ../../imgs/data-partition/cifar10_shards_200_100clients.png
+   :align: center
+   :width: 400
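The shards scheme can be sketched in a few lines: sort the sample indices by label, cut them into ``num_shards`` equal contiguous shards, and deal the shards out to clients at random. The code below is a standard-library illustration of that idea only; the function name is hypothetical, and the real behaviour is documented in :func:`fedlab.utils.dataset.functional.shards_partition`. It assumes ``num_shards`` is a multiple of ``num_clients``.

```python
import random


def shards_partition_sketch(targets, num_clients, num_shards, seed=0):
    """Assign label-sorted shards to clients at random (hypothetical helper)."""
    if num_shards % num_clients != 0:
        raise ValueError("num_shards must be a multiple of num_clients")
    rng = random.Random(seed)

    # Sort indices by label so that each shard covers very few classes.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    shard_size = len(targets) // num_shards

    shard_ids = list(range(num_shards))
    rng.shuffle(shard_ids)
    shards_per_client = num_shards // num_clients

    client_dict = {}
    for cid in range(num_clients):
        own = shard_ids[cid * shards_per_client:(cid + 1) * shards_per_client]
        client_dict[cid] = [
            idx
            for shard in own
            for idx in order[shard * shard_size:(shard + 1) * shard_size]
        ]
    return client_dict


# Toy labels: 10 classes with 100 samples each, 20 clients, 40 shards.
toy_labels = [c for c in range(10) for _ in range(100)]
toy_shards = shards_partition_sketch(toy_labels, num_clients=20, num_shards=40, seed=2021)
labels_per_client = [len({toy_labels[i] for i in idxs}) for idxs in toy_shards.values()]
print("max classes held by one client:", max(labels_per_client))
```

With two shards per client, no client in the toy example can hold more than two classes, which matches the strongly non-iid pattern in the bar plot above.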
+
+
+Balanced IID
+^^^^^^^^^^^^
+
+Perform partition:
+
+.. code-block:: python
+
+    balance_iid_part = CIFAR10Partitioner(trainset.targets,
+                                          num_clients,
+                                          balance=True,
+                                          partition="iid",
+                                          seed=seed)
+
+Class distribution bar plot:
+
+.. image:: ../../imgs/data-partition/cifar10_balance_iid_100clients.png
+   :align: center
+   :width: 400
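For comparison, the balanced iid case reduces to shuffling all indices once and cutting them into equal slices. The sketch below uses only the standard library; the function name is hypothetical and it is not the code behind :func:`fedlab.utils.dataset.functional.balance_partition`. Any remainder samples are simply left unassigned here, which is an assumption of the sketch.

```python
import random


def balanced_iid_sketch(num_samples, num_clients, seed=0):
    """Shuffle indices once, then give every client an equal slice (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    per_client = num_samples // num_clients  # remainder samples are left out
    return {
        cid: indices[cid * per_client:(cid + 1) * per_client]
        for cid in range(num_clients)
    }


iid_parts = balanced_iid_sketch(num_samples=50000, num_clients=100, seed=2021)
print("samples per client:", {len(v) for v in iid_parts.values()})
```

Because each slice is a uniform random subset, every client sees roughly the same class mix, which is what the flat bar plot above reflects.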
+
+Unbalanced IID
+^^^^^^^^^^^^^^
+
+Perform partition:
+
+.. code-block:: python
+
+    unbalance_iid_part = CIFAR10Partitioner(trainset.targets,
+                                            num_clients,
+                                            balance=False,
+                                            partition="iid",
+                                            unbalance_sgm=0.3,
+                                            seed=seed)
+
+Class distribution bar plot:
+
+.. image:: ../../imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients.png
+   :align: center
+   :width: 400
+
+Distribution of sample numbers across clients:
+
+.. image:: ../../imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients_dist.png
+   :align: center
+   :width: 300
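The unbalanced schemes first decide how many samples each client receives. A minimal sketch of that step is given below, using the standard library's ``random.lognormvariate``. It assumes that ``unbalance_sgm`` plays the role of ``sigma`` of the underlying normal distribution and that the draws are rescaled to sum to the dataset size; both are assumptions of this sketch rather than a statement of FedLab's exact procedure, and the function name is hypothetical.

```python
import math
import random


def lognormal_sample_counts(num_samples, num_clients, sigma, seed=0):
    """Draw per-client sample counts from a log-normal, rescaled to num_samples (hypothetical helper)."""
    rng = random.Random(seed)
    average = num_samples / num_clients
    if sigma == 0:
        raw = [average] * num_clients  # degenerate case: perfectly balanced
    else:
        raw = [rng.lognormvariate(math.log(average), sigma) for _ in range(num_clients)]

    scale = num_samples / sum(raw)
    counts = [int(value * scale) for value in raw]

    # Hand out the few samples lost to truncation, one per client in turn.
    for i in range(num_samples - sum(counts)):
        counts[i % num_clients] += 1
    return counts


balanced_counts = lognormal_sample_counts(50000, 100, sigma=0.0, seed=2021)
unbalanced_counts = lognormal_sample_counts(50000, 100, sigma=0.3, seed=2021)
print("balanced:  ", min(balanced_counts), max(balanced_counts))
print("unbalanced:", min(unbalanced_counts), max(unbalanced_counts))
```

Setting ``sigma=0`` reproduces the balanced case, in line with the note above that ``unbalance_sgm=0`` yields a balanced partition; larger values widen the histogram of client sizes.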
+
+Balanced Dirichlet
+^^^^^^^^^^^^^^^^^^
+
+Perform partition:
+
+.. code-block:: python
+
+    balance_dir_part = CIFAR10Partitioner(trainset.targets,
+                                          num_clients,
+                                          balance=True,
+                                          partition="dirichlet",
+                                          dir_alpha=0.3,
+                                          seed=seed)
+
+Class distribution bar plot:
+
+.. image:: ../../imgs/data-partition/cifar10_balance_dir_alpha_0.3_100clients.png
+   :align: center
+   :width: 400
+
+
+Unbalanced Dirichlet
+^^^^^^^^^^^^^^^^^^^^
+
+Perform partition:
+
+.. code-block:: python
+
+    unbalance_dir_part = CIFAR10Partitioner(trainset.targets,
+                                            num_clients,
+                                            balance=False,
+                                            partition="dirichlet",
+                                            unbalance_sgm=0.3,
+                                            dir_alpha=0.3,
+                                            seed=seed)
+
+Class distribution bar plot:
 
+.. image:: ../../imgs/data-partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients.png
+   :align: center
+   :width: 400
 
+Distribution of sample numbers across clients:
 
+.. image:: ../../imgs/data-partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients_dist.png
+   :align: center
+   :width: 300
 
-In codes above, data\_indices is a dictionary (in order to ensure that the partition result is consistent in the case of cross-machine, the user can save the division dictionary in a file) like this：
+.. note::
 
-.. code:: python
+    For a complete usage example of :class:`CIFAR10Partitioner`, check the FedLab benchmark `datasets part <https://github.com/SMILELab-FL/FedLab-benchmarks/tree/main/fedlab_benchmarks/datasets/cifar10/>`_.
 
-    dict= { '0': indices of dataset,
-            '1': indices of dataset,
-            ...
-            'k': indices of dataset }
+SubsetSampler
+=============
 
 By using torch's sampler, each client draws only its own part of the samples from the overall dataset.
 
-.. code:: python
+.. code-block:: python
 
     from fedlab.utils.dataset.sampler import SubsetSampler
 
-    train_loader = torch.utils.data.DataLoader(
-                    trainset,
-                    sampler=SubsetSampler(indices=data_slices[client_id],
-                                          shuffle=True),
-                    batch_size=batch_size)
+    train_loader = torch.utils.data.DataLoader(trainset,
+                                               sampler=SubsetSampler(indices=partition[client_id], shuffle=True),
+                                               batch_size=batch_size)
 
-There is also a similar implementation of directly reordering and partition the dataset, see fedlab.utils.dataset.sampler.RawPartitionSampler for details.
+There is also a similar implementation that directly reorders and partitions the dataset; see :class:`fedlab.utils.dataset.sampler.RawPartitionSampler` for details.
 
 In addition to dividing the dataset with torch's sampler, a dataset can also be divided directly by splitting the dataset file. For an implementation, refer to the FedLab version of LEAF.
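To show what a subset sampler contributes in the ``DataLoader`` call above, here is a dependency-free sketch of the idea: a sampler is an iterable that yields dataset indices, restricted here to one client's share. The class ``SimpleSubsetSampler`` is a hypothetical stand-in written for illustration; it is not FedLab's ``SubsetSampler`` and does not inherit from ``torch.utils.data.Sampler``.

```python
import random


class SimpleSubsetSampler:
    """Yield only the given indices, optionally in shuffled order (illustrative stand-in)."""

    def __init__(self, indices, shuffle=False, seed=None):
        self.indices = list(indices)
        self.shuffle = shuffle
        self._rng = random.Random(seed)

    def __iter__(self):
        order = list(self.indices)
        if self.shuffle:
            self._rng.shuffle(order)  # a fresh order on every pass
        return iter(order)

    def __len__(self):
        return len(self.indices)


# Toy partition: client id -> sample indices, shaped like client_dict above.
toy_partition = {0: [3, 8, 15, 42], 1: [1, 2, 99]}
sampler = SimpleSubsetSampler(toy_partition[0], shuffle=True, seed=2021)
drawn = list(sampler)
print(sorted(drawn))
```

Whatever the order, the sampler only ever yields indices that belong to the chosen client, which is how a single shared dataset object can serve every simulated client.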
diff --git a/docs/source/tutorials/tutorial_2.rst b/docs/source/tutorials/tutorial_2.rst
@@ -94,7 +94,7 @@ procedure is shown as follows.
     :class: only-dark
 
 Initialization stage
-=======================
+====================
 
 Initialization stage is represented by :meth:`manager.setup()` function.