Merge pull request #146 from SMILELab-FL/docs
Docs
dunzeng authored Sep 25, 2021
2 parents c8792c0 + 9054776 commit 0937b84
Showing 12 changed files with 311 additions and 19 deletions.
Binary file modified docs/imgs/data-partition/cifar10_balance_iid_100clients.png
Binary file modified docs/imgs/data-partition/cifar10_hetero_dir_0.3_100clients.png
Binary file modified docs/imgs/data-partition/cifar10_shards_200_100clients.png
23 changes: 23 additions & 0 deletions docs/source/refs.bib
@@ -27,3 +27,26 @@ @article{lin2017deep
journal={arXiv preprint arXiv:1712.01887},
year={2017}
}

@inproceedings{acar2020federated,
title={Federated learning based on dynamic regularization},
author={Acar, Durmus Alp Emre and Zhao, Yue and Matas, Ramon and Mattina, Matthew and Whatmough, Paul and Saligrama, Venkatesh},
booktitle={International Conference on Learning Representations},
year={2020}
}

@inproceedings{yurochkin2019bayesian,
title={Bayesian nonparametric federated learning of neural networks},
author={Yurochkin, Mikhail and Agarwal, Mayank and Ghosh, Soumya and Greenewald, Kristjan and Hoang, Nghia and Khazaeni, Yasaman},
booktitle={International Conference on Machine Learning},
pages={7252--7261},
year={2019},
organization={PMLR}
}

@article{wang2020federated,
title={Federated learning with matched averaging},
author={Wang, Hongyi and Yurochkin, Mikhail and Sun, Yuekai and Papailiopoulos, Dimitris and Khazaeni, Yasaman},
journal={arXiv preprint arXiv:2002.06440},
year={2020}
}
305 changes: 287 additions & 18 deletions docs/source/tutorials/data_partition.rst
@@ -1,41 +1,310 @@
.. _data-partition:

***************
DataPartitioner
***************

This chapter introduces the dataset partitioner ``DataPartitioner`` and how the client process uses the corresponding dataset. **FedLab** provides various partition strategies for different dataset situations.
In the real world, FL needs to handle various kinds of data distribution scenarios, including
iid and non-iid scenarios. Although partition schemes already exist for some published benchmarks,
it can still be messy and hard for researchers to partition datasets according to their specific
research problems and to maintain partition results during simulation. FedLab provides :class:`fedlab.utils.dataset.partition.DataPartitioner`, which allows you to use pre-partitioned datasets as well as your own data. :class:`DataPartitioner` stores sample indices for each client given a data partition scheme.

FedLab provides a number of pre-defined partition schemes for some datasets (such as CIFAR10) that subclass :class:`fedlab.utils.dataset.partition.DataPartitioner` and implement functions specific to a particular partition scheme. They can be used to prototype and benchmark your FL algorithms.

For classification datasets, FedLab provides non-iid and random partition methods.

CIFAR10Partitioner
==================

For CIFAR10, we provide 6 pre-defined partition schemes. We partition CIFAR10 with the following parameters:

- ``targets`` is the labels of the dataset to partition
- ``num_clients`` specifies the number of clients in the partition scheme
- ``balance`` refers to the FL scenario in which the sample numbers for different clients are the same
- ``partition`` specifies the partition scheme name
- ``unbalance_sgm`` is the parameter for unbalanced partition
- ``num_shards`` is the parameter for non-iid partition using shards
- ``dir_alpha`` is the parameter of the Dirichlet distribution used in partition
- ``verbose`` controls whether to print intermediate information
- ``seed`` sets the random seed

Each partition scheme can be applied on CIFAR10 using different combinations of parameters:

- ``balance=None``: do not specify sample numbers for each client in advance

  - ``partition="dirichlet"``: non-iid partition used in
    :cite:t:`yurochkin2019bayesian` and :cite:t:`wang2020federated`. ``dir_alpha`` needs to be specified in this partition scheme.

  - ``partition="shards"``: non-iid method used in FedAvg :cite:p:`mcmahan2017communication`. Refer to :func:`fedlab.utils.dataset.functional.shards_partition` for more information. ``num_shards`` needs to be specified here.

- ``balance=True``: "Balance" refers to the FL scenario in which the sample numbers for different clients are the same. Refer to :func:`fedlab.utils.dataset.functional.balance_partition` for more information. This partition scheme is from :cite:t:`acar2020federated`.

  - ``partition="iid"``: Randomly select samples from the complete dataset given the sample number for each client.

  - ``partition="dirichlet"``: Refer to :func:`fedlab.utils.dataset.functional.client_inner_dirichlet_partition` for more information. ``dir_alpha`` needs to be specified in this partition scheme.

- ``balance=False``: "Unbalance" refers to the FL scenario in which the sample numbers for different clients are different. For the unbalanced method, the sample number for each client is drawn from a Log-Normal distribution with variance ``unbalance_sgm``. When ``unbalance_sgm=0``, the partition is balanced. This partition scheme is from :cite:t:`acar2020federated`.

  - ``partition="iid"``: Randomly select samples from the complete dataset given the sample number for each client.

  - ``partition="dirichlet"``: Given the sample number of each client, use a Dirichlet distribution for each client's class distribution. ``dir_alpha`` needs to be specified in this partition scheme.
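To make the "unbalance" case more concrete, the following is a minimal sketch of drawing per-client sample numbers from a Log-Normal distribution. It is not FedLab's implementation: the helper name ``lognormal_client_sizes`` is hypothetical, it uses only the Python standard library, and it assumes the spread parameter is passed as the ``sigma`` of the underlying normal distribution.

```python
import math
import random


def lognormal_client_sizes(num_samples, num_clients, sgm, seed=None):
    """Draw per-client sample counts from a log-normal distribution.

    Hypothetical helper for illustration only (not part of FedLab).
    """
    rng = random.Random(seed)
    mean_size = num_samples / num_clients
    if sgm == 0:
        # No spread: every client gets the same share (balanced case).
        raw = [mean_size] * num_clients
    else:
        raw = [rng.lognormvariate(math.log(mean_size), sgm)
               for _ in range(num_clients)]
    # Rescale the draws so that the counts add up to num_samples.
    scale = num_samples / sum(raw)
    sizes = [int(r * scale) for r in raw]
    # Hand out the few samples lost by rounding down, one per client.
    leftover = num_samples - sum(sizes)
    for i in range(leftover):
        sizes[i % num_clients] += 1
    return sizes


sizes = lognormal_client_sizes(50000, 100, sgm=0.3, seed=2021)
print(min(sizes), max(sizes), sum(sizes))
```

With ``sgm=0`` the sketch degenerates to equal shares, which mirrors the statement above that the partition is then balanced.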

To conclude, 6 pre-defined partition schemes can be summarized as:

- Hetero Dirichlet (non-iid)
- Shards (non-iid)
- Balanced IID (iid)
- Unbalanced IID (iid)
- Balanced Dirichlet (non-iid)
- Unbalanced Dirichlet (non-iid)
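The mapping from parameter combinations to the six schemes listed above can be written down as a small lookup table. This is only an illustrative sketch: ``scheme_name`` is a hypothetical helper, not a FedLab function, and rejecting other combinations is an assumption based on the list above.

```python
def scheme_name(balance, partition):
    """Return the scheme label for a (balance, partition) combination.

    Hypothetical helper for illustration only (not part of FedLab).
    """
    schemes = {
        (None, "dirichlet"): "Hetero Dirichlet (non-iid)",
        (None, "shards"): "Shards (non-iid)",
        (True, "iid"): "Balanced IID (iid)",
        (False, "iid"): "Unbalanced IID (iid)",
        (True, "dirichlet"): "Balanced Dirichlet (non-iid)",
        (False, "dirichlet"): "Unbalanced Dirichlet (non-iid)",
    }
    try:
        return schemes[(balance, partition)]
    except KeyError:
        # Combinations outside the six listed schemes are treated as invalid.
        raise ValueError(
            f"unsupported combination: balance={balance!r}, "
            f"partition={partition!r}") from None


print(scheme_name(None, "shards"))
print(scheme_name(False, "dirichlet"))
```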

Now, we introduce how to use these pre-defined partitions on CIFAR10 in an FL setting with 100 clients, and provide statistical plots for each scheme.

First, import the related packages and define the basic settings:

.. code-block:: python

    import torch
    import torchvision
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sys

    from fedlab.utils.dataset.partition import CIFAR10Partitioner
    from fedlab.utils.functional import partition_report, save_dict

    num_clients = 100
    num_classes = 10
    seed = 2021
    hist_color = '#4169E1'

Second, we need to load the CIFAR10 dataset from ``torchvision``:

.. code-block:: python

    trainset = torchvision.datasets.CIFAR10(root="../../../../data/CIFAR10/",
                                            train=True, download=True)

Hetero Dirichlet
^^^^^^^^^^^^^^^^

Perform partition:

.. code-block:: python

    hetero_dir_part = CIFAR10Partitioner(trainset.targets,
                                         num_clients,
                                         balance=None,
                                         partition="dirichlet",
                                         dir_alpha=0.3,
                                         seed=seed)

``hetero_dir_part.client_dict`` is a dictionary like this:

.. code-block:: python

    hetero_dir_part.client_dict = { 0: indices of dataset,
                                    1: indices of dataset,
                                    ...
                                    99: indices of dataset }

To visualize and check the partition result, we generate a partition report for the current partition and save it to a csv file:

.. code-block:: python

    csv_file = "./partition-reports/cifar10_hetero_dir_0.3_100clients.csv"
    partition_report(trainset.targets, hetero_dir_part.client_dict,
                     class_num=num_classes,
                     verbose=False, file=csv_file)

The generated report looks like this:

.. code-block::

    Class frequencies:
    client,class0,class1,class2,class3,class4,class5,class6,class7,class8,class9,Amount
    Client 0,0.170,0.00,0.104,0.00,0.145,0.004,0.340,0.041,0.075,0.120,241
    Client 1,0.002,0.015,0.083,0.003,0.082,0.109,0.009,0.00,0.695,0.00,863
    Client 2,0.120,0.759,0.122,0.00,0.00,0.00,0.00,0.00,0.00,0.00,526
    ...

which can be easily parsed by :func:`csv.reader` or :func:`pandas.read_csv`:

.. code-block:: python

    # header=1 skips the "Class frequencies:" title line
    hetero_dir_part_df = pd.read_csv(csv_file, header=1)
    hetero_dir_part_df = hetero_dir_part_df.set_index('client')
    col_names = [f"class{i}" for i in range(num_classes)]
    # convert class frequencies into sample counts
    for col in col_names:
        hetero_dir_part_df[col] = (hetero_dir_part_df[col] * hetero_dir_part_df['Amount']).astype(int)

Now, select the first 10 clients for the class distribution bar plot:

.. code-block:: python

    hetero_dir_part_df[col_names].iloc[:10].plot.barh(stacked=True)
    plt.tight_layout()
    plt.xlabel('sample num')
    plt.savefig(f"./imgs/cifar10_hetero_dir_0.3_100clients.png", dpi=400)

.. image:: ../../imgs/data-partition/cifar10_hetero_dir_0.3_100clients.png
   :align: center
   :width: 400
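The report format shown earlier can also be read with the standard ``csv`` module instead of pandas. The sketch below is illustrative only: ``parse_report`` is a hypothetical helper, the report text is the three-client excerpt from this section, and counts are rounded to the nearest integer (the pandas snippet above truncates with ``astype(int)`` instead).

```python
import csv
import io

# Excerpt of the report shown in this section.
report_text = """Class frequencies:
client,class0,class1,class2,class3,class4,class5,class6,class7,class8,class9,Amount
Client 0,0.170,0.00,0.104,0.00,0.145,0.004,0.340,0.041,0.075,0.120,241
Client 1,0.002,0.015,0.083,0.003,0.082,0.109,0.009,0.00,0.695,0.00,863
Client 2,0.120,0.759,0.122,0.00,0.00,0.00,0.00,0.00,0.00,0.00,526
"""


def parse_report(lines):
    """Turn a partition report into {client: {class_name: sample_count}}.

    Hypothetical helper for illustration only (not part of FedLab).
    """
    rows = iter(lines)
    next(rows)  # skip the "Class frequencies:" title line
    counts = {}
    for row in csv.DictReader(rows):
        amount = int(row["Amount"])
        counts[row["client"]] = {
            name: round(float(freq) * amount)
            for name, freq in row.items()
            if name.startswith("class")
        }
    return counts


counts = parse_report(io.StringIO(report_text))
for client, per_class in counts.items():
    print(client, per_class)
```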

We can also check the sample number statistics for all clients:

.. code-block:: python

    clt_sample_num_df = hetero_dir_part.client_sample_count
    sns.histplot(data=clt_sample_num_df,
                 x="num_samples",
                 edgecolor='none',
                 alpha=0.7,
                 shrink=0.95,
                 color=hist_color)
    plt.savefig(f"./imgs/cifar10_hetero_dir_0.3_100clients_dist.png", dpi=400, bbox_inches = 'tight')

.. image:: ../../imgs/data-partition/cifar10_hetero_dir_0.3_100clients_dist.png
   :align: center
   :width: 300
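As a rough illustration of the idea behind the hetero Dirichlet scheme, the sketch below splits each class across clients with proportions drawn from a Dirichlet distribution. It is a simplified sketch under stated assumptions, not FedLab's implementation: ``dirichlet`` and ``hetero_dirichlet_split`` are hypothetical helpers, only the standard library is used, and details such as minimum client sizes are ignored.

```python
import random


def dirichlet(alpha, size, rng):
    """Sample a symmetric Dirichlet(alpha) vector via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def hetero_dirichlet_split(targets, num_clients, alpha, seed=None):
    """Assign sample indices to clients, class by class.

    Hypothetical helper for illustration only (not part of FedLab).
    """
    rng = random.Random(seed)
    client_dict = {cid: [] for cid in range(num_clients)}
    for label in sorted(set(targets)):
        idxs = [i for i, t in enumerate(targets) if t == label]
        rng.shuffle(idxs)
        props = dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for cid, p in enumerate(props):
            cum += p
            # The last client takes whatever remains of this class.
            if cid == num_clients - 1:
                end = len(idxs)
            else:
                end = int(round(cum * len(idxs)))
            client_dict[cid].extend(idxs[start:end])
            start = end
    return client_dict


toy_targets = [i % 10 for i in range(1000)]  # 10 classes, 100 samples each
parts = hetero_dirichlet_split(toy_targets, num_clients=20, alpha=0.3, seed=2021)
print(sorted(len(v) for v in parts.values()))
```

A small ``alpha`` concentrates each class on a few clients, which is why both the class mix and the sample number vary across clients in the plots above.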

Shards
^^^^^^

Perform partition:

.. code-block:: python

    num_shards = 200
    shards_part = CIFAR10Partitioner(trainset.targets,
                                     num_clients,
                                     balance=None,
                                     partition="shards",
                                     num_shards=num_shards,
                                     seed=seed)

Class distribution bar plot:

.. image:: ../../imgs/data-partition/cifar10_shards_200_100clients.png
   :align: center
   :width: 400
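The shards idea can be sketched in a few lines: sort the sample indices by label, cut them into equally sized shards, and hand each client the same number of randomly chosen shards. This is an illustrative sketch, not FedLab's ``shards_partition``: ``shards_split`` is a hypothetical helper, and it assumes ``num_shards`` is a multiple of ``num_clients`` and silently drops samples that do not fill a whole shard.

```python
import random


def shards_split(targets, num_clients, num_shards, seed=None):
    """Shard-based non-iid split.

    Hypothetical helper for illustration only (not part of FedLab).
    """
    if num_shards % num_clients != 0:
        raise ValueError("num_shards must be a multiple of num_clients")
    rng = random.Random(seed)
    # Sorting by label makes every shard contain only one or two classes.
    sorted_idxs = sorted(range(len(targets)), key=lambda i: targets[i])
    shard_size = len(targets) // num_shards
    shard_ids = list(range(num_shards))
    rng.shuffle(shard_ids)
    per_client = num_shards // num_clients
    client_dict = {}
    for cid in range(num_clients):
        own = shard_ids[cid * per_client:(cid + 1) * per_client]
        client_dict[cid] = [
            idx
            for s in own
            for idx in sorted_idxs[s * shard_size:(s + 1) * shard_size]
        ]
    return client_dict


toy_targets = [i // 100 for i in range(1000)]  # 10 classes, 100 samples each
shard_parts = shards_split(toy_targets, num_clients=10, num_shards=20, seed=2021)
for cid, idxs in shard_parts.items():
    print(cid, sorted({toy_targets[i] for i in idxs}))
```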


Balanced IID
^^^^^^^^^^^^

Perform partition:

.. code-block:: python

    balance_iid_part = CIFAR10Partitioner(trainset.targets,
                                          num_clients,
                                          balance=True,
                                          partition="iid",
                                          seed=seed)

Class distribution bar plot:

.. image:: ../../imgs/data-partition/cifar10_balance_iid_100clients.png
   :align: center
   :width: 400
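Conceptually, a balanced IID split is just a random permutation of the sample indices cut into equal slices. The sketch below illustrates this with the standard library; ``balanced_iid_split`` is a hypothetical helper and not FedLab's implementation, and it drops any remainder when the sample count is not divisible by the client count.

```python
import random


def balanced_iid_split(num_samples, num_clients, seed=None):
    """Shuffle all indices and give every client an equally sized slice.

    Hypothetical helper for illustration only (not part of FedLab).
    """
    rng = random.Random(seed)
    idxs = list(range(num_samples))
    rng.shuffle(idxs)
    size = num_samples // num_clients
    return {cid: idxs[cid * size:(cid + 1) * size]
            for cid in range(num_clients)}


iid_parts = balanced_iid_split(50000, 100, seed=2021)
print(len(iid_parts), len(iid_parts[0]))
```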

Unbalanced IID
^^^^^^^^^^^^^^

Perform partition:

.. code-block:: python

    unbalance_iid_part = CIFAR10Partitioner(trainset.targets,
                                            num_clients,
                                            balance=False,
                                            partition="iid",
                                            unbalance_sgm=0.3,
                                            seed=seed)

Class distribution bar plot:

.. image:: ../../imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients.png
   :align: center
   :width: 400

Sample number statistics for clients:

.. image:: ../../imgs/data-partition/cifar10_unbalance_iid_unbalance_sgm_0.3_100clients_dist.png
   :align: center
   :width: 300

Balanced Dirichlet
^^^^^^^^^^^^^^^^^^

Perform partition:

.. code-block:: python

    balance_dir_part = CIFAR10Partitioner(trainset.targets,
                                          num_clients,
                                          balance=True,
                                          partition="dirichlet",
                                          dir_alpha=0.3,
                                          seed=seed)

Class distribution bar plot:

.. image:: ../../imgs/data-partition/cifar10_balance_dir_alpha_0.3_100clients.png
   :align: center
   :width: 400


Unbalanced Dirichlet
^^^^^^^^^^^^^^^^^^^^

Perform partition:

.. code-block:: python

    unbalance_dir_part = CIFAR10Partitioner(trainset.targets,
                                            num_clients,
                                            balance=False,
                                            partition="dirichlet",
                                            unbalance_sgm=0.3,
                                            dir_alpha=0.3,
                                            seed=seed)

Class distribution bar plot:

.. image:: ../../imgs/data-partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients.png
   :align: center
   :width: 400

Sample number statistics for clients:

.. image:: ../../imgs/data-partition/cifar10_unbalance_dir_alpha_0.3_unbalance_sgm_0.3_100clients_dist.png
   :align: center
   :width: 300

In the code above, ``data_indices`` is a dictionary (to keep the partition result consistent across machines, the user can save the partition dictionary to a file) like this:

.. code:: python

    dict= { '0': indices of dataset,
            '1': indices of dataset,
            ...
            'k': indices of dataset }

.. note::

    For a complete usage example of :class:`CIFAR10Partitioner`, check the FedLab benchmark `datasets part <https://github.com/SMILELab-FL/FedLab-benchmarks/tree/main/fedlab_benchmarks/datasets/cifar10/>`_.
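One way to keep a partition result consistent across machines is to write the partition dictionary to a file once and load it everywhere else. The sketch below uses JSON from the standard library. ``save_partition`` and ``load_partition`` are hypothetical helpers (FedLab's own ``save_dict`` is not shown here), and because JSON stores keys as strings, the loader converts them back to integers.

```python
import json
import os
import tempfile


def save_partition(client_dict, path):
    """Write {client_id: indices} to a JSON file (hypothetical helper)."""
    serializable = {str(cid): [int(i) for i in idxs]
                    for cid, idxs in client_dict.items()}
    with open(path, "w") as f:
        json.dump(serializable, f)


def load_partition(path):
    """Read the file back and restore integer client ids."""
    with open(path) as f:
        return {int(cid): idxs for cid, idxs in json.load(f).items()}


example = {0: [5, 2, 9], 1: [1, 7], 2: [0, 3, 4, 6, 8]}
with tempfile.TemporaryDirectory() as tmp_dir:
    partition_file = os.path.join(tmp_dir, "partition.json")
    save_partition(example, partition_file)
    restored = load_partition(partition_file)
print(restored == example)
```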
SubsetSampler
=============

By using a torch sampler, only the portion of samples that belongs to a client is taken from the overall dataset.

.. code-block:: python

    import torch
    from fedlab.utils.dataset.sampler import SubsetSampler

    # `partition` maps each client id to its sample indices;
    # `client_id` and `batch_size` are set by the caller
    train_loader = torch.utils.data.DataLoader(trainset,
        sampler=SubsetSampler(indices=partition[client_id], shuffle=True),
        batch_size=batch_size)

There is also a similar implementation that directly reorders and partitions the dataset; see :class:`fedlab.utils.dataset.sampler.RawPartitionSampler` for details.

Besides using a torch sampler, the dataset can also be divided directly by splitting the dataset file. For an implementation, refer to the FedLab version of LEAF.
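To show what a subset sampler does without depending on torch, here is a minimal sketch of the idea: iterate over a fixed list of indices, optionally in shuffled order. ``SimpleSubsetSampler`` is a hypothetical class written for illustration and is not FedLab's ``SubsetSampler``; the ``seed`` argument is an added assumption to make the order reproducible.

```python
import random


class SimpleSubsetSampler:
    """Yield only the given indices, optionally reshuffled on each pass.

    Hypothetical class for illustration only (not part of FedLab).
    """

    def __init__(self, indices, shuffle=False, seed=None):
        self.indices = list(indices)
        self.shuffle = shuffle
        self._rng = random.Random(seed)

    def __iter__(self):
        order = list(self.indices)
        if self.shuffle:
            # Shuffle a copy so the stored indices keep their order.
            self._rng.shuffle(order)
        return iter(order)

    def __len__(self):
        return len(self.indices)


client_indices = [3, 1, 4, 15, 9]
sampler = SimpleSubsetSampler(client_indices, shuffle=True, seed=0)
print(len(sampler), sorted(sampler))
```

A data loader that draws from such a sampler only ever sees the client's own indices, which is what restricts each client to its partition.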
2 changes: 1 addition & 1 deletion docs/source/tutorials/tutorial_2.rst
@@ -94,7 +94,7 @@ procedure is shown as follows.
:class: only-dark

Initialization stage
====================

Initialization stage is represented by :meth:`manager.setup()` function.
