docs: update datasets tutorial (#1569)
* docs: update datasets tutorial

* docs: incorporate feedback by @MalteEbner

Co-authored-by: Malte Ebner <[email protected]>

* docs: use native torchvision dataset

* docs: incorporate feedback by @guarin

Co-authored-by: guarin <[email protected]>

* docs: update tutorial

Co-authored-by: guarin <[email protected]>

* docs: add transforms to LightlyDataset

* docs: use simclr transforms

---------

Co-authored-by: Malte Ebner <[email protected]>
Co-authored-by: guarin <[email protected]>
3 people authored Jul 8, 2024
1 parent cbd5495 commit 6b7d83d
Showing 1 changed file with 182 additions and 103 deletions: docs/source/tutorials/structure_your_input.rst
Tutorial 1: Structure Your Input
================================

The modern-day open-source ecosystem has changed a lot over the years, and there are now
many viable options for data pipelining. The `torchvision.datasets <https://pytorch.org/vision/main/datasets.html>`_ submodule provides a robust implementation for most use cases,
and the `Hugging Face Hub <https://hf.co>`_ has emerged as a growing collection of datasets that span a variety of domains and tasks.
If you want to use your own data, the ability to quickly create datasets and dataloaders is of prime importance.

In this tutorial, we will provide a brief overview of the `LightlyDataset <https://docs.lightly.ai/self-supervised-learning/lightly.data.html#lightly.data.dataset.LightlyDataset>`_
and go through examples of using datasets from various open-source libraries such as PyTorch and
Hugging Face with Lightly SSL. We will also look into how we can create dataloaders
for video tasks while incorporating weak labels.


LightlyDataset
--------------

The LightlyDataset class aims to provide a uniform data interface for all models and functions in the Lightly SSL package.
It allows us to create both image and video dataset classes with or without labels.

Supported File Types
^^^^^^^^^^^^^^^^^^^^

Since Lightly SSL uses `Pillow <https://github.com/python-pillow/Pillow>`_
for image loading, it supports all the image formats supported by Pillow.

- .jpg, .png, .tiff and
`many more <https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html>`_


Lightly SSL uses `torchvision <https://github.com/pytorch/vision>`_ and
`PyAV <https://github.com/PyAV-Org/PyAV>`_ for video loading. The following formats are supported.

- .mov, .mp4 and .avi

Unlabeled Image Datasets
^^^^^^^^^^^^^^^^^^^^^^^^^


Creating an unlabeled image dataset is the simplest and the most common use case.

Assuming all images are in a single directory, you can simply pass in the path to that directory.

.. code-block:: python

    from lightly.data import LightlyDataset
    from lightly.transforms import SimCLRTransform

    transform = SimCLRTransform()

    dataset = LightlyDataset(input_dir='image_dir/', transform=transform)

.. note::

    Internally, each image is assigned a default label of 0.

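As a quick sanity check, we can look at the first sample. A minimal sketch, assuming the `dataset` from above and that `LightlyDataset` yields `(image, label, filename)` tuples:

.. code-block:: python

    # With no labels on disk, the label defaults to 0. Note that with
    # SimCLRTransform applied, `image` is a list of two augmented views.
    image, label, filename = dataset[0]
    print(label)  # 0
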
Labeled Image Datasets
^^^^^^^^^^^^^^^^^^^^^^

If you have (weak) labels for your images and each label has its own directory,
you can simply pass in the path to the parent directory, and the dataset class
will assign each image the name of its subdirectory as its label.

.. code-block:: bash

    labeled_image_dir/
    +-- weak-label-1/
        +-- img-1.jpg
        +-- img-2.jpg
        ...
        +-- img-N1.jpg
    +-- weak-label-2/
        +-- img-1.jpg
        +-- img-2.jpg
        ...
        +-- img-N2.jpg
    ...
    +-- weak-label-10/
        +-- img-1.jpg
        +-- img-2.jpg
        ...
        +-- img-N10.jpg

.. code-block:: python

    from lightly.data import LightlyDataset
    from lightly.transforms import SimCLRTransform

    transform = SimCLRTransform()

    labeled_dataset = LightlyDataset(input_dir='labeled_image_dir/', transform=transform)

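To check which weak label each image received, we can inspect filenames and targets. A minimal sketch, assuming `LightlyDataset` exposes `get_filenames()` and yields `(image, label, filename)` tuples:

.. code-block:: python

    # Filenames are relative to the input directory, e.g. 'weak-label-1/img-1.jpg',
    # and each image's label is derived from its subdirectory.
    print(labeled_dataset.get_filenames()[:3])

    _, label, filename = labeled_dataset[0]
    print(label, filename)
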
Video Datasets
^^^^^^^^^^^^^^

The Lightly SSL package also supports videos natively (`.mov`, `.mp4`, and `.avi` file extensions are supported),
so the frames do not have to be extracted first. This can save a lot of disk space, as video files are
typically strongly compressed. Whether your videos are in one flat directory or distributed across subdirectories,
you can simply pass the path to the LightlyDataset constructor.

An input directory with videos could look like this:

.. code-block:: bash

    video_dir/
    +-- my_video_1.mov
    +-- my_video_2.mp4
    +-- subdir/
        +-- my_video_3.avi
        +-- my_video_4.avi

.. code-block:: python

    from lightly.data import LightlyDataset
    from lightly.transforms import SimCLRTransform

    transform = SimCLRTransform()

    video_dataset = LightlyDataset(input_dir='video_dir/', transform=transform)

The dataset assigns each frame the video it belongs to as a weak label.

.. note::

    To use the video-specific features of Lightly SSL, install the extra dependencies
    with `pip install "lightly[video]"`. Furthermore, randomly accessing video frames
    is slower than accessing extracted frames on disk. However, by working directly on
    video files, one can save a lot of disk space because the frames do not have to be
    extracted beforehand.

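The resulting frame dataset can be batched like any other dataset. A minimal sketch using a standard PyTorch dataloader:

.. code-block:: python

    import torch

    # Each batch contains randomly accessed frames from across the videos.
    dataloader = torch.utils.data.DataLoader(video_dataset, batch_size=8, shuffle=True)
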

PyTorch Datasets
----------------

You can also use native `torchvision <https://pytorch.org/vision/main/datasets.html>`_ datasets with Lightly SSL directly.
Just create the dataset as you normally would and pass in the desired transforms. For example, the
:ref:`simclr` self-supervised learning method expects two views of each input image. To achieve this, we can pass the
`SimCLRTransform` when creating the dataset instance, so that the dataloader returns two views per image in each batch.

.. code-block:: python

    import torchvision

    from lightly.transforms import SimCLRTransform

    transform = SimCLRTransform(input_size=32, gaussian_blur=0.0)

    dataset = torchvision.datasets.CIFAR10(
        "datasets/cifar10", download=True, transform=transform
    )

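To verify that two views are produced, we can draw a single batch. A sketch, assuming the default collate function, which gathers the two views into a list of two tensors:

.. code-block:: python

    import torch

    dataloader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)

    # `views` is a list of two tensors, one per augmented view of the batch.
    views, labels = next(iter(dataloader))
    view0, view1 = views
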
Hugging Face Datasets
---------------------

To use a dataset from the Hugging Face Hub 🤗, we can simply apply the desired transformations using the
`set_transform <https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.set_transform>`_
helper method and then create a native PyTorch dataloader.


.. code-block:: python

    from typing import Any, Dict

    import torch
    from datasets import load_dataset

    from lightly.transforms import SimCLRTransform

    dataset = load_dataset("uoft-cs/cifar10", trust_remote_code=True)

    # Use a pre-defined set of transformations from Lightly SSL.
    transform = SimCLRTransform()

    def apply_transform(batch: Dict[str, Any]) -> Dict[str, Any]:
        """Applies the given transform on all elements of batch["image"]."""
        assert "image" in batch, "batch must contain key 'image'"
        batch["image"] = [transform(img.convert("RGB")) for img in batch["image"]]
        return batch

    dataset.set_transform(apply_transform)

    dataloader = torch.utils.data.DataLoader(dataset["train"])

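Accessing an element now returns the transformed output on the fly. A sketch of what this looks like, assuming the setup above:

.. code-block:: python

    sample = dataset["train"][0]
    views = sample["image"]  # with SimCLRTransform, a list of two augmented views
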
Image Augmentations
-------------------

Many SSL methods leverage image augmentations to better learn invariances in the training process. For example,
by using different crops of a given image, the SSL model will be trained to produce a representation
that is invariant to these different crops. When using an operation such as grayscale or color jitter as an augmentation,
the model learns a representation that is invariant to color information [1]_.

We can use off-the-shelf augmentations from libraries like `torchvision transforms <https://pytorch.org/vision/stable/transforms.html>`_
and `albumentations <https://albumentations.ai/docs/>`_ or the ones offered by Lightly SSL's
`transforms <https://docs.lightly.ai/self-supervised-learning/lightly.transforms.html>`_ submodule while creating our datasets.

.. code-block:: python

    import albumentations as A
    import torchvision
    import torchvision.transforms as T
    from albumentations.pytorch import ToTensorV2

    from lightly.data import LightlyDataset
    from lightly.transforms import SimCLRTransform

    # Torchvision transforms
    torchvision_transform = T.Compose(
        [
            T.RandomHorizontalFlip(),
            T.ToTensor(),
        ]
    )

    # Albumentations transforms
    albumentation_transform = A.Compose(
        [
            A.CenterCrop(height=128, width=128),
            A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
            ToTensorV2(),
        ]
    )

    # Lightly transforms
    lightly_transform = SimCLRTransform()

    # Datasets and transforms can be mixed and matched:
    dataset = LightlyDataset(input_dir="image_dir/", transform=torchvision_transform)
    dataset = torchvision.datasets.CIFAR10("datasets/cifar10", transform=lightly_transform)

.. note::

    You can also create your own SSL augmentations. For more details, please refer
    to :ref:`lightly-custom-augmentation-5`.

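To give a flavor of what such a custom augmentation can look like, here is a hypothetical minimal two-view transform. The class name and structure are illustrative, not part of the Lightly SSL API:

.. code-block:: python

    import torchvision.transforms as T

    class TwoViewTransform:
        """Applies the same base transform twice to create two views."""

        def __init__(self):
            self.base = T.Compose([T.RandomResizedCrop(32), T.ToTensor()])

        def __call__(self, image):
            return [self.base(image), self.base(image)]
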


Conclusion
----------

In this tutorial, we went through examples of using various open-source packages to create datasets and dataloaders with Lightly SSL.
We saw that Lightly SSL is flexible enough to work with all major data sources, and that the resulting datasets can be
plugged into a training pipeline regardless of their original format.

Now that we are familiar with creating datasets and dataloaders, let's
jump right into training a model:

- :ref:`lightly-moco-tutorial-2`
- :ref:`lightly-simclr-tutorial-3`
- :ref:`lightly-simsiam-tutorial-4`
- :ref:`lightly-custom-augmentation-5`
- :ref:`lightly-detectron-tutorial-6`

If you are looking for a use case that's not covered by the above tutorials, please
let us know by `creating an issue <https://github.com/lightly-ai/lightly/issues/new>`_
for it.

.. [1] Section 3.1, Role of Data Augmentation. A Cookbook of Self-Supervised Learning (arXiv:2304.12210)
