This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

[doc] Add DataPipeline + Callbacks + Registry #207

Merged
merged 29 commits into from
Apr 13, 2021

Conversation

tchaton
Contributor

@tchaton tchaton commented Apr 9, 2021

What does this PR do?

Docs docs docs

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #207 (c096c6f) into master (4a2aa47) will decrease coverage by 0.19%.
The diff coverage is 89.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   86.53%   86.33%   -0.20%     
==========================================
  Files          57       57              
  Lines        2732     2759      +27     
==========================================
+ Hits         2364     2382      +18     
- Misses        368      377       +9     
Flag Coverage Δ
unittests 86.33% <89.83%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
flash/__init__.py 100.00% <ø> (ø)
flash/vision/backbones.py 79.72% <66.66%> (-1.43%) ⬇️
flash/data/data_pipeline.py 88.04% <75.00%> (+0.67%) ⬆️
flash/data/data_module.py 77.66% <86.11%> (-0.94%) ⬇️
flash/data/callback.py 98.70% <97.56%> (-1.30%) ⬇️
flash/core/model.py 92.21% <100.00%> (-1.77%) ⬇️
flash/core/registry.py 100.00% <100.00%> (ø)
flash/data/base_viz.py 100.00% <100.00%> (+1.75%) ⬆️
flash/data/process.py 84.41% <100.00%> (ø)
flash/data/utils.py 96.87% <100.00%> (+0.03%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4a2aa47...c096c6f. Read the comment docs.

@Borda Borda added the documentation Improvements or additions to documentation label Apr 9, 2021
@tchaton tchaton self-assigned this Apr 9, 2021
@tchaton tchaton added this to the 0.2 milestone Apr 9, 2021
@pep8speaks

pep8speaks commented Apr 9, 2021

Hello @tchaton! Thanks for updating this PR.

Line 182:13: W503 line break before binary operator

Comment last updated at 2021-04-13 12:46:25 UTC

*******************************
Using DataModule + DataPipeline
*******************************

Contributor

This is missing context.

What are these objects? Why do they exist? when do I need to use them?

I think we can do something like

  1. How to use out-of-the-box flash datamodules
  2. How to customize existing datamodules
  3. How to build a datamodule for a new task
  4. What are data pipelines and when are they required
  5. How to use them

Contributor Author

Yes. Definitely. I will work on this next week :)

flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)

The :class:`~flash.data.process.Preprocess` is currently supporting the following hooks:

- ``load_data``: Expects some metadata and returns a Mapping (it can be a ``Dataset``, but this is not recommended)
Contributor

Not clear what is "some metadata" and what mapping. can we be specific?

Contributor Author

Added an example. Is it better?
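For illustration, here is a minimal sketch of what such a ``load_data`` example could look like. The class name, folder layout, and return type are assumptions for this sketch, not the actual flash API:

```python
import os
from typing import List, Tuple

class ImageClassificationPreprocessSketch:
    """Stand-in for a flash ``Preprocess``; illustrative only."""

    def load_data(self, folder: str) -> List[Tuple[str, int]]:
        # The "metadata" here is just a folder path; the returned
        # Mapping-like value is a plain list of (path, label) pairs.
        classes = sorted(os.listdir(folder))
        return [
            (os.path.join(folder, cls, name), label)
            for label, cls in enumerate(classes)
            for name in sorted(os.listdir(os.path.join(folder, cls)))
        ]
```

The point being illustrated: the input is lightweight metadata (a path), and the output is an indexable collection of samples rather than a full ``Dataset``.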

@edenlightning
Contributor

edenlightning commented Apr 9, 2021

This is a lot of information :)

As I see it, there are 3 types of documentation:

  1. API (what is the class/method doing, what are the inputs and outputs)
  2. How to use it
  3. How it works

Right now looks like this is all together in the autodoc.
I suggest we separate this.

The autodoc should contain ONLY the API stuff, not how it works etc.

Then, the rest of the files need to include two layers of abstraction:

  1. How to use this API
  2. How it works behind the scenes

When is this API needed? When building a task, or just when using a custom dataset? We might want that separated too.

Also, it would be good to explain somewhere the relationship between hooks. For example, is load_sample called by load_data?

flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)

Each hook can be specialized by adding a prefix such as ``train``, ``val``, ``test``, or ``predict``.

Example::
Contributor

Explain this. For example, if we want the data loading at training to do bla bla bla then ....

Contributor Author

Done !
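To make the prefix mechanism concrete, here is a hedged sketch of how a stage-prefixed hook could shadow the generic one. The class and the ``resolve_hook`` helper are illustrative stand-ins, not flash internals:

```python
class PreprocessSketch:
    """Stand-in preprocess with a generic and a train-specific hook."""

    def load_data(self, data):
        return list(data)

    def train_load_data(self, data):
        # Training-only variant: e.g. drop missing samples.
        return [d for d in data if d is not None]

def resolve_hook(obj, name, stage):
    """Prefer ``<stage>_<name>`` over the generic ``<name>`` hook."""
    return getattr(obj, f"{stage}_{name}", None) or getattr(obj, name)

p = PreprocessSketch()
resolve_hook(p, "load_data", "train")(["a", None, "b"])  # -> ["a", "b"]
resolve_hook(p, "load_data", "val")(["a", None])         # falls back to load_data
```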

flash/data/process.py (outdated, resolved)

.. note::

It is possible to wrap a ``Dataset`` within a :class:`~flash.data.process.Preprocess` ``load_data`` function.
Contributor

Please clarify.
When would you want to wrap a dataset? is it recommended or not?

Contributor Author

Not recommended. I added it there.

The ``pre_tensor_transform``, ``to_tensor_transform``, ``post_tensor_transform``, ``collate``, and ``per_batch_transform``
hooks are injected as the ``collate_fn`` of the DataLoader.

Here is the pseudo code using the preprocess hook names. Flash will take care of calling the right hooks for each stage.
Contributor

This is def a lot for autodoc. Should we move this to the rst file? Under "How this works under the hood" (if it is necessary)

Contributor Author

Yes, I wrote everything in one place to not forget anything and will split it after.
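As a rough sketch of the injection described above (illustrative only; flash's actual wiring differs, and plain callables stand in for the hooks), the per-sample hooks plus ``collate`` can be folded into a single DataLoader ``collate_fn``:

```python
from typing import Any, Callable, Sequence

def make_collate_fn(
    pre_tensor_transform: Callable,
    to_tensor_transform: Callable,
    post_tensor_transform: Callable,
    collate: Callable,
    per_batch_transform: Callable,
) -> Callable:
    """Compose the per-sample hooks, collation, and batch hook."""
    def collate_fn(samples: Sequence[Any]) -> Any:
        samples = [
            post_tensor_transform(to_tensor_transform(pre_tensor_transform(s)))
            for s in samples
        ]
        return per_batch_transform(collate(samples))
    return collate_fn

identity = lambda x: x
fn = make_collate_fn(identity, identity, identity, list, identity)
```

A real ``DataLoader(dataset, collate_fn=fn)`` call would then run all five hooks once per batch.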


return datasets.MNIST(path_to_data, download=True, transform=transforms.ToTensor())

.. note:: The ``load_data`` and ``load_sample`` hooks will be used to generate an ``AutoDataset`` object.
Contributor

This is growing way too long. Is this just a metadata object? If so, let's add this description inside the hook. Please explain it in words, not in pseudo code.

Contributor

@awaelchli awaelchli left a comment

I agree with Eden, it would be better to separate user guide from API doc strings.

docs/source/custom_task.rst (outdated, resolved)

def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Iterable:

print(self.current_fn)
Contributor

Why the need for current_fn? When would we need that?
One can always get the name of the current function through __name__.
Tracking the current function being executed ourselves could be very fragile.

Contributor Author

When running inside a hook, how would you get the hook name?

It is used for computer vision, where transforms are provided as a dictionary with key: hook_name, value: transforms.

It enables us to implement a current_transform property available to users directly: https://github.com/PyTorchLightning/lightning-flash/blob/4a2aa476c2dcf56d9384c969fc351ea7f46a3b3a/flash/data/process.py#L140
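A minimal sketch of that idea (a stand-in class; the real implementation at the linked line differs):

```python
class PreprocessSketch:
    """Tracks which hook is running so per-hook transforms can resolve."""

    def __init__(self, transforms):
        self._transforms = transforms  # {hook_name: callable}
        self.current_fn = None

    @property
    def current_transform(self):
        # Fall back to a no-op when no transform is registered for the hook.
        return self._transforms.get(self.current_fn, lambda x: x)

    def run_hook(self, name, sample):
        self.current_fn = name
        try:
            return self.current_transform(sample)
        finally:
            self.current_fn = None

p = PreprocessSketch({"pre_tensor_transform": str.upper})
p.run_hook("pre_tensor_transform", "abc")  # -> "ABC"
```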

@@ -123,12 +123,17 @@ def _find_matching_index(self, item: _REGISTERED_FUNCTION) -> Optional[int]:

def __call__(
Contributor

Not sure if __call__ is very intuitive. I would expect a .add or similar.

Contributor Author

Yes, we had several chats about this, but it makes things nicer when used as a decorator.

Contributor

We could support both?
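For context, a toy sketch of why ``__call__`` pairs well with decorators, and how both spellings could be supported. This is a stand-in, not the actual FlashRegistry code:

```python
class RegistrySketch:
    """Minimal registry where the instance itself acts as a decorator."""

    def __init__(self, name):
        self.name = name
        self._fns = {}

    def __call__(self, fn=None, name=None):
        if fn is None:  # used as ``@registry(name="...")``
            return lambda f: self(f, name=name)
        self._fns[name or fn.__name__] = fn
        return fn

    # supporting both is cheap: ``.add`` simply delegates to ``__call__``
    add = __call__

    def get(self, name):
        return self._fns[name]

backbones = RegistrySketch("backbones")

@backbones(name="resnet18")
def build_resnet18():
    return "resnet18-model"

backbones.get("resnet18")()  # -> "resnet18-model"
```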

@awaelchli
Contributor

The data pipeline offers great features, and it looks very powerful. Thanks for that, looking forward to using it.

The user guide in the class docstrings is making it a bit hard to follow how Preprocess and other components relate to each other. I unfortunately did not understand the motivation behind registry. I suppose this is here to simplify how users select different backbones? Maybe one can show a before and after to highlight how Flash simplifies it for the user.

Available Registries
********************

Registries are Flash's internal key-value stores, each mapping a name to a function.
Contributor

Is it only meant to store backbones? If so, let's be explicit about it.

Contributor Author

Done !


In simple words, they are just advanced dictionaries storing a function under a key string.

Registries help organize code and make the functions accessible all across the ``Flash`` codebase.
Contributor

maybe add:

If you are creating a custom backbone and want to easily be able to use it in your flash task by simply providing it as a string, you should add it to a registry. You can do that in a few simple steps:

  1. import...
  2. init registry
  3. Add your function as a decorator or directly
  4. you can now access your function from your task!

(add snippets for each step of course)
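A hedged end-to-end sketch of those four steps, using a stand-in registry class. Names like ``IMAGE_CLASSIFIER_BACKBONES`` and ``my_backbone`` are invented for the example; the real flash API may differ:

```python
class FlashRegistrySketch(dict):
    """Toy registry: a dict plus a decorator-style ``add``."""

    def add(self, name):
        def deco(fn):
            self[name] = fn
            return fn
        return deco

# Steps 1-2: import (here, define) and initialize the registry.
IMAGE_CLASSIFIER_BACKBONES = FlashRegistrySketch()

# Step 3: add your function via the decorator.
@IMAGE_CLASSIFIER_BACKBONES.add("my_backbone")
def my_backbone(pretrained: bool = True):
    return ("my-backbone", pretrained)

# Step 4: a task can now look the backbone up by string.
model_fn = IMAGE_CLASSIFIER_BACKBONES["my_backbone"]
model_fn(pretrained=False)  # -> ("my-backbone", False)
```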

@@ -34,6 +34,9 @@ Lightning Flash

general/model
general/data
general/callback
Contributor

What do you think about creating an "Advanced" submenu here that will have all the API + creating a custom task tutorial?

Contributor Author

Yes, we will do so in another PR. It definitely needs some re-organization.

The :class:`~flash.data.process.Postprocess` hooks cover everything from model outputs to prediction export.

*******************************************
How to use out-of-the-box flash datamodules
Contributor

@edenlightning edenlightning Apr 12, 2021

Move this to the top of the page.

It should be

Data

Flash has 2 main APIs for handling data: DataModules and DataPipelines.

DataModules contain the dataset and information for loading the dataset, and DataPipeline contains all preprocessing and postprocessing logic.

Using built-in DataModules


Flash has built-in datamodules you can use to load your data to any existing flash task. For more info, check out....

Customize your data


If you are creating a new task or using a different type of dataset for an existing task, you may want to override ....

and then add the terminology table and then

Create custom DataModule

Create custom Data pipeline

Custom DataModule + data Pipeline

Checkout the :ref:`image_classification` section or any other tasks to learn more about them.

********************************
Why Preprocess and PostProcess ?
Contributor

Suggested change
Why Preprocess and PostProcess ?
Data Processing

Contributor Author

Done!

Currently, it is common practice to implement a `Dataset <https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset>`_
and provide it to a `DataLoader <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader>`_.

However, after model training, a lot of engineering overhead is required to run inference on raw data and deploy the model in a production environment.
Contributor

Explain why? Since the pre+post processing steps are not included, etc

Contributor Author

Done !

Custom Preprocess + Datamodule
******************************

The example below shows a very simple ``ImageClassificationPreprocess`` with an ``ImageClassificationDataModule``.
Contributor

Always start with the motivation. Start by explaining it in general, and then move to the example.

If you want to add custom bla and custom bla, you might need to create custom dm and a custom datapipeline.

  1. Create a DataModule - The DataModule should have helper functions to instantiate the DataModule for a relevant input type (folder structure, csv, numpy array, etc). Let's say in our example we have folders of images arranged like.... then we will create a from_folders.... etc
  2. Create the Preprocess - your task might need some preprocessing and postprocessing transforms. In this example.... bla bla bla
  3. Make sure the processing API is used in your datamodule
  4. Use your new datamodule in your task
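The steps above could be sketched roughly as follows. The class names, the folder layout, and the ``from_folders`` signature are assumptions for illustration, not the actual flash API:

```python
import os

class ImagePreprocessSketch:
    """Step 2: a stand-in Preprocess with the data loading logic."""

    def load_data(self, folder):
        # Map class subfolders to (path, label) pairs.
        classes = sorted(os.listdir(folder))
        return [
            (os.path.join(folder, cls, name), label)
            for label, cls in enumerate(classes)
            for name in sorted(os.listdir(os.path.join(folder, cls)))
        ]

class ImageDataModuleSketch:
    """Step 1: a stand-in DataModule with a ``from_folders`` helper."""

    preprocess_cls = ImagePreprocessSketch

    def __init__(self, train_samples):
        self.train_samples = train_samples

    @classmethod
    def from_folders(cls, train_folder, preprocess=None):
        # Step 3: the processing API is used inside the datamodule.
        preprocess = preprocess or cls.preprocess_cls()
        return cls(preprocess.load_data(train_folder))
```

Step 4 would then be passing the resulting datamodule to a task's ``fit``/``trainer`` entry point.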

Comment on lines 105 to 106
The :class:`~flash.data.process.Preprocess` is used to encapsulate
all the processing data loading logic up to the model.
Contributor

Suggested change
The :class:`~flash.data.process.Preprocess` is used to encapsulate
all the processing data loading logic up to the model.
The :class:`~flash.data.process.Preprocess` encapsulates
all the data processing and loading logic that should run before the data is passed to the model.

Contributor Author

Done !


* Output: Return the tensored image and its label.

- ``per_batch_transform``: Performs transforms on a batch.
Contributor

Which example? I don't think this makes sense in autodoc.


.. note::

By default, each hook is a no-op, except ``collate``, which uses the PyTorch default collate.
Contributor

make the collate a link to the pytorch method.

Contributor Author

Done !

It is particularly relevant when you want to provide an end-to-end implementation that works
across 4 different stages: ``train``, ``validation``, ``test``, and inference (``predict``).

The :class:`~flash.data.process.Preprocess` supports the following hooks:
Contributor

@edenlightning edenlightning Apr 12, 2021

Let's just add here something like:
You can override any of the preprocessing hooks to provide custom functionality. All hooks default to no-op (except....)

And then add a snippet example of a full custom Preprocess.

With the snippet you can order them in the order in which they are being called.

All the hooks can be explained in their own autodocs.
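A hedged sketch of what such a snippet could look like, with the hooks listed in their call order. This is a stand-in class, not the actual flash base class, and ``float`` stands in for real tensor conversion:

```python
class FullPreprocessSketch:
    """Hooks in call order: load_data -> load_sample ->
    pre_tensor_transform -> to_tensor_transform -> post_tensor_transform
    -> collate -> per_batch_transform."""

    def load_data(self, data):
        return list(data)

    def load_sample(self, sample):
        return sample

    def pre_tensor_transform(self, sample):
        return sample

    def to_tensor_transform(self, sample):
        return float(sample)  # stand-in for tensor conversion

    def post_tensor_transform(self, sample):
        return sample

    def collate(self, samples):
        return samples

    def per_batch_transform(self, batch):
        return batch

p = FullPreprocessSketch()
batch = p.per_batch_transform(
    p.collate([
        p.post_tensor_transform(p.to_tensor_transform(p.pre_tensor_transform(p.load_sample(s))))
        for s in p.load_data([1, 2])
    ])
)  # -> [1.0, 2.0]
```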

Contributor Author

Let's clean this out in another PR.

@@ -168,7 +313,7 @@ def add_callbacks(self, callbacks: List['FlashCallback']):
self._callbacks.extend(_callbacks)

@classmethod
def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Any:
def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Mapping:
"""Loads entire data from Dataset"""
Contributor

Add more details. What kind of input can you pass in?

How to create the mapping?
+code example

Contributor Author

Done !

@@ -178,17 +323,22 @@ def load_sample(cls, sample: Any, dataset: Optional[Any] = None) -> Any:
return sample
Contributor

same here. more details, return type, code snippet

@@ -178,17 +323,22 @@ def load_sample(cls, sample: Any, dataset: Optional[Any] = None) -> Any:
return sample

def pre_tensor_transform(self, sample: Any) -> Any:
"""Transforms to apply on a single object."""
Contributor

return type, code snippet

return sample

def to_tensor_transform(self, sample: Any) -> Tensor:
"""Transforms to convert single object to a tensor."""
Contributor

return type, code snippet

return sample

def post_tensor_transform(self, sample: Tensor) -> Tensor:
"""Transforms to apply on a tensor."""
Contributor

return type, code snippet

return sample

def post_tensor_transform(self, sample: Tensor) -> Tensor:
"""Transforms to apply on a tensor."""
return sample

def per_batch_transform(self, batch: Any) -> Any:
"""Transforms to apply to a whole batch (if possible use this for efficiency).
Contributor

return type, code snippet

@@ -199,10 +349,14 @@ def collate(self, samples: Sequence) -> Any:

def per_sample_transform_on_device(self, sample: Any) -> Any:
"""Transforms to apply to the data before the collation (per-sample basis).
Contributor

return type, code snippet

@@ -211,7 +365,9 @@ def per_sample_transform_on_device(self, sample: Any) -> Any:
def per_batch_transform_on_device(self, batch: Any) -> Any:
"""
Transforms to apply to a whole batch (if possible use this for efficiency).
Contributor

return type, code snippet

@carmocca carmocca changed the title [doc] Add DataPipeline + Callbacks + Registry [WIP] [doc] Add DataPipeline + Callbacks + Registry Apr 12, 2021
def on_post_tensor_transform(self, sample: Tensor, running_stage: RunningStage) -> None:
self._store(sample, "post_tensor_transform", running_stage)
# visualize a ``train`` batch
dm.show_train_batches()
Contributor

@tchaton I'm trying to replicate this exact example and I get:
AttributeError: 'CustomImageClassificationData object has no attribute 'show_train_batches'

Contributor Author

You have the test_base_data_fetcher if you want to check it out.

# Assuming you have images in numpy format,
# just override ``load_sample`` hook and add your own logic.
@staticmethod
def load_sample(sample) -> Tuple[Image.Image, int]:
Contributor

Suggested change
def load_sample(sample) -> Tuple[Image.Image, int]:
def load_sample(sample) -> Tuple[np.ndarray, int]:

@tchaton tchaton merged commit 6185bda into master Apr 13, 2021
@tchaton tchaton deleted the revamp_doc branch April 13, 2021 14:11