This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

[doc] Add DataPipeline + Callbacks + Registry #207

Merged
merged 29 commits into from
Apr 13, 2021

Conversation

tchaton
Contributor

@tchaton tchaton commented Apr 9, 2021

What does this PR do?

Docs docs docs

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #207 (c096c6f) into master (4a2aa47) will decrease coverage by 0.19%.
The diff coverage is 89.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   86.53%   86.33%   -0.20%     
==========================================
  Files          57       57              
  Lines        2732     2759      +27     
==========================================
+ Hits         2364     2382      +18     
- Misses        368      377       +9     
Flag Coverage Δ
unittests 86.33% <89.83%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
flash/__init__.py 100.00% <ø> (ø)
flash/vision/backbones.py 79.72% <66.66%> (-1.43%) ⬇️
flash/data/data_pipeline.py 88.04% <75.00%> (+0.67%) ⬆️
flash/data/data_module.py 77.66% <86.11%> (-0.94%) ⬇️
flash/data/callback.py 98.70% <97.56%> (-1.30%) ⬇️
flash/core/model.py 92.21% <100.00%> (-1.77%) ⬇️
flash/core/registry.py 100.00% <100.00%> (ø)
flash/data/base_viz.py 100.00% <100.00%> (+1.75%) ⬆️
flash/data/process.py 84.41% <100.00%> (ø)
flash/data/utils.py 96.87% <100.00%> (+0.03%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4a2aa47...c096c6f. Read the comment docs.

@Borda Borda added the documentation Improvements or additions to documentation label Apr 9, 2021
@tchaton tchaton self-assigned this Apr 9, 2021
@tchaton tchaton added this to the 0.2 milestone Apr 9, 2021
@pep8speaks

pep8speaks commented Apr 9, 2021

Hello @tchaton! Thanks for updating this PR.

Line 182:13: W503 line break before binary operator

Comment last updated at 2021-04-13 12:46:25 UTC

*******************************
Using DataModule + DataPipeline
*******************************

Contributor

This is missing context.

What are these objects? Why do they exist? when do I need to use them?

I think we can do something like

  1. How to use out-of-the-box flash datamodules
  2. How to customize existing datamodules
  3. How to build a datamodule for a new task
  4. What are data pipelines and when are they required
  5. How to use them

Contributor Author

Yes. Definitely. I will work on this next week :)

flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)

The :class:`~flash.data.process.Preprocess` is currently supporting the following hooks:

- ``load_data``: Expects some metadata and returns a Mapping (it can be a ``Dataset``, but this is not recommended)
Contributor

Not clear what is "some metadata" and what mapping. can we be specific?

Contributor Author

Added an example. Is it better?
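For illustration, here is a minimal sketch of what such a ``load_data`` example could look like. The class name, folder layout, and return type are assumptions for this sketch, not the actual flash API:

```python
import os
from typing import List, Tuple

class ImageClassificationPreprocessSketch:
    """Stand-in for a flash ``Preprocess``; illustrative only."""

    def load_data(self, folder: str) -> List[Tuple[str, int]]:
        # The "metadata" here is just a folder path; the returned
        # Mapping-like value is a plain list of (path, label) pairs.
        classes = sorted(os.listdir(folder))
        return [
            (os.path.join(folder, cls, name), label)
            for label, cls in enumerate(classes)
            for name in sorted(os.listdir(os.path.join(folder, cls)))
        ]
```

The point being illustrated: the input is lightweight metadata (a path), and the output is an indexable collection of samples rather than a full ``Dataset``.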

@edenlightning
Contributor

edenlightning commented Apr 9, 2021

This is a lot of information :)

As I see it, there are 3 types of documentation:

  1. API (what is the class/method doing, what are the inputs and outputs)
  2. How to use it
  3. How it works

Right now looks like this is all together in the autodoc.
I suggest we separate this.

The autodoc should contain ONLY the API stuff, not how it works etc.

Then, the rest of the files need to include two layers of abstraction:

  1. How to use this API
  2. How it works behind the scenes

When is this API needed? When building a task, or just when using a custom dataset? We might want that separated too.

Also, it would be good to explain somewhere the relationship between hooks. For example, is load_sample called by load_data?

flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)
flash/data/process.py (outdated, resolved)

Each hook can be specialized by adding a prefix such as ``train``, ``val``, ``test``, or ``predict``.

Example::
Contributor

Explain this. For example, if we want the data loading at training to do bla bla bla then ....

Contributor Author

Done !
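To make the prefix mechanism concrete, here is a hedged sketch of how a stage-prefixed hook could shadow the generic one. The class and the ``resolve_hook`` helper are illustrative stand-ins, not flash internals:

```python
class PreprocessSketch:
    """Stand-in preprocess with a generic and a train-specific hook."""

    def load_data(self, data):
        return list(data)

    def train_load_data(self, data):
        # Training-only variant: e.g. drop missing samples.
        return [d for d in data if d is not None]

def resolve_hook(obj, name, stage):
    """Prefer ``<stage>_<name>`` over the generic ``<name>`` hook."""
    return getattr(obj, f"{stage}_{name}", None) or getattr(obj, name)

p = PreprocessSketch()
resolve_hook(p, "load_data", "train")(["a", None, "b"])  # -> ["a", "b"]
resolve_hook(p, "load_data", "val")(["a", None])         # falls back to load_data
```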

flash/data/process.py (outdated, resolved)

.. note::

It is possible to wrap a ``Dataset`` within a :class:`~flash.data.process.Preprocess` ``load_data`` function.
Contributor

Please clarify.
When would you want to wrap a dataset? is it recommended or not?

Contributor Author

Not recommended. I added it there.

The ``pre_tensor_transform``, ``to_tensor_transform``, ``post_tensor_transform``, ``collate``, and ``per_batch_transform``
hooks are injected as the ``collate_fn`` of the DataLoader.

Here is the pseudo code using the preprocess hook names. Flash will take care of calling the right hooks for each stage.
Contributor

This is def a lot for autodoc. Should we move this to the rst file? Under "How this works under the hood" (if it is necessary)

Contributor Author

Yes, I wrote everything in one place to not forget anything and will split it after.
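As a rough sketch of the injection described above (illustrative only; flash's actual wiring differs, and plain callables stand in for the hooks), the per-sample hooks plus ``collate`` can be folded into a single DataLoader ``collate_fn``:

```python
from typing import Any, Callable, Sequence

def make_collate_fn(
    pre_tensor_transform: Callable,
    to_tensor_transform: Callable,
    post_tensor_transform: Callable,
    collate: Callable,
    per_batch_transform: Callable,
) -> Callable:
    """Compose the per-sample hooks, collation, and batch hook."""
    def collate_fn(samples: Sequence[Any]) -> Any:
        samples = [
            post_tensor_transform(to_tensor_transform(pre_tensor_transform(s)))
            for s in samples
        ]
        return per_batch_transform(collate(samples))
    return collate_fn

identity = lambda x: x
fn = make_collate_fn(identity, identity, identity, list, identity)
```

A real ``DataLoader(dataset, collate_fn=fn)`` call would then run all five hooks once per batch.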


return datasets.MNIST(path_to_data, download=True, transform=transforms.ToTensor())

.. note:: The ``load_data`` and ``load_sample`` hooks will be used to generate an ``AutoDataset`` object.
Contributor

This is growing way too long. Is this just a metadata object? If so, let's add this description inside the hook. Please explain it in words, not in pseudo code.

Contributor

@awaelchli awaelchli left a comment

I agree with Eden, it would be better to separate user guide from API doc strings.

docs/source/custom_task.rst (outdated, resolved)

def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Iterable:

print(self.current_fn)
Contributor

Why the need for current_fn? When would we need that?
One can always get the name of the current function through __name__.
Tracking the current function being executed ourselves could be very fragile.

Contributor Author

When running inside a hook, how would you get the hook name?

It is used for computer vision, where transforms are provided as a dictionary with key: hook_name, value: transforms.

It enables us to implement a current_transform property available to users directly: https://github.com/PyTorchLightning/lightning-flash/blob/4a2aa476c2dcf56d9384c969fc351ea7f46a3b3a/flash/data/process.py#L140
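A minimal sketch of that idea (a stand-in class; the real implementation at the linked line differs):

```python
class PreprocessSketch:
    """Tracks which hook is running so per-hook transforms can resolve."""

    def __init__(self, transforms):
        self._transforms = transforms  # {hook_name: callable}
        self.current_fn = None

    @property
    def current_transform(self):
        # Fall back to a no-op when no transform is registered for the hook.
        return self._transforms.get(self.current_fn, lambda x: x)

    def run_hook(self, name, sample):
        self.current_fn = name
        try:
            return self.current_transform(sample)
        finally:
            self.current_fn = None

p = PreprocessSketch({"pre_tensor_transform": str.upper})
p.run_hook("pre_tensor_transform", "abc")  # -> "ABC"
```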

@@ -123,12 +123,17 @@ def _find_matching_index(self, item: _REGISTERED_FUNCTION) -> Optional[int]:

def __call__(
Contributor

Not sure if __call__ is very intuitive. I would expect a .add or similar.

Contributor Author

Yes, we had several chats about this, but it makes things nicer when used as a decorator.

Contributor

We could support both?
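For context, a toy sketch of why ``__call__`` pairs well with decorators, and how both spellings could be supported. This is a stand-in, not the actual FlashRegistry code:

```python
class RegistrySketch:
    """Minimal registry where the instance itself acts as a decorator."""

    def __init__(self, name):
        self.name = name
        self._fns = {}

    def __call__(self, fn=None, name=None):
        if fn is None:  # used as ``@registry(name="...")``
            return lambda f: self(f, name=name)
        self._fns[name or fn.__name__] = fn
        return fn

    # supporting both is cheap: ``.add`` simply delegates to ``__call__``
    add = __call__

    def get(self, name):
        return self._fns[name]

backbones = RegistrySketch("backbones")

@backbones(name="resnet18")
def build_resnet18():
    return "resnet18-model"

backbones.get("resnet18")()  # -> "resnet18-model"
```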

@awaelchli
Contributor

The data pipeline offers great features, and it looks very powerful. Thanks for that, looking forward to using it.

The user guide in the class docstrings is making it a bit hard to follow how Preprocess and other components relate to each other. I unfortunately did not understand the motivation behind registry. I suppose this is here to simplify how users select different backbones? Maybe one can show a before and after to highlight how Flash simplifies it for the user.

Available Registries
********************

Registries are Flash's internal key-value stores, each mapping a name to a function.
Contributor

Is it only meant to store backbones? If so, let's be explicit about it.

Contributor Author

Done !


In simple words, they are just advanced dictionaries storing a function under a key string.

Registries help organize code and make the functions accessible all across the ``Flash`` codebase.
Contributor

maybe add:

If you are creating a custom backbone and want to easily be able to use it in your flash task by simply providing it as a string, you should add it to a registry. You can do that in a few simple steps:

  1. import...
  2. init registry
  3. Add your function as a decorator or directly
  4. you can now access your function from your task!

(add snippets for each step of course)
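A hedged end-to-end sketch of those four steps, using a stand-in registry class. Names like ``IMAGE_CLASSIFIER_BACKBONES`` and ``my_backbone`` are invented for the example; the real flash API may differ:

```python
class FlashRegistrySketch(dict):
    """Toy registry: a dict plus a decorator-style ``add``."""

    def add(self, name):
        def deco(fn):
            self[name] = fn
            return fn
        return deco

# Steps 1-2: import (here, define) and initialize the registry.
IMAGE_CLASSIFIER_BACKBONES = FlashRegistrySketch()

# Step 3: add your function via the decorator.
@IMAGE_CLASSIFIER_BACKBONES.add("my_backbone")
def my_backbone(pretrained: bool = True):
    return ("my-backbone", pretrained)

# Step 4: a task can now look the backbone up by string.
model_fn = IMAGE_CLASSIFIER_BACKBONES["my_backbone"]
model_fn(pretrained=False)  # -> ("my-backbone", False)
```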

@@ -34,6 +34,9 @@ Lightning Flash

general/model
general/data
general/callback
Contributor

What do you think about creating an "Advanced" submenu here that will have all the API + creating a custom task tutorial?

Contributor Author

Yes, we will do so in another PR. It definitely needs some re-organization.

The :class:`~flash.data.process.Postprocess` hooks cover everything from model outputs to prediction export.

*******************************************
How to use out-of-the-box flash datamodules
Contributor

@edenlightning edenlightning Apr 12, 2021

Move this to the top of the page.

It should be

Data

Flash has 2 main APIs for handling data: DataModules and DataPipelines.

DataModules contain the dataset and information for loading the dataset, and DataPipeline contains all preprocessing and postprocessing logic.

Using built-in DataModules


Flash has built-in datamodules you can use to load your data to any existing flash task. For more info, check out....

Customize your data


If you are creating a new task or using a different type of dataset for an existing task, you may want to override ....

and then add the terminology table and then

Create custom DataModule

Create custom Data pipeline

Custom DataModule + data Pipeline

Checkout the :ref:`image_classification` section or any other tasks to learn more about them.

********************************
Why Preprocess and PostProcess ?
Contributor

Suggested change
Why Preprocess and PostProcess ?
Data Processing

Contributor Author

Done!

Currently, it is common practice to implement a `Dataset <https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset>`_
and provide it to a `DataLoader <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader>`_.

However, after model training, a lot of engineering overhead is required to run inference on raw data and deploy the model in a production environment.
Contributor

Explain why? Since the pre+post processing steps are not included, etc

Contributor Author

Done !

Custom Preprocess + Datamodule
******************************

The example below shows a very simple ``ImageClassificationPreprocess`` with an ``ImageClassificationDataModule``.
Contributor

Always start with the motivation. Start by explaining it in general, and then move to the example.

If you want to add custom bla and custom bla, you might need to create custom dm and a custom datapipeline.

  1. Create a DataModule - The DataModule should have helper functions to instantiate the DataModule for a relevant input type (folder structure, csv, numpy array, etc). Let's say in our example we have folders of images arranged like.... then we will create a from_folders.... etc
  2. Create the Preprocess - your task might need some preprocessing and postprocessing transforms. In this example.... bla bla bla
  3. Make sure the processing API is used in your datamodule
  4. Use your new datamodule in your task
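The steps above could be sketched roughly as follows. The class names, the folder layout, and the ``from_folders`` signature are assumptions for illustration, not the actual flash API:

```python
import os

class ImagePreprocessSketch:
    """Step 2: a stand-in Preprocess with the data loading logic."""

    def load_data(self, folder):
        # Map class subfolders to (path, label) pairs.
        classes = sorted(os.listdir(folder))
        return [
            (os.path.join(folder, cls, name), label)
            for label, cls in enumerate(classes)
            for name in sorted(os.listdir(os.path.join(folder, cls)))
        ]

class ImageDataModuleSketch:
    """Step 1: a stand-in DataModule with a ``from_folders`` helper."""

    preprocess_cls = ImagePreprocessSketch

    def __init__(self, train_samples):
        self.train_samples = train_samples

    @classmethod
    def from_folders(cls, train_folder, preprocess=None):
        # Step 3: the processing API is used inside the datamodule.
        preprocess = preprocess or cls.preprocess_cls()
        return cls(preprocess.load_data(train_folder))
```

Step 4 would then be passing the resulting datamodule to a task's ``fit``/``trainer`` entry point.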

Comment on lines 105 to 106
The :class:`~flash.data.process.Preprocess` is used to encapsulate
all the processing data loading logic up to the model.
Contributor

Suggested change
The :class:`~flash.data.process.Preprocess` is used to encapsulate
all the processing data loading logic up to the model.
The :class:`~flash.data.process.Preprocess` encapsulates
all the data processing and loading logic that should run before the data is passed to the model.

Contributor Author

Done !


* Output: Return the tensored image and its label.

- ``per_batch_transform``: Performs transforms on a batch.
Contributor

Which example? I don't think this makes sense in autodoc.


.. note::

By default, each hook is a no-op, except ``collate``, which uses the PyTorch default collate.
Contributor

make the collate a link to the pytorch method.

Contributor Author

Done !

It is particularly relevant when you want to provide an end-to-end implementation that works
across 4 different stages: ``train``, ``validation``, ``test``, and inference (``predict``).

The :class:`~flash.data.process.Preprocess` supports the following hooks:
Contributor

@edenlightning edenlightning Apr 12, 2021

Let's just add here something like:
You can override any of the preprocessing hooks to provide custom functionality. All hooks default to no-op (except....)

And then add a snippet example of a full custom Preprocess.

With the snippet you can order them in the order in which they are being called.

All the hooks can be explained in their own autodocs.
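A hedged sketch of what such a snippet could look like, with the hooks listed in their call order. This is a stand-in class, not the actual flash base class, and ``float`` stands in for real tensor conversion:

```python
class FullPreprocessSketch:
    """Hooks in call order: load_data -> load_sample ->
    pre_tensor_transform -> to_tensor_transform -> post_tensor_transform
    -> collate -> per_batch_transform."""

    def load_data(self, data):
        return list(data)

    def load_sample(self, sample):
        return sample

    def pre_tensor_transform(self, sample):
        return sample

    def to_tensor_transform(self, sample):
        return float(sample)  # stand-in for tensor conversion

    def post_tensor_transform(self, sample):
        return sample

    def collate(self, samples):
        return samples

    def per_batch_transform(self, batch):
        return batch

p = FullPreprocessSketch()
batch = p.per_batch_transform(
    p.collate([
        p.post_tensor_transform(p.to_tensor_transform(p.pre_tensor_transform(p.load_sample(s))))
        for s in p.load_data([1, 2])
    ])
)  # -> [1.0, 2.0]
```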

Contributor Author

Let's clean this out in another PR.

@@ -168,7 +313,7 @@ def add_callbacks(self, callbacks: List['FlashCallback']):
self._callbacks.extend(_callbacks)

@classmethod
def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Any:
def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Mapping:
"""Loads entire data from Dataset"""
Contributor

Add more details. What kind of input can you pass in?

How to create the mapping?
+code example

Contributor Author

Done !

@@ -178,17 +323,22 @@ def load_sample(cls, sample: Any, dataset: Optional[Any] = None) -> Any:
return sample
Contributor

same here. more details, return type, code snippet

@@ -178,17 +323,22 @@ def load_sample(cls, sample: Any, dataset: Optional[Any] = None) -> Any:
return sample

def pre_tensor_transform(self, sample: Any) -> Any:
"""Transforms to apply on a single object."""
Contributor

return type, code snippet

return sample

def to_tensor_transform(self, sample: Any) -> Tensor:
"""Transforms to convert single object to a tensor."""
Contributor

return type, code snippet

return sample

def post_tensor_transform(self, sample: Tensor) -> Tensor:
"""Transforms to apply on a tensor."""
Contributor

return type, code snippet

return sample

def post_tensor_transform(self, sample: Tensor) -> Tensor:
"""Transforms to apply on a tensor."""
return sample

def per_batch_transform(self, batch: Any) -> Any:
"""Transforms to apply to a whole batch (if possible use this for efficiency).
Contributor

return type, code snippet

@@ -199,10 +349,14 @@ def collate(self, samples: Sequence) -> Any:

def per_sample_transform_on_device(self, sample: Any) -> Any:
"""Transforms to apply to the data before the collation (per-sample basis).
Contributor

return type, code snippet

@@ -211,7 +365,9 @@ def per_sample_transform_on_device(self, sample: Any) -> Any:
def per_batch_transform_on_device(self, batch: Any) -> Any:
"""
Transforms to apply to a whole batch (if possible use this for efficiency).
Contributor

return type, code snippet

@carmocca carmocca changed the title [doc] Add DataPipeline + Callbacks + Registry [WIP] [doc] Add DataPipeline + Callbacks + Registry Apr 12, 2021
def on_post_tensor_transform(self, sample: Tensor, running_stage: RunningStage) -> None:
self._store(sample, "post_tensor_transform", running_stage)
# visualize a ``train`` batch
dm.show_train_batches()
Contributor

@tchaton I'm trying to replicate this exact example and I get:
AttributeError: 'CustomImageClassificationData object has no attribute 'show_train_batches'

Contributor Author

You have the test_base_data_fetcher if you want to check it out.

# Assuming you have images in numpy format,
# just override ``load_sample`` hook and add your own logic.
@staticmethod
def load_sample(sample) -> Tuple[Image.Image, int]:
Contributor

Suggested change
def load_sample(sample) -> Tuple[Image.Image, int]:
def load_sample(sample) -> Tuple[np.ndarray, int]:

@tchaton tchaton merged commit 6185bda into master Apr 13, 2021
@tchaton tchaton deleted the revamp_doc branch April 13, 2021 14:11