[doc] Add DataPipeline + Callbacks + Registry #207
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #207      +/-  ##
==========================================
- Coverage   86.53%   86.33%   -0.20%
==========================================
  Files          57       57
  Lines        2732     2759      +27
==========================================
+ Hits         2364     2382      +18
- Misses        368      377       +9

Flags with carried forward coverage won't be shown.
Continue to review full report at Codecov.
Hello @tchaton! Thanks for updating this PR.
Comment last updated at 2021-04-13 12:46:25 UTC
*******************************
Using DataModule + DataPipeline
*******************************
This is missing context.
What are these objects? Why do they exist? When do I need to use them?
I think we can do something like:
- How to use out-of-the-box Flash DataModules
- How to customize existing DataModules
- How to build a DataModule for a new task
- What data pipelines are and when they are required
- How to use them
Yes. Definitely. I will work on this next week :)
flash/data/process.py
Outdated
The :class:`~flash.data.process.Preprocess` currently supports the following hooks:

- ``load_data``: Expects some metadata and returns a Mapping (can be a ``Dataset``, but not recommended)
It's not clear what "some metadata" is, or which mapping. Can we be specific?
Added an example? Is it better?
This is a lot of information :) As I see it, there are 3 types of documentation:
Right now it looks like this is all together in the autodoc. The autodoc should contain ONLY the API stuff, not how it works etc. Then, the rst files need to include two layers of abstraction:
When is this API needed? If you are building a task, or just using a custom dataset? We might want that separate too. Also, it would maybe be good to explain somewhere the relationship between hooks. For example, is load_sample called by load_data?
flash/data/process.py
Outdated
Each hook can be specialized by adding a prefix such as ``train``, ``val``, ``test``, or ``predict``.

Example::
Explain this. For example, if we want the data loading at training to do bla bla bla then ....
Done!
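For illustration, a minimal sketch of what such a prefixed hook could look like (hypothetical class and transforms; assumes samples are ``(PIL image, label)`` pairs):

```python
from torchvision import transforms as T

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def train_pre_tensor_transform(self, sample):
        # Only called during the ``train`` stage: apply augmentation here.
        image, label = sample
        return T.RandomHorizontalFlip(p=0.5)(image), label

    def pre_tensor_transform(self, sample):
        # Fallback used by the other stages (val / test / predict): no-op.
        return sample
```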
flash/data/process.py
Outdated
.. note::

    It is possible to wrap a ``Dataset`` within a :class:`~flash.data.process.Preprocess` ``load_data`` function.
Please clarify.
When would you want to wrap a dataset? Is it recommended or not?
Not recommended. I added it there.
flash/data/process.py
Outdated
The ``pre_tensor_transform``, ``to_tensor_transform``, ``post_tensor_transform``, ``collate``, and ``per_batch_transform``
hooks are injected as the ``collate_fn`` function of the DataLoader.

Here is pseudocode using the preprocess hook names. Flash will take care of calling the right hooks for each stage.
This is definitely a lot for autodoc. Should we move this to the rst file, under "How this works under the hood" (if it is necessary)?
Yes, I wrote everything in one place so as not to forget anything, and will split it up afterwards.
flash/data/process.py
Outdated
return datasets.MNIST(path_to_data, download=True, transform=transforms.ToTensor())

.. note:: The ``load_data`` and ``load_sample`` hooks will be used to generate an ``AutoDataset`` object.
This is growing way too long. Is this just a metadata object? If so, let's add this description inside the hook. Please explain it in words, not in pseudocode.
I agree with Eden; it would be better to separate the user guide from the API docstrings.
flash/data/process.py
Outdated
def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Iterable:
    ...
    print(self.current_fn)
Why the need for current_fn? When would we need that?
One can always get the name of the current function through ``__name__``.
Tracking the current function being executed ourselves could be very fragile.
When running inside a hook, how would you get the hook name?
It is used in computer vision, where transforms are provided as a dictionary with the hook name as key and the transforms as value.
It enables implementing a ``current_transform`` available to the users directly: https://github.com/PyTorchLightning/lightning-flash/blob/4a2aa476c2dcf56d9384c969fc351ea7f46a3b3a/flash/data/process.py#L140
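A hedged sketch of that pattern (the dict keys and ``current_transform`` behaviour are inferred from the linked code; exactly how the dict is handed to the Preprocess is assumed):

```python
from torchvision import transforms as T

from flash.data.process import Preprocess

# Transforms keyed by hook name; this dict would be handed to the Preprocess
# (assumption: via its constructor or the owning DataModule).
train_transform = {
    "pre_tensor_transform": T.RandomHorizontalFlip(),
    "to_tensor_transform": T.ToTensor(),
}


class MyPreprocess(Preprocess):

    def pre_tensor_transform(self, sample):
        # ``current_fn`` is "pre_tensor_transform" here, so
        # ``current_transform`` resolves to the RandomHorizontalFlip above.
        image, label = sample
        return self.current_transform(image), label

    def to_tensor_transform(self, sample):
        image, label = sample
        return self.current_transform(image), label
```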
@@ -123,12 +123,17 @@ def _find_matching_index(self, item: _REGISTERED_FUNCTION) -> Optional[int]:

    def __call__(
Not sure if ``__call__`` is very intuitive. I would expect an ``.add`` or similar.
Yes, we had several chats about this, but it makes things nicer when used as a decorator.
We could support both?
The data pipeline offers great features and looks very powerful. Thanks for that; looking forward to using it. The user guide in the class docstrings makes it a bit hard to follow how Preprocess and the other components relate to each other. I unfortunately did not understand the motivation behind the registry. I suppose it is here to simplify how users select different backbones? Maybe one could show a before and after to highlight how Flash simplifies this for the user.
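For what it's worth, a before/after along those lines might look like this (hedged sketch; the ``ImageClassifier(backbone="resnet18")`` form reflects Flash's string-based backbone selection, but exact signatures may differ):

```python
# Before: wiring a backbone by hand with torchvision.
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet18(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head

# After: the task resolves the backbone from a registry by its string name.
from flash.vision import ImageClassifier

model = ImageClassifier(num_classes=10, backbone="resnet18")
```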
Available Registries
********************

Registries are Flash's internal key-value database that stores a mapping between a name and a function.
Is it only meant to store backbones? If so, let's be explicit about it.
Done!
In simple words, they are just advanced dictionaries storing a function under a key string.

Registries help organize code and make the functions accessible all across the ``Flash`` codebase.
Maybe add: if you are creating a custom backbone and want to easily be able to use it in your Flash task by simply providing it as a string, you should add it to a registry. You can do that in a few simple steps:
- Import...
- Init the registry
- Add your function as a decorator or directly
- You can now access your function from your task!
(add snippets for each step of course; see the sketch below)
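A hedged sketch of those steps (the ``FlashRegistry`` import path, decorator form, and ``get`` method are assumed from this PR's diff):

```python
import torch.nn as nn

from flash.core.registry import FlashRegistry

# 1. Init the registry.
MY_BACKBONES = FlashRegistry("backbones")


# 2. Add your function with the decorator form...
@MY_BACKBONES(name="my_backbone")
def my_backbone():
    # Toy stand-in for a real backbone.
    return nn.Conv2d(3, 64, kernel_size=3)


# ...or register it directly.
MY_BACKBONES(my_backbone, name="my_backbone_alias")

# 3. Access your function (e.g. from your task) by its string name.
backbone_fn = MY_BACKBONES.get("my_backbone")
model = backbone_fn()
```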
@@ -34,6 +34,9 @@ Lightning Flash

   general/model
   general/data
   general/callback
What do you think about creating an "Advanced" submenu here that will have all the API + creating a custom task tutorial?
Yes, we will do so in another PR. It definitely needs some reorganization.
The :class:`~flash.data.process.Postprocess` hooks cover everything from model outputs to prediction export.

*******************************************
How to use out-of-the-box Flash DataModules
Move this to the top of the page. It should be:

Data

Flash has 2 main APIs for handling data: DataModules and DataPipeline.
DataModules contain the dataset and the information for loading it, and the DataPipeline contains all preprocessing and postprocessing logic.

Using built-in DataModules

Flash has built-in DataModules you can use to load your data into any existing Flash task. For more info, check out....

Customize your data

If you are creating a new task or using a different type of dataset for an existing task, you may want to override ....

and then add the terminology table, and then:

Create a custom DataModule
Create a custom DataPipeline
Custom DataModule + DataPipeline
docs/source/general/data.rst
Outdated
Check out the :ref:`image_classification` section, or any other task, to learn more about them.

********************************
Why Preprocess and PostProcess ?
Suggested change:
- Why Preprocess and PostProcess ?
+ Data Processing
Done!
Currently, it is common practice to implement a `Dataset <https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset>`_
and provide it to a `DataLoader <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader>`_.

However, after model training, a lot of engineering overhead is required to run inference on raw data and deploy the model in a production environment.
Explain why: since the pre- and post-processing steps are not included, etc.
Done!
Custom Preprocess + DataModule
******************************

The example below shows a very simple ``ImageClassificationPreprocess`` with an ``ImageClassificationDataModule``.
Always start with the motivation. Start by explaining it in general, and then move to the example.
If you want to add custom bla and custom bla, you might need to create a custom DataModule and a custom DataPipeline.
- Create a DataModule - the DataModule should have helper functions to instantiate it for a relevant input type (folder structure, csv, numpy array, etc). Let's say in our example we have folders of images arranged like.... then we will create a from_folders.... etc
- Create the Preprocess - your task might need some preprocess and postprocess transforms. In this example.... bla bla bla
- Make sure the processing API is used in your DataModule
- Use your new DataModule in your task
(a sketch along these lines follows below)
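A condensed, hypothetical sketch of those steps (the folder layout ``root/<class_name>/<image>`` is assumed, and the DataModule wiring is elided since its exact construction API comes from this PR):

```python
import os
from typing import Any, List, Optional, Tuple

from PIL import Image

from flash.data.process import Preprocess


class ImageClassificationPreprocess(Preprocess):

    @classmethod
    def load_data(cls, data: str, dataset: Optional[Any] = None) -> List[Tuple[str, int]]:
        # ``data`` is a folder path; return (filepath, class_index) pairs.
        classes = sorted(os.listdir(data))
        return [
            (os.path.join(data, cls_name, fname), idx)
            for idx, cls_name in enumerate(classes)
            for fname in os.listdir(os.path.join(data, cls_name))
        ]

    @classmethod
    def load_sample(cls, sample: Tuple[str, int], dataset: Optional[Any] = None) -> Tuple[Image.Image, int]:
        # Resolve one entry from ``load_data`` into an (image, label) sample.
        path, label = sample
        return Image.open(path), label
```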
flash/data/process.py
Outdated
The :class:`~flash.data.process.Preprocess` is used to encapsulate
all the processing data loading logic up to the model.
Suggested change:
- The :class:`~flash.data.process.Preprocess` is used to encapsulate
- all the processing data loading logic up to the model.
+ The :class:`~flash.data.process.Preprocess` encapsulates
+ all the data processing and loading logic that should run before the data is passed to the model.
Done!
* Output: Return the tensored image and its label.

- ``per_batch_transform``: Performs transforms on a batch.
Which example? I don't think this makes sense in autodoc.
flash/data/process.py
Outdated
.. note::

    By default, each hook is a no-op, except the collate, which is the PyTorch default collate.
Make the collate a link to the PyTorch method.
Done!
It is particularly relevant when you want to provide an end-to-end implementation which works
with 4 different stages: ``train``, ``validation``, ``test``, and inference (``predict``).

The :class:`~flash.data.process.Preprocess` supports the following hooks:
Let's just add here something like:
"You can override any of the preprocessing hooks to provide custom functionality. All hooks default to no-op (except....)"
And then add a snippet example of a full custom Preprocess.
With the snippet, you can order the hooks in the order in which they are called.
All the hooks can be explained in their own autodocs.
Let's clean this out in another PR.
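In the meantime, a hedged sketch of what such a snippet could look like, with the hooks in the order they are called (the ``(image, label)`` sample structure and the transforms are illustrative, not from the PR):

```python
from typing import Any, Sequence, Tuple

from torch import Tensor
from torch.utils.data.dataloader import default_collate
from torchvision import transforms as T

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def pre_tensor_transform(self, sample: Tuple[Any, int]) -> Tuple[Any, int]:
        image, label = sample
        return T.Resize((196, 196))(image), label  # still a PIL image here

    def to_tensor_transform(self, sample: Tuple[Any, int]) -> Tuple[Tensor, int]:
        image, label = sample
        return T.ToTensor()(image), label  # PIL -> Tensor

    def post_tensor_transform(self, sample: Tuple[Tensor, int]) -> Tuple[Tensor, int]:
        image, label = sample
        return T.RandomErasing()(image), label  # tensor-space augmentation

    def collate(self, samples: Sequence) -> Any:
        return default_collate(samples)  # PyTorch default collate

    def per_batch_transform(self, batch: Any) -> Any:
        return batch  # no-op: batch-level CPU transforms would go here
```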
flash/data/process.py
Outdated
@@ -168,7 +313,7 @@ def add_callbacks(self, callbacks: List['FlashCallback']):
        self._callbacks.extend(_callbacks)

    @classmethod
-   def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Any:
+   def load_data(cls, data: Any, dataset: Optional[Any] = None) -> Mapping:
        """Loads entire data from Dataset"""
Add more details. What kind of input can you pass in?
How to create the mapping?
+code example
Done!
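For reference, a hedged example of a custom ``load_data`` (hypothetical csv input; any metadata describing your samples should work, as long as the hook returns a map-style or iterable collection that ``load_sample`` can consume entry by entry):

```python
import csv
from typing import Any, List, Optional, Tuple

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    @classmethod
    def load_data(cls, data: str, dataset: Optional[Any] = None) -> List[Tuple[str, int]]:
        # ``data`` is a path to a csv with ``filepath,label`` rows; each entry
        # returned here is later handed to ``load_sample`` one at a time.
        with open(data) as f:
            return [(row["filepath"], int(row["label"])) for row in csv.DictReader(f)]
```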
@@ -178,17 +323,22 @@ def load_sample(cls, sample: Any, dataset: Optional[Any] = None) -> Any:
        return sample
same here. more details, return type, code snippet
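A possible snippet (hedged; assumes ``load_data`` produced ``(filepath, label)`` pairs):

```python
from typing import Any, Optional, Tuple

from PIL import Image

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    @classmethod
    def load_sample(cls, sample: Tuple[str, int], dataset: Optional[Any] = None) -> Tuple[Image.Image, int]:
        # Turn one entry from ``load_data`` into an actual (image, label) sample.
        path, label = sample
        return Image.open(path), label
```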
@@ -178,17 +323,22 @@ def load_sample(cls, sample: Any, dataset: Optional[Any] = None) -> Any:
        return sample

    def pre_tensor_transform(self, sample: Any) -> Any:
        """Transforms to apply on a single object."""
return type, code snippet
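A possible snippet (hedged; operates on the raw, not-yet-tensor sample, assumed to be a ``(PIL image, label)`` pair):

```python
from typing import Any, Tuple

from torchvision import transforms as T

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def pre_tensor_transform(self, sample: Tuple[Any, int]) -> Tuple[Any, int]:
        image, label = sample
        return T.Resize((196, 196))(image), label  # resize the PIL image
```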
        return sample

    def to_tensor_transform(self, sample: Any) -> Tensor:
        """Transforms to convert a single object to a tensor."""
return type, code snippet
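A possible snippet (hedged; same assumed sample structure):

```python
from typing import Any, Tuple

from torch import Tensor
from torchvision import transforms as T

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def to_tensor_transform(self, sample: Tuple[Any, int]) -> Tuple[Tensor, int]:
        image, label = sample
        return T.ToTensor()(image), label  # PIL image -> FloatTensor
```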
        return sample

    def post_tensor_transform(self, sample: Tensor) -> Tensor:
        """Transforms to apply on a tensor."""
return type, code snippet
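A possible snippet (hedged; the sample is now a ``(tensor, label)`` pair):

```python
from typing import Tuple

from torch import Tensor
from torchvision import transforms as T

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def post_tensor_transform(self, sample: Tuple[Tensor, int]) -> Tuple[Tensor, int]:
        image, label = sample
        return T.RandomErasing()(image), label  # augmentation in tensor space
```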
        return sample

    def post_tensor_transform(self, sample: Tensor) -> Tensor:
        """Transforms to apply on a tensor."""
        return sample

    def per_batch_transform(self, batch: Any) -> Any:
        """Transforms to apply to a whole batch (if possible use this for efficiency)."""
return type, code snippet
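A possible snippet (hedged; assumes the collated batch is an ``(inputs, targets)`` tensor pair):

```python
from typing import Tuple

import torch
from torch import Tensor

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def per_batch_transform(self, batch: Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tensor]:
        # One normalization over the whole stacked batch instead of per sample.
        inputs, targets = batch
        mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
        return (inputs - mean) / std, targets
```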
@@ -199,10 +349,14 @@ def collate(self, samples: Sequence) -> Any:

    def per_sample_transform_on_device(self, sample: Any) -> Any:
        """Transforms to apply to the data before the collation (per-sample basis)."""
return type, code snippet
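A possible snippet (hedged; runs on the target device, one sample at a time, before the device-side collation):

```python
from typing import Tuple

from torch import Tensor

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def per_sample_transform_on_device(self, sample: Tuple[Tensor, int]) -> Tuple[Tensor, int]:
        image, label = sample
        return image.float() / 255.0, label  # e.g. cast + rescale on the GPU
```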
@@ -211,7 +365,9 @@ def per_sample_transform_on_device(self, sample: Any) -> Any:

    def per_batch_transform_on_device(self, batch: Any) -> Any:
        """
        Transforms to apply to a whole batch (if possible use this for efficiency).
return type, code snippet
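A possible snippet (hedged; same idea as ``per_batch_transform``, but the batch has already been moved to the device, so the arithmetic runs on the GPU):

```python
from typing import Tuple

import torch
from torch import Tensor

from flash.data.process import Preprocess


class MyPreprocess(Preprocess):

    def per_batch_transform_on_device(self, batch: Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tensor]:
        inputs, targets = batch
        mean = torch.tensor([0.485, 0.456, 0.406], device=inputs.device).view(1, 3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225], device=inputs.device).view(1, 3, 1, 1)
        return (inputs - mean) / std, targets
```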
    def on_post_tensor_transform(self, sample: Tensor, running_stage: RunningStage) -> None:
        self._store(sample, "post_tensor_transform", running_stage)

# visualize a ``train`` batch
dm.show_train_batches()
@tchaton I'm trying to replicate this exact example and I get:
``AttributeError: 'CustomImageClassificationData' object has no attribute 'show_train_batches'``
You have the ``test_base_data_fetcher`` if you want to check it out.
# Assuming you have images in numpy format,
# just override ``load_sample`` hook and add your own logic.
@staticmethod
def load_sample(sample) -> Tuple[Image.Image, int]:
Suggested change:
- def load_sample(sample) -> Tuple[Image.Image, int]:
+ def load_sample(sample) -> Tuple[np.ndarray, int]:
What does this PR do?
Docs docs docs