[1/N] Data Sources #256

ethanwharris · 2021-04-29T19:55:47Z

What does this PR do?

Adds data sources. Docs and new tests will come in later PRs.

Before submitting

Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests? [not needed for typos/docs]
Did you verify new and existing tests pass locally with your changes?
If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

flash/core/model.py

edgarriba · 2021-04-30T08:39:36Z

flash/data/data_source.py

+        self,
+        data: Any,
+        dataset: Optional[Any] = None
+    ) -> Iterable[Mapping[str, Any]]:  # TODO: decide what type this should be


What is the string supposed to point to: specific files, images? It's a bit confusing what the load_data and load_sample methods are doing.

Strings are the keys. So data can be any list of dicts like {'input': ..., 'target': ...}. So the load_data and load_sample are the same as before, just now have to be mappings rather than tuples, but I still need to add back the docstring 😃
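
To make the reply above concrete, here is a minimal sketch of a data source whose `load_data` returns a list of mappings and whose `load_sample` takes and returns a mapping. The class `ToyDataSource` and the plain-string keys are illustrative assumptions, not the code in this PR.

```python
from typing import Any, Dict, List, Mapping, Optional, Sequence, Tuple


class ToyDataSource:
    """Hypothetical stand-in for a data source; not the Flash class."""

    def load_data(
        self, data: Tuple[Sequence[Any], Sequence[Any]], dataset: Optional[Any] = None
    ) -> List[Dict[str, Any]]:
        inputs, targets = data
        # One mapping per sample, keyed by strings rather than tuple positions.
        return [{"input": x, "target": y} for x, y in zip(inputs, targets)]

    def load_sample(self, sample: Mapping[str, Any], dataset: Optional[Any] = None) -> Dict[str, Any]:
        # Copy so the stored sample is not mutated; a real source might
        # open a file or decode an image here.
        result = dict(sample)
        result["input"] = float(result["input"])
        return result


source = ToyDataSource()
samples = source.load_data(([1, 2, 3], ["a", "b", "c"]))
first = source.load_sample(samples[0])
```

Here `samples` is the list of dicts described above, and `load_sample` is then applied to one mapping at a time.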

edgarriba · 2021-04-30T08:41:15Z

flash/data/data_source.py

+
+    def __init__(
+        self,
+        train_folder: Optional[Union[str, pathlib.Path, list]] = None,


I vote to only support primitive types here: list and list(str)

Yeah, typing is wrong here as it can't be a list anyway, but we should maybe just do str

flash/data/data_source.py

edgarriba · 2021-04-30T08:43:18Z

flash/data/data_source.py

+
+    def __init__(
+        self,
+        train_files: Optional[Union[Sequence[Union[str, pathlib.Path]], Iterable[Union[str, pathlib.Path]]]] = None,


Same as before, I would simplify the inputs to primitive types

Yeah, this should just be sequence of str

edgarriba · 2021-04-30T08:44:28Z

flash/data/data_source.py

+        self.extensions = extensions
+
+    def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
+        # TODO: Bring back the code to work out how many classes there are


_load_data_from_tuple , etc ?

Not sure what you mean here. data should always be a tuple, except that it can sometimes not be a tuple if we are predicting.

what is the difference with SequenceDataSource class ?

edgarriba · 2021-04-30T08:49:30Z

flash/data/transforms.py

+from flash.data.utils import convert_to_modules
+
+
+class ApplyToKeys(nn.Sequential):


this will work only for single inputs. E.g. in segmentation or tasks that require augmenting not only the inputs we still need something like this
https://github.com/PyTorchLightning/lightning-flash/pull/239/files#diff-10dfb9297339fe1066151bc80b43feae338b80bfc185f47b951530e388b68719R38

It's tricky, this class allows multi-input by extracting multiple keys to then be passed to a single aug. So if you had my_augmentation(image=..., mask=...) then this way will work. It's not quite right as it assumes tuple inputs (which won't work with albumentations) and we will need something different again for Kornia haha. So, overall, we need something smarter here, but I don't know what yet

agree, my only point was that to apply the same transformation to different inputs we might need to hold the generated random parameters.

I think when applying random transforms to multiple keys, the transform should be responsible for correctly handling these parameters. Maybe we can somehow determine whether we should pass the keys all together to the transform (to have the transform handle this) or pass them sequentially (for deterministic stuff this is fine).

I don't really like to have framework specific stuff of just one augmentation lib (even if it's a popular one like kornia) hardcoded that deeply.
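
The two options discussed in this thread (pass the selected keys together to one transform, or apply the transform to each key in turn) can be sketched as below. This is a plain-Python illustration; `apply_to_keys`, its `together` flag and `random_shift` are hypothetical names, and the real `ApplyToKeys` in this PR subclasses `nn.Sequential`.

```python
import random
from typing import Any, Callable, Dict, Sequence


def apply_to_keys(
    sample: Dict[str, Any], keys: Sequence[str], transform: Callable[..., Any], together: bool
) -> Dict[str, Any]:
    result = dict(sample)
    if together:
        # The transform receives all values at once and is responsible for
        # sharing any random parameters between them.
        outputs = transform(*(sample[key] for key in keys))
        for key, output in zip(keys, outputs):
            result[key] = output
    else:
        # Fine for deterministic transforms; a random transform would draw
        # fresh parameters for every key.
        for key in keys:
            result[key] = transform(sample[key])
    return result


def random_shift(*values: int):
    # A single random offset is drawn once and applied to every value.
    offset = random.randint(1, 10)
    return tuple(value + offset for value in values)


sample = {"input": 0, "target": 100, "metadata": "unchanged"}
joint = apply_to_keys(sample, ["input", "target"], random_shift, together=True)
doubled = apply_to_keys(sample, ["input", "target"], lambda v: v * 2, together=False)
```

With `together=True` the input and target stay aligned because the offset is drawn once; with `together=False` that guarantee only holds for deterministic transforms.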

edgarriba · 2021-04-30T08:51:48Z

flash/vision/data.py

+        )
+
+    def load_sample(self, sample: Mapping[str, Any], dataset: Optional[Any] = None) -> Any:
+        result = {}  # TODO: this is required to avoid a memory leak, can we automate this?


also both load_sample are the same

Yeah, the issue is that each class needs to have different load_data methods but the same load_sample, I'm working on a way to clean this up haha

tchaton

Great PR. Missing tests :)

flash/core/model.py

flash/data/data_module.py

flash/data/data_pipeline.py

flash/data/data_source.py

tchaton · 2021-04-30T09:36:52Z

flash/data/data_source.py

+        self.extensions = extensions
+
+    def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
+        # TODO: Bring back the code to work out how many classes there are


what is the difference with SequenceDataSource class ?

flash/vision/classification/data.py

flash/data/data_module.py

edgarriba · 2021-05-03T09:30:46Z

flash/data/data_module.py

-        **kwargs,
+        batch_size: int = 4,
+        num_workers: Optional[int] = None,
+        **kwargs: Any,


What would these kwargs be?

flash/data/data_module.py

edgarriba

LGTM. Very few comments and questions

flash/core/model.py

flash/data/auto_dataset.py

flash/data/data_module.py

flash/data/utils.py

flash/tabular/classification/data/data.py

edgarriba · 2021-05-07T12:40:02Z

flash_examples/custom_task.py

-            train_load_data_input=(x_train, y_train),
-            test_load_data_input=(x_test, y_test),
+        dm = cls.from_data_source(
+            "numpy",


@ethanwharris now that I see this in the example, it seems a bit redundant to specify the data source type in the constructor and again here

Not sure what you mean. The "numpy" here is saying to use the numpy data source which is configured in the constructor. The default_data_source is what is used when predicting if no data source is specified.
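
As a rough illustration of this reply: the preprocess registers its data sources by name in the constructor, the caller selects one by name, and the default is only a fallback when no name is given (for example at predict time). All class and method names below are hypothetical, simplified stand-ins rather than the Flash API.

```python
from typing import Any, Dict, Optional


class NumpySourceSketch:
    def load_data(self, data: Any) -> Any:
        return {"source": "numpy", "data": data}


class PathsSourceSketch:
    def load_data(self, data: Any) -> Any:
        return {"source": "paths", "data": data}


class PreprocessSketch:
    def __init__(self, data_sources: Dict[str, Any], default_data_source: str) -> None:
        # Configured once, in the constructor.
        self.data_sources = data_sources
        self.default_data_source = default_data_source

    def get_source(self, name: Optional[str] = None) -> Any:
        # Fall back to the default only when no name was given.
        return self.data_sources[name if name is not None else self.default_data_source]


preprocess = PreprocessSketch(
    data_sources={"numpy": NumpySourceSketch(), "paths": PathsSourceSketch()},
    default_data_source="paths",
)
training = preprocess.get_source("numpy").load_data([1, 2])
predicting = preprocess.get_source().load_data("image.png")
```

In this sketch the name passed at call time and the default set in the constructor play different roles, which is the distinction the reply draws.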

flash/core/model.py

flash/data/auto_dataset.py

tchaton · 2021-05-07T11:15:10Z

flash/data/data_module.py

        )

    @classmethod
-    def from_load_data_inputs(
+    def from_folders(
        cls,


Should we expose the data source in each function? Like

cls.from_data_source( data_source or DefaultDataSources.PATHS, ...

I would vote no for now, maybe can add it if we end up implementing more data sources (e.g. if there are multiple data sources for files for a particular task)

flash/text/classification/data.py

tchaton · 2021-05-07T12:46:19Z

flash/text/classification/data.py

+
+    def collate(self, samples: Any) -> Tensor:
+        """Override to convert a set of samples to a batch"""
+        if isinstance(samples, dict):


same here ?

Same as what?

tchaton · 2021-05-07T12:50:15Z

flash/video/classification/data.py

+        }
+
+    @classmethod
+    def load_state_dict(cls, state_dict: Dict[str, Any], strict: bool) -> 'VideoClassificationPreprocess':


Can we move this to the base class ?

If we move to the base class then the user would no longer be required to implement get_state_dict and load_state_dict themselves. I played with a few options, best I could come up with was to add a transforms property so that it's just one line.
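
One way to read the "transforms property" idea, sketched under assumptions: the base class exposes the transforms as a dict, so each subclass can write `get_state_dict` as a single line while still implementing both methods itself. `PreprocessBaseSketch` and `VideoPreprocessSketch` are hypothetical and much simpler than the classes in this PR.

```python
from typing import Any, Dict, Optional


class PreprocessBaseSketch:
    def __init__(self, train_transform: Optional[Any] = None, val_transform: Optional[Any] = None) -> None:
        self.train_transform = train_transform
        self.val_transform = val_transform

    @property
    def transforms(self) -> Dict[str, Any]:
        # Shared helper so subclasses do not repeat this dict.
        return {"train_transform": self.train_transform, "val_transform": self.val_transform}


class VideoPreprocessSketch(PreprocessBaseSketch):
    def get_state_dict(self) -> Dict[str, Any]:
        return self.transforms  # the one-liner

    @classmethod
    def load_state_dict(cls, state_dict: Dict[str, Any], strict: bool) -> "VideoPreprocessSketch":
        # `strict` is accepted to mirror the signature in the diff; unused here.
        return cls(**state_dict)


original = VideoPreprocessSketch(train_transform="crop", val_transform="resize")
restored = VideoPreprocessSketch.load_state_dict(original.get_state_dict(), strict=True)
```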

flash/video/classification/data.py

tchaton · 2021-05-07T12:52:29Z

flash/vision/data.py

+class ImageNumpyDataSource(NumpyDataSource):
+
+    def load_sample(self, sample: Dict[str, Any], dataset: Optional[Any] = None) -> Dict[str, Any]:
+        sample[DefaultDataKeys.INPUT] = to_pil_image(torch.from_numpy(sample[DefaultDataKeys.INPUT]))


Does this make sense ? We are still trying to get something optimised. Are we converting from NumPy to torch to pil to torch ?

Let's adapt to_tensor_tensor to work on NumPy array directly.

I agree with @tchaton; from a user perspective that part looks a bit weird

My thinking here was that the augmentations (which expect PIL images) should still work from numpy and from tensor. This way the behaviour is identical whether you use from_numpy or from_folders. If we change it then the defaults will give an error with numpy

justusschock

I like it. So far I just don't get the following: If I want to implement a new task with a completely new datatype: What and how do I have to implement it?

flash/core/model.py

justusschock · 2021-05-07T14:10:20Z

flash/core/model.py

+
+        if isinstance(data_source, str):
+            if preprocess is None:
+                data_source = DataSource()  # TODO: warn the user that we are not using the specified data source


Can't we have some kind of registry (similar to what we do with the backbones) and look the source up there if no preprocess is given (in fact this should also be the default behaviour of the preprocess then)?

I have some ideas for a data sources registry. Not sure it makes sense to just map strings to data sources as most data sources only work with particular preprocesses.

Yes, but the registry would most likely be task specific as well (similar to backbones).
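
A task-specific registry along the lines suggested here might look like the sketch below. `DataSourceRegistry` and the task and source names are hypothetical; the backbones registry mentioned above is only the inspiration, not the API used.

```python
from typing import Callable, Dict, Tuple, Type


class DataSourceRegistry:
    """Maps (task, name) pairs to data source classes."""

    def __init__(self) -> None:
        self._sources: Dict[Tuple[str, str], Type] = {}

    def register(self, task: str, name: str) -> Callable[[Type], Type]:
        def decorator(cls: Type) -> Type:
            self._sources[(task, name)] = cls
            return cls

        return decorator

    def get(self, task: str, name: str) -> Type:
        try:
            return self._sources[(task, name)]
        except KeyError:
            # Fail loudly instead of silently falling back to a bare data source.
            raise KeyError(f"No data source {name!r} registered for task {task!r}") from None


REGISTRY = DataSourceRegistry()


@REGISTRY.register("image_classification", "numpy")
class ImageNumpySourceSketch:
    pass


@REGISTRY.register("text_classification", "csv")
class TextCsvSourceSketch:
    pass
```

Keying on the task as well as the name reflects the concern above that most data sources only work with particular preprocesses.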

justusschock · 2021-05-07T14:11:55Z

flash/data/auto_dataset.py


 if TYPE_CHECKING:
    from flash.data.data_pipeline import DataPipeline
+    from flash.data.data_source import DataSource


Not sure, but I think sphinx will have issues with forward declarations like this.

Docs build is working for now, I'm not sure we can avoid a circular import here but could maybe just import the module and type as data_source.DataSource.
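
For reference, the pattern under discussion and the alternative mentioned in the reply, sketched with standard-library modules standing in for `flash.data.data_source` so that it runs on its own. `AutoDatasetSketch` is hypothetical.

```python
import fractions  # option 2: import the module, refer to the class by qualified name
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Option 1: seen only by static type checkers; skipped at runtime,
    # so this import is never executed when the module is loaded.
    from decimal import Decimal


class AutoDatasetSketch:
    def __init__(
        self,
        guarded: Optional["Decimal"] = None,  # string annotation, resolved lazily
        qualified: Optional[fractions.Fraction] = None,  # real object at runtime
    ) -> None:
        self.guarded = guarded
        self.qualified = qualified


dataset = AutoDatasetSketch(qualified=fractions.Fraction(1, 2))
```

Note that in option 2 the annotation is evaluated when the function is defined, so whether it sidesteps a circular import depends on the import order of the two modules.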

flash/data/auto_dataset.py

flash/data/data_source.py

flash/data/process.py

justusschock · 2021-05-07T14:33:56Z

flash/data/transforms.py

+from flash.data.utils import convert_to_modules
+
+
+class ApplyToKeys(nn.Sequential):


I think when applying random transforms to multiple keys, the transform should be responsible for correctly handling these parameters. Maybe we can somehow determine whether we should pass the keys all together to the transform (to have the transform handle this) or pass them sequentially (for deterministic stuff this is fine).

I don't really like to have framework specific stuff of just one augmentation lib (even if it's a popular one like kornia) hardcoded that deeply.

ethanwharris · 2021-05-07T16:45:36Z

@edgarriba @tchaton @justusschock Thanks for all the feedback, I've now finished responding to comments, let me know if you have anything further.

@justusschock Regarding adding new tasks with custom data, the user would just implement their own DataSource and set this up in their preprocess. They can then either add a custom from_* method to their DataModule or just use the from_data_source method. An example of this is the "coco" data source in /vision/detection/data. There will be some proper docs about this coming in a subsequent PR 😃

tchaton

LGTM !

ethanwharris added 4 commits April 28, 2021 14:24

Initial commit

735740e

POC Initial commit

be01397

Remove unused code

214df85

Some fixes

8f93bfb

edgarriba reviewed Apr 30, 2021

tchaton reviewed Apr 30, 2021

ethanwharris added 6 commits April 30, 2021 18:12

Simplify data source

e8ee4c0

Expand preprocess

653057d

Fixes

0184332

Fixes

5172a06

Cleaning

5c3f597

Fixes

44d70e1

edgarriba mentioned this pull request May 2, 2021

from_csv or from_df method in ImageClassificationData #257

Closed

edgarriba linked an issue May 3, 2021 that may be closed by this pull request

clean kwargs in the data module #253

Closed

edgarriba reviewed May 3, 2021

edgarriba mentioned this pull request May 3, 2021

Explore Dask as a drop-in for Pandas for Tabular data #231

Closed

This was linked to issues May 3, 2021

Uniformize Flash API to TorchVision / TorchVideo #218

Closed

Add Standardization for Flash DataModules #165

Closed

edgarriba mentioned this pull request May 3, 2021

Customizable data pipeline for object detection #159

Closed

edgarriba linked an issue May 3, 2021 that may be closed by this pull request

from_datamodules and dataset flexibility #135

Closed

ethanwharris and others added 9 commits May 4, 2021 11:45

Remove un-needed code

08657ea

Remove sequence data source

73be792

Simplify data source

3381840

Fix FilesDataSource

e01987d

Minor fix

e385dfa

Add numpy and tesnor data sources

dc90754

Fixes

c437043

Onboard object detection

b32ee34

update

bfd320d

Borda added the enhancement New feature or request label May 5, 2021

ethanwharris added 2 commits May 7, 2021 11:41

Fixes

950b13f

Fixes

f47208c

ethanwharris changed the title ~~[WIP] [1/2] Data Sources POC~~ [1/n] Data Sources POC May 7, 2021

ethanwharris changed the title ~~[1/n] Data Sources POC~~ [1/n] Data Sources May 7, 2021

ethanwharris marked this pull request as ready for review May 7, 2021 10:51

ethanwharris requested review from Borda, carmocca, edenlightning, justusschock and kaushikb11 as code owners May 7, 2021 10:51

ethanwharris changed the title ~~[1/n] Data Sources~~ [1/N] Data Sources May 7, 2021

edgarriba approved these changes May 7, 2021

tchaton reviewed May 7, 2021

ethanwharris added 2 commits May 7, 2021 14:51

Respond to comments

db0c991

feedback

db1cdf1

justusschock reviewed May 7, 2021

ethanwharris added 8 commits May 7, 2021 16:08

Updates

88cbc65

Fixes

4ee1dd4

Fixes

ce3fcf2

revert

ed22b10

Updates

f453d03

Fixes

1ae8c56

Fixes

1088022

Fixes

9032be4

tchaton approved these changes May 7, 2021

tchaton merged commit ce63fd7 into master May 7, 2021

tchaton deleted the feature/data_sources branch May 7, 2021 16:51

edgarriba mentioned this pull request May 7, 2021

[2/N] Data Sources #264

Merged

8 tasks

ethanwharris mentioned this pull request May 9, 2021

ImageClassificationData.from_arrays #151

Closed
