Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

[1/N] Data Sources #256

Merged
merged 84 commits into from
May 7, 2021
Merged

[1/N] Data Sources #256

merged 84 commits into from
May 7, 2021

Conversation

ethanwharris
Copy link
Collaborator

@ethanwharris ethanwharris commented Apr 29, 2021

What does this PR do?

Adds data sources, docs and new tests to come in later PRs

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

flash/core/model.py Outdated Show resolved Hide resolved
self,
data: Any,
dataset: Optional[Any] = None
) -> Iterable[Mapping[str, Any]]: # TODO: decide what type this should be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does string suppose to point, specific files, images, ? It's a bit confusing what the load_data and load_sample methods are doing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are the keys. So data can be any list of dicts like {'input': ..., 'target': ...}. So the load_data and load_sample are the same as before, just now have to be mappings rather than tuples, but I still need to add back the docstring 😃


def __init__(
self,
train_folder: Optional[Union[str, pathlib.Path, list]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote to only support primitive types here list and list(str)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, typing is wrong here as it can't be a list anyway, but we should maybe just do str

flash/data/data_source.py Show resolved Hide resolved

def __init__(
self,
train_files: Optional[Union[Sequence[Union[str, pathlib.Path]], Iterable[Union[str, pathlib.Path]]]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before, I would simplify to primitive types the inputs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this should just be sequence of str

self.extensions = extensions

def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
# TODO: Bring back the code to work out how many classes there are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_load_data_from_tuple , etc ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here, data should always be a tuple, just can sometimes be not a tuple if we are predicting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference with SequenceDataSource class ?

from flash.data.utils import convert_to_modules


class ApplyToKeys(nn.Sequential):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will work only for single inputs. E.g. in segmentation or tasks that require augmenting not only the inputs we still need something like this
https://github.com/PyTorchLightning/lightning-flash/pull/239/files#diff-10dfb9297339fe1066151bc80b43feae338b80bfc185f47b951530e388b68719R38

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tricky, this class allows multi-input by extracting mutliple keys to then be passed to a single aug. So if you had my_augmentation(image=..., mask=...) then this way will work. It's not quite right as it assumes tuple inputs (which won't work with albumenmtations) and we will need something different again for Kornia haha. So, overall, we need something smarter here, but I don't know what yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, my only point was that to apply the same transformation to different inputs we might need to hold the generated random parameters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when applying random transforms to multiple keys, the transform should be responsible of correctly handling these parameters. Maybe we can somehow determine, whether we should pass the keys altogether to the transform (to have the transform handle this) or to pass them sequentially (for deterministic stuff this is fine).

I don't really like to have framework specific stuff of just one augmentation lib (even if it's a popular one like kornia) hardcoded that deeply.

)

def load_sample(self, sample: Mapping[str, Any], dataset: Optional[Any] = None) -> Any:
result = {} # TODO: this is required to avoid a memory leak, can we automate this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also both load_sample are the same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the issue is that each class needs to have different load_data methods but the same load_sample, I'm working on a way to clean this up haha

Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR. Missing tests :)

flash/core/model.py Outdated Show resolved Hide resolved
flash/data/data_module.py Outdated Show resolved Hide resolved
flash/data/data_pipeline.py Outdated Show resolved Hide resolved
flash/data/data_source.py Outdated Show resolved Hide resolved
flash/data/data_source.py Outdated Show resolved Hide resolved
self.extensions = extensions

def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
# TODO: Bring back the code to work out how many classes there are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference with SequenceDataSource class ?

flash/vision/classification/data.py Outdated Show resolved Hide resolved
flash/vision/classification/data.py Outdated Show resolved Hide resolved
flash/vision/classification/data.py Outdated Show resolved Hide resolved
flash/data/data_module.py Outdated Show resolved Hide resolved
@edgarriba edgarriba linked an issue May 3, 2021 that may be closed by this pull request
flash/data/data_module.py Outdated Show resolved Hide resolved
**kwargs,
batch_size: int = 4,
num_workers: Optional[int] = None,
**kwargs: Any,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would this kwargs be ?

flash/data/data_module.py Outdated Show resolved Hide resolved
@Borda Borda added the enhancement New feature or request label May 5, 2021
@ethanwharris ethanwharris changed the title [WIP] [1/2] Data Sources POC [1/n] Data Sources POC May 7, 2021
@ethanwharris ethanwharris changed the title [1/n] Data Sources POC [1/n] Data Sources May 7, 2021
@ethanwharris ethanwharris marked this pull request as ready for review May 7, 2021 10:51
@ethanwharris ethanwharris changed the title [1/n] Data Sources [1/N] Data Sources May 7, 2021
Copy link
Contributor

@edgarriba edgarriba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Very few comments and questions

flash/core/model.py Outdated Show resolved Hide resolved
flash/core/model.py Outdated Show resolved Hide resolved
flash/data/auto_dataset.py Outdated Show resolved Hide resolved
flash/data/data_module.py Outdated Show resolved Hide resolved
flash/data/data_module.py Outdated Show resolved Hide resolved
flash/data/utils.py Outdated Show resolved Hide resolved
flash/tabular/classification/data/data.py Show resolved Hide resolved
flash/tabular/classification/data/data.py Show resolved Hide resolved
flash/tabular/classification/data/data.py Outdated Show resolved Hide resolved
train_load_data_input=(x_train, y_train),
test_load_data_input=(x_test, y_test),
dm = cls.from_data_source(
"numpy",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ethanwharris now seeing this the example - I see a bit redundant to specify in the constructor the data source type and again here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. The "numpy" here is saying to use the numpy data source which is configured in the constructor. The default_data_source is what is used when predicting if no data source is specified.

flash/core/model.py Outdated Show resolved Hide resolved
flash/core/model.py Outdated Show resolved Hide resolved
flash/core/model.py Show resolved Hide resolved
flash/data/auto_dataset.py Show resolved Hide resolved
)

@classmethod
def from_load_data_inputs(
def from_folders(
cls,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose the datasource in each functions ? like

cls.from_data_source(
             data_source or DefaultDataSources.PATHS, ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote no for now, maybe can add it if we end up implementing more data sources (e.g. if there are multiple data sources for files for a particular task)

flash/text/classification/data.py Outdated Show resolved Hide resolved

def collate(self, samples: Any) -> Tensor:
"""Override to convert a set of samples to a batch"""
if isinstance(samples, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as what?

}

@classmethod
def load_state_dict(cls, state_dict: Dict[str, Any], strict: bool) -> 'VideoClassificationPreprocess':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to the base class ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move to the base class then the user would no longer be required to implement get_state_dict and load_state_dict themselves. I played with a few options, best I could come up with was to add a transforms property so that it's just one line.

flash/video/classification/data.py Outdated Show resolved Hide resolved
class ImageNumpyDataSource(NumpyDataSource):

def load_sample(self, sample: Dict[str, Any], dataset: Optional[Any] = None) -> Dict[str, Any]:
sample[DefaultDataKeys.INPUT] = to_pil_image(torch.from_numpy(sample[DefaultDataKeys.INPUT]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this make sense ? We are still trying to get something optimised. Are we converting from NumPy to torch to pil to torch ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's adapt to_tensor_tensor to work on NumPy array directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tchaton that part from user perspective looks a bit weird

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking here was that the augmentations (which expect PIL images) should still work from numpy and from tensor. This way the behaviour is identical whether you use from_numpy or from_folders. If we change it then the defaults will give an error with numpy

Copy link
Member

@justusschock justusschock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. So far I just don't get the following: If I want to implement a new task with a completely new datatype: What and how do I have to implement it?

flash/core/model.py Outdated Show resolved Hide resolved

if isinstance(data_source, str):
if preprocess is None:
data_source = DataSource() # TODO: warn the user that we are not using the specified data source
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we have some kind of registry (similar to what we do with the backbones) and look the source up there if no preprocess is given (in fact this should also be the default behaviour of the preprocess then)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some ideas for a data sources registry. Not sure it makes sense to just map strings to data sources as most data sources only work with particular preprocesses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the registry would most likely be task specific as well (similar to backbones).


if TYPE_CHECKING:
from flash.data.data_pipeline import DataPipeline
from flash.data.data_source import DataSource
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but I think sphinx will have issues with forward declarations like this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs build is working for now, I'm not sure we can avoid a circular import here but could maybe just import the module and type as data_source.DataSource.

flash/data/auto_dataset.py Outdated Show resolved Hide resolved
flash/data/auto_dataset.py Outdated Show resolved Hide resolved
flash/data/data_source.py Show resolved Hide resolved
flash/data/data_source.py Outdated Show resolved Hide resolved
flash/data/data_source.py Outdated Show resolved Hide resolved
flash/data/process.py Outdated Show resolved Hide resolved
from flash.data.utils import convert_to_modules


class ApplyToKeys(nn.Sequential):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when applying random transforms to multiple keys, the transform should be responsible of correctly handling these parameters. Maybe we can somehow determine, whether we should pass the keys altogether to the transform (to have the transform handle this) or to pass them sequentially (for deterministic stuff this is fine).

I don't really like to have framework specific stuff of just one augmentation lib (even if it's a popular one like kornia) hardcoded that deeply.

@ethanwharris
Copy link
Collaborator Author

ethanwharris commented May 7, 2021

@edgarriba @tchaton @justusschock Thanks for all the feedback, I've now finished responding to comments, let me know if you have anything further.

@justusschock Regarding adding new tasks with custom data, the user would just implement their own DataSource and set this up in their preprocess. They can then either add a custom from_* method to their DataModule or just use the from_data_source method. An example of this is the "coco" data source in /vision/detection/data. There will be some proper docs about this coming in a subsequent PR 😃

Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@tchaton tchaton merged commit ce63fd7 into master May 7, 2021
@tchaton tchaton deleted the feature/data_sources branch May 7, 2021 16:51
@edgarriba edgarriba mentioned this pull request May 7, 2021
8 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
6 participants