Conversation
flash/data/data_source.py (Outdated)
```
self,
data: Any,
dataset: Optional[Any] = None
) -> Iterable[Mapping[str, Any]]:  # TODO: decide what type this should be
```
Where is the string supposed to point: specific files, images? It's a bit confusing what the `load_data` and `load_sample` methods are doing.
Strings are the keys. So `data` can be any list of dicts like `{'input': ..., 'target': ...}`. The `load_data` and `load_sample` methods are the same as before, they just now have to work with mappings rather than tuples, but I still need to add back the docstring 😃
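To illustrate the shape, here is a minimal sketch (hypothetical class and key names, not the actual Flash implementation): `load_data` produces the iterable of mappings, and `load_sample` is applied per entry.

```python
from typing import Any, Iterable, Mapping, Optional

class TupleDataSource:
    """Hypothetical data source: turns an (inputs, targets) pair into mappings."""

    def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
        inputs, targets = data  # assumes data is an (inputs, targets) tuple
        return [{'input': x, 'target': y} for x, y in zip(inputs, targets)]

    def load_sample(self, sample: Mapping[str, Any], dataset: Optional[Any] = None) -> Mapping[str, Any]:
        # Per-sample decoding would happen here; identity in this sketch.
        return sample
```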
flash/data/data_source.py (Outdated)
```
def __init__(
    self,
    train_folder: Optional[Union[str, pathlib.Path, list]] = None,
```
I vote to only support primitive types here: `list` and `list(str)`.
Yeah, the typing is wrong here as it can't be a list anyway, but we should maybe just do `str`.
flash/data/data_source.py (Outdated)
```
def __init__(
    self,
    train_files: Optional[Union[Sequence[Union[str, pathlib.Path]], Iterable[Union[str, pathlib.Path]]]] = None,
```
Same as before, I would simplify the inputs to primitive types.
Yeah, this should just be a sequence of `str`.
flash/data/data_source.py (Outdated)
```
self.extensions = extensions

def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
    # TODO: Bring back the code to work out how many classes there are
```
`_load_data_from_tuple`, etc.?
Not sure what you mean here. `data` should always be a tuple; it just can sometimes not be a tuple if we are predicting.
What is the difference with the `SequenceDataSource` class?
```
from flash.data.utils import convert_to_modules


class ApplyToKeys(nn.Sequential):
```
This will work only for single inputs. E.g. in segmentation, or tasks that require augmenting not only the inputs, we still need something like this:
https://github.com/PyTorchLightning/lightning-flash/pull/239/files#diff-10dfb9297339fe1066151bc80b43feae338b80bfc185f47b951530e388b68719R38
It's tricky: this class allows multi-input by extracting multiple keys which are then passed to a single aug. So if you had `my_augmentation(image=..., mask=...)` then this way will work. It's not quite right as it assumes tuple inputs (which won't work with albumentations), and we will need something different again for Kornia haha. So, overall, we need something smarter here, but I don't know what yet.
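To make the multi-key idea concrete, here is a rough sketch of the pattern being described (illustrative only; the real `ApplyToKeys` may differ, and the tuple-passing caveat above still applies):

```python
from typing import Any, Dict, Sequence, Union

import torch.nn as nn

class ApplyToKeysSketch(nn.Sequential):
    """Extracts the given keys from a sample dict, applies the wrapped
    transform(s), and writes the results back under the same keys."""

    def __init__(self, keys: Union[str, Sequence[str]], *transforms: nn.Module):
        super().__init__(*transforms)
        self.keys = [keys] if isinstance(keys, str) else list(keys)

    def forward(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        inputs = [sample[key] for key in self.keys]
        # Single key: pass the value through directly; multiple keys: pass a
        # tuple (which, as noted above, won't suit every augmentation library).
        outputs = super().forward(inputs[0] if len(inputs) == 1 else tuple(inputs))
        outputs = [outputs] if len(self.keys) == 1 else list(outputs)
        for key, output in zip(self.keys, outputs):
            sample[key] = output
        return sample
```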
Agree, my only point was that to apply the same transformation to different inputs we might need to hold on to the generated random parameters.
I think when applying random transforms to multiple keys, the transform should be responsible for correctly handling these parameters. Maybe we can somehow determine whether we should pass the keys altogether to the transform (to have the transform handle this) or pass them sequentially (for deterministic stuff this is fine).
I don't really like having framework-specific stuff from just one augmentation lib (even if it's a popular one like Kornia) hardcoded that deeply.
flash/vision/data.py (Outdated)
```
)

def load_sample(self, sample: Mapping[str, Any], dataset: Optional[Any] = None) -> Any:
    result = {}  # TODO: this is required to avoid a memory leak, can we automate this?
```
Also, both `load_sample` methods are the same.
Yeah, the issue is that each class needs to have a different `load_data` method but the same `load_sample`. I'm working on a way to clean this up haha.
Great PR. Missing tests :)
flash/data/data_source.py (Outdated)
```
self.extensions = extensions

def load_data(self, data: Any, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
    # TODO: Bring back the code to work out how many classes there are
```
What is the difference with the `SequenceDataSource` class?
flash/data/data_module.py (Outdated)
```
**kwargs,
batch_size: int = 4,
num_workers: Optional[int] = None,
**kwargs: Any,
```
What would this `kwargs` be?
LGTM. Very few comments and questions.
```
train_load_data_input=(x_train, y_train),
test_load_data_input=(x_test, y_test),
dm = cls.from_data_source(
    "numpy",
```
@ethanwharris now seeing the example, it seems a bit redundant to specify the data source type in the constructor and again here.
Not sure what you mean. The `"numpy"` here is saying to use the numpy data source which is configured in the constructor. The `default_data_source` is what is used when predicting if no data source is specified.
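For context, a rough usage sketch; the class and argument names are taken from the surrounding discussion and should be treated as illustrative rather than a stable API:

```python
import numpy as np

# Illustrative import path for this era of the codebase.
from flash.vision import ImageClassificationData

x_train, y_train = np.random.rand(10, 3, 32, 32), np.random.randint(0, 2, 10)

# "numpy" selects the numpy data source registered in the constructor;
# default_data_source only kicks in when predicting without an explicit source.
dm = ImageClassificationData.from_data_source(
    "numpy",
    train_data=x_train,      # hypothetical argument names
    train_targets=y_train,
)
```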
```
)

@classmethod
def from_load_data_inputs(
def from_folders(
    cls,
```
Should we expose the data source in each of these functions? Like: `cls.from_data_source(data_source or DefaultDataSources.PATHS, ...)`
I would vote no for now; maybe we can add it if we end up implementing more data sources (e.g. if there are multiple data sources for files for a particular task).
```
def collate(self, samples: Any) -> Tensor:
    """Override to convert a set of samples to a batch"""
    if isinstance(samples, dict):
```
Same here?
Same as what?
```
}

@classmethod
def load_state_dict(cls, state_dict: Dict[str, Any], strict: bool) -> 'VideoClassificationPreprocess':
```
Can we move this to the base class?
If we move it to the base class then the user would no longer be required to implement `get_state_dict` and `load_state_dict` themselves. I played with a few options; the best I could come up with was to add a `transforms` property so that it's just one line.
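Roughly what that could look like (a sketch; the `transforms` property and the reconstruction via `cls(**state_dict)` are illustrative assumptions):

```python
from typing import Any, Dict

from flash.data.process import Preprocess  # assumed import path for this era

class VideoClassificationPreprocessSketch(Preprocess):

    def get_state_dict(self) -> Dict[str, Any]:
        # One line: the transforms property gathers everything to serialize.
        return self.transforms

    @classmethod
    def load_state_dict(cls, state_dict: Dict[str, Any], strict: bool) -> 'VideoClassificationPreprocessSketch':
        # One line: rebuild the preprocess from the saved transforms.
        return cls(**state_dict)
```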
```
class ImageNumpyDataSource(NumpyDataSource):

    def load_sample(self, sample: Dict[str, Any], dataset: Optional[Any] = None) -> Dict[str, Any]:
        sample[DefaultDataKeys.INPUT] = to_pil_image(torch.from_numpy(sample[DefaultDataKeys.INPUT]))
```
Does this make sense? We are still trying to get something optimised. Are we converting from NumPy to torch to PIL to torch?
Let's adapt `to_tensor_transform` to work on NumPy arrays directly.
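E.g. something along these lines (a hypothetical adaptation, assuming the transform sees the raw value):

```python
import numpy as np
import torch

def to_tensor_transform(value):
    # Accept NumPy arrays (and tensors) directly, skipping the PIL round-trip.
    if isinstance(value, np.ndarray):
        value = torch.from_numpy(value)
    return value.float()
```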
I agree with @tchaton; that part looks a bit weird from the user perspective.
My thinking here was that the augmentations (which expect PIL images) should still work from numpy and from tensor. This way the behaviour is identical whether you use `from_numpy` or `from_folders`. If we change it then the defaults will give an error with numpy.
I like it. So far I just don't get the following: if I want to implement a new task with a completely new data type, what do I have to implement, and how?
```
if isinstance(data_source, str):
    if preprocess is None:
        data_source = DataSource()  # TODO: warn the user that we are not using the specified data source
```
Can't we have some kind of registry (similar to what we do with the backbones) and look the source up there if no preprocess is given? (In fact, this should then also be the default behaviour of the preprocess.)
I have some ideas for a data sources registry. Not sure it makes sense to just map strings to data sources as most data sources only work with particular preprocesses.
Yes, but the registry would most likely be task specific as well (similar to backbones).
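As a concrete sketch of a task-scoped registry (names invented for illustration, mirroring how the backbone registries map strings to factories):

```python
from typing import Callable, Dict

class DataSourceRegistry:
    """Hypothetical per-task registry mapping names to data source factories."""

    def __init__(self, task_name: str):
        self.task_name = task_name
        self._sources: Dict[str, Callable] = {}

    def register(self, name: str, factory: Callable) -> None:
        self._sources[name] = factory

    def get(self, name: str):
        if name not in self._sources:
            raise KeyError(f"No data source '{name}' registered for task '{self.task_name}'")
        return self._sources[name]()

# e.g. image_sources = DataSourceRegistry("image_classification")
#      image_sources.register("numpy", ImageNumpyDataSource)
```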
```
if TYPE_CHECKING:
    from flash.data.data_pipeline import DataPipeline
    from flash.data.data_source import DataSource
```
Not sure, but I think Sphinx will have issues with forward declarations like this.
The docs build is working for now. I'm not sure we can avoid a circular import here, but we could maybe just import the module and type it as `data_source.DataSource`.
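The module-import variant would look roughly like this (a sketch of the standard pattern, not a specific change in this PR):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Import the module instead of the class; annotations then reference the
    # qualified name as a string, which avoids the circular import at runtime.
    from flash.data import data_source

def attach_data_source(source: 'data_source.DataSource') -> None:
    ...
```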
@edgarriba @tchaton @justusschock Thanks for all the feedback, I've now finished responding to comments, let me know if you have anything further. @justusschock Regarding adding new tasks with custom data, the user would just implement their own `DataSource`.
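For example, a rough sketch of what a custom data source for a new data type could look like (the CSV reading and key names here are hypothetical):

```python
import csv
from typing import Any, Iterable, Mapping, Optional

from flash.data.data_source import DataSource

class MyCSVDataSource(DataSource):
    """Hypothetical data source: turns a CSV of (path, label) rows into samples."""

    def load_data(self, data: str, dataset: Optional[Any] = None) -> Iterable[Mapping[str, Any]]:
        with open(data, newline='') as f:
            return [{'input': row['path'], 'target': row['label']} for row in csv.DictReader(f)]

    def load_sample(self, sample: Mapping[str, Any], dataset: Optional[Any] = None) -> Mapping[str, Any]:
        # Decode the per-sample payload here (e.g. load the file at sample['input']).
        return sample
```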
LGTM!
What does this PR do?
Adds data sources. Docs and new tests to come in later PRs.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃