More explicit dataset outputs [FEATURE] #196

Closed
melisande-c opened this issue Aug 2, 2024 · 1 comment
Labels
feature New feature or request

Comments

@melisande-c
Member

melisande-c commented Aug 2, 2024

Proposal for more explicit dataset outputs / model inputs.

Motivation
Currently, the datasets output tuples whose elements are some combination of arrays and extra metadata such as TileInformation. This makes it unclear at some points in the code what each object is. Additionally, for the predict_to_disk method being worked on in #189, it would be convenient if the file path of the sample/tile could also be returned.

Currently the different cases for dataset outputs are:
N2V: (masked_patch, original_patch, mask)
Supervised: (patch, target_patch)
Image prediction: (image, )
Tiled prediction: (tile, tile_information)

There are potentially 4 different solutions: dictionaries, named tuples, dataclasses and xarrays.

Dictionaries
Pros:

  • Easily extendable, i.e. if we need to return more things we can easily add another key.
  • Default collate function works with dicts.

Cons:

  • Would have to change a few places where dataset outputs are unpacked.
  • Potentially not as explicit, but we won't need different classes for different outputs.
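
As a rough illustration of the first con, here is a minimal sketch of how a call site would change when moving from tuple outputs to dict outputs. The plain lists stand in for arrays, and the file path is a made-up placeholder; neither is taken from the codebase.

```python
# Stand-in values; in the real dataset these would be arrays.
masked_patch, original_patch, mask = [0, 1], [1, 1], [1, 0]

# Current tuple output: call sites unpack by position.
sample_tuple = (masked_patch, original_patch, mask)
x, target, m = sample_tuple

# Proposed dict output: call sites look values up by key instead.
sample_dict = {"input": masked_patch, "target": original_patch, "mask": mask}
x, target, m = sample_dict["input"], sample_dict["target"], sample_dict["mask"]

# Adding a new key later does not break the key-based call sites above.
sample_dict["file_path"] = "hypothetical/path.tiff"  # placeholder path

# Caveat: iterating or unpacking a dict directly yields its keys, not its values.
keys = list(sample_dict)
```

Any place that still does positional unpacking on a dict would silently receive the key strings, so those call sites are the ones that would need updating.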

Named tuples
Pros:

  • Unpacks the same as tuples.

Cons:

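  • No type hinting of attributes when using collections.namedtuple; typing.NamedTuple is needed for annotations.
  • Default values need the defaults argument (Python 3.7+) or typing.NamedTuple.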
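
For comparison, a minimal sketch of the named-tuple option written with typing.NamedTuple from the standard library, which allows annotated fields and default values. The class name SupervisedSample and the optional file_path field are hypothetical, and the lists stand in for arrays.

```python
from typing import Any, NamedTuple, Optional


class SupervisedSample(NamedTuple):  # hypothetical name, for illustration only
    patch: Any  # would be an NDArray in practice
    target_patch: Any  # would be an NDArray in practice
    file_path: Optional[str] = None  # field with a default value


sample = SupervisedSample(patch=[0, 1], target_patch=[1, 1])

# Still a tuple, so it unpacks by position like the current outputs.
patch, target_patch, file_path = sample
```

One thing to keep in mind is that adding a field changes the tuple length, so existing call sites that unpack exactly two values would fail once a third field is introduced.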

Dataclasses
Pros:

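  • Attributes are accessed by name, with type hints and default values; tuple-style unpacking requires dataclasses.astuple or a custom __iter__.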

Cons:

  • Not compatible with torch.utils.data.dataloader.default_collate so have to write our own.
  • Having different classes for the different cases adds bloat to the codebase.
  • Not easily extendable.
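
To give a feel for what "write our own" collate function might involve, here is a minimal sketch using only the standard library. The function name collate_dataclass is hypothetical, and it only gathers each field into a list; a real version would presumably stack the array fields into batched tensors instead.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class SupervisedInput:
    input: Any  # would be an NDArray in practice
    target: Any  # would be an NDArray in practice
    metadata: dict = field(default_factory=dict)


def collate_dataclass(batch):
    """Gather each field across the batch into a list (hypothetical helper).

    The result is an instance of the same dataclass whose fields hold lists
    of per-sample values. Array fields are NOT stacked here; a real collate
    function would do that step (for example with torch.stack).
    """
    if not batch:
        raise ValueError("Cannot collate an empty batch.")
    cls = type(batch[0])
    gathered = {
        f.name: [getattr(sample, f.name) for sample in batch] for f in fields(cls)
    }
    return cls(**gathered)


batch = [
    SupervisedInput(input=[0], target=[1]),
    SupervisedInput(input=[2], target=[3], metadata={"file_path": "a.tiff"}),
]
collated = collate_dataclass(batch)
```

Such a function would be passed to the DataLoader through its collate_fn argument.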

XArray
Pros:

  • Can keep metadata (e.g. TileInfo, file_path) in separate attrs attribute.

Cons:

  • Not compatible with torch.utils.data.dataloader.default_collate so have to write our own.
  • Writing collate function will be annoying with the .attrs attribute.
  • Having different classes for the different cases adds bloat to the codebase.
  • Not easily extendable.

Summary and Examples

My opinion is that dictionaries or dataclasses are a good option. For dataclasses I think we could try to unify the different cases a little, with potentially 3 instead of 4, as follows.

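from dataclasses import dataclass, field

from numpy.typing import NDArray

@dataclass
class N2VInput:
    input: NDArray # masked patch
    target: NDArray # original patch
    mask: NDArray
    # for potential future use or other applications;
    # default_factory is needed because a mutable default ({}) raises ValueError
    metadata: dict = field(default_factory=dict)

@dataclass
class SupervisedInput:
    input: NDArray
    target: NDArray
    # for potential future use or other applications
    metadata: dict = field(default_factory=dict)

@dataclass
class PredictionInput:
    input: NDArray
    metadata: dict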

The PredictionInput.metadata attribute can contain tile information or the file path when applicable.

For dictionaries we can have:

n2v_input = {
    "input": masked_patch,
    "target": original_patch,
    "mask": mask
}

supervised_input = {
    "input": patch,
    "target": target_patch
}

image_prediction = {
    "input": input
}

image_file_prediction = {
    "input": input,
    "file_path": file_path
}

tiled_file_prediction = {
    "input": tile,
    "tile_information": tile_information,
    "file_path": file_path
}

^ Ignore my naming of the dictionaries

EDIT: Different options

@melisande-c melisande-c added the feature New feature or request label Aug 2, 2024
@melisande-c
Member Author

Closing because this is no longer relevant; the problems mentioned regarding the predict-to-disk functionality can be addressed in other ways.
