
[FR] Samples which extend samples from other datasets #5275

Open
jnewb1 opened this issue Dec 13, 2024 · 3 comments
Labels
feature Work on a feature request

Comments

@jnewb1
Contributor

jnewb1 commented Dec 13, 2024

Proposal Summary

I would like to be able to have samples which extend other samples in another dataset. For instance, we have images which have different types of annotations or labels. Ideally, our base dataset contains all the raw images, and every other dataset simply references these entries. This would allow us to modify our base dataset and have those changes reflected in all of the other annotated datasets. Currently, we have to copy every sample to every dataset, including metadata, embeddings, etc., which means our database gets quite large.

What areas of FiftyOne does this feature affect?

  • App: FiftyOne application
  • Core: Core fiftyone Python library
  • Server: FiftyOne server

Details

I think there should be a sample type called fo.SampleReference, where you pass in an existing sample as well as any additional primitives/labels that are specific to the new sample. When you query the dataset, you get a sample with both the base labels and any labels stored on the SampleReference. Any fields from the base sample should be read-only.

For instance:
  • Dataset 1: Base - contains the raw images and ImageMetadata
  • Dataset 2: Labels - contains a reference to a sample in Base, plus detection labels from one model (say, YOLO)
  • Dataset 3: Labels2 - contains a reference to a sample in Base, plus detection labels from a different model

Now two parties can independently work on and improve the two datasets while sharing the same base dataset. If you add clip_vit_base32 embeddings to the base dataset, both datasets can use them for filtering, searching, etc. And if your Labels2 dataset is quite small (maybe 1/10th the size of Labels), you get much better query performance than if you had combined everything into a single dataset.
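To make the proposal concrete, a hypothetical API could look something like this (pseudocode; `fo.SampleReference` does not exist today, and all names here are illustrative):

```
import fiftyone as fo

base = fo.load_dataset("base")     # Dataset 1: raw images + ImageMetadata
labels = fo.Dataset("labels")      # Dataset 2: model detections

for sample in base:
    # Hypothetical: wrap the base sample and attach new, dataset-local fields
    ref = fo.SampleReference(
        sample,
        predictions=fo.Detections(...),  # labels specific to this dataset
    )
    labels.add_sample(ref)

# Querying `labels` would yield merged samples: base fields (read-only)
# plus the fields stored on the reference
sample = labels.first()
sample.metadata       # resolved from the base dataset
sample.predictions    # stored on the reference
```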

Willingness to contribute

The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently
  • Yes. I would be willing to contribute this feature with guidance from the FiftyOne community
  • No. I cannot contribute this feature at this time
@brimoor
Contributor

brimoor commented Dec 15, 2024

Hi @jnewb1 👋

The motivation behind this feature request definitely makes sense. You have a single source of truth for certain data and you want this to be available (and shared) on multiple other datasets without having to manually keep the downstream datasets in sync with the source dataset.

Suppose dataset2 inherits some fields from dataset1 by reference.

Some design questions:

  1. Should these fields be editable on dataset2?
    • If so, then this should be immediately reflected on dataset1
    • If not, then reference fields would need to be read-only on dataset2
  2. Should reference fields be deletable on dataset2? (ie dataset2.delete_sample_field(ref_field))
    • Presumably this would not delete the underlying field from dataset1

Adding fields to a dataset also carries some interface implications that must be satisfied in order for the dataset to work with the rest of the FiftyOne data model:

  1. Reference fields must be reflected in the dataset's schema (ie dataset2.get_field_schema() should include all reference fields)
  2. If a reference field is deleted from dataset1, it should be immediately deleted from dataset2.get_field_schema()
  3. If a dataset with reference fields is loaded in the App, the reference fields should appear in the sidebar
  4. Reference fields must be included in Dataset._pipeline() and DatasetView._pipeline() so that dataset2.iter_samples() and dataset2.view().iter_samples() will include the fields
  5. Reference fields must be visualizable in the sample grid
  6. Users must be able to create views (eg dataset2.filter_labels(ref_field, ...) and dataset2.to_patches(ref_field)) on reference fields. This more or less happens for free if Implication 4 is satisfied
  7. Users must be able to run aggregations (eg dataset2.bounds(ref_field) or dataset2.count_values(ref_field)) on reference fields. This also more or less happens for free if Implication 4 is satisfied

Some observations:

  • Implication 2 would take some effort, but is doable.
  • Implication 3 basically happens "for free" as long as Implication 1 is true.
  • Implications 5-7 basically happen "for free" as long as Implication 4 is true.

Implication 4 is the concerning one. The natural way to achieve it would be to prepend a $lookup stage to the pipelines to pull in reference fields. But $lookup is generally slow in MongoDB and may have negative performance implications, such as losing the ability (or requiring extra work) to leverage database indexes for fast querying.
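For reference, the prepended stage described above might look roughly like the following (a sketch, assuming the referenced values live in dataset1's backing sample collection and can be joined on `filepath`; the collection name is illustrative):

```python
# Sketch of a $lookup stage that could be prepended to dataset2's
# aggregation pipeline to pull a reference field in from dataset1's
# sample collection. Collection name and join key are assumptions.
lookup_stage = {
    "$lookup": {
        "from": "samples.dataset1",   # dataset1's backing collection
        "localField": "filepath",     # join key on dataset2
        "foreignField": "filepath",   # join key on dataset1
        "as": "_ref",                 # temporary array of matches
    }
}

# Follow-up stages to promote the joined field and drop the temp array
set_stage = {"$set": {"ground_truth": {"$first": "$_ref.ground_truth"}}}
unset_stage = {"$unset": "_ref"}

pipeline = [lookup_stage, set_stage, unset_stage]
```

Even in this simple form, the join happens per-sample at query time, which is why index usage becomes the key performance question.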

TLDR, this will be a complex feature to properly implement 🤓

@jnewb1
Contributor Author

jnewb1 commented Dec 15, 2024

Hi, thanks for the comment :)

I agree it will be somewhat complex. I made a basic PR over in #5277 which passes some simple tests and uses a reference field. I'd like to do more benchmarking to see what impact the $lookup has.
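A minimal harness for that kind of benchmarking could look like this (a sketch; `run_query` is a placeholder for whatever view or aggregation is being timed, e.g. a count_values call against the dataset with and without the $lookup):

```python
import statistics
import time

def benchmark(fn, repeats=5):
    """Times `fn` several times and returns (median, best) seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)

# Placeholder for the query under test, e.g.:
#   run_query = lambda: dataset2.count_values("ground_truth.detections.label")
run_query = lambda: sum(range(10_000))

median_s, best_s = benchmark(run_query)
print(f"median={median_s:.6f}s best={best_s:.6f}s")
```

Comparing the median across several repeats helps smooth out cache and connection-warmup effects when timing database-backed queries.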

  • I think any fields that are referenced should be read-only and not deletable. If you want to modify them, you should be forced to modify the base dataset.
  • I think the reference dataset should only store fields that are added to the SampleReference, and refer to the base dataset for any additional fields. That way the schema will always be the most up-to-date version.
  • Other than the fact that they are read-only, I think the fields should appear just the same as normal fields to any downstream functionality of FiftyOne, such as the App, aggregations, etc.

@brimoor
Contributor

brimoor commented Dec 15, 2024

Here's a functional version of read-only reference fields, implemented by copying the values rather than dynamically looking them up.


The benefit of this approach is that it's minimally complex to implement and as fast as possible to use. The downside, of course, is that you have to manually check + sync the reference field whenever the source dataset is updated.

from datetime import datetime

import fiftyone as fo

def create_reference_field(dataset, src_dataset, ref_field):
    """Adds a read-only `ref_field` to `dataset` whose values are sourced from
    the `ref_field` of `src_dataset`.
    """
    values = dict(zip(*src_dataset.values(["filepath", ref_field])))
    dataset.set_values(ref_field, values, key_field="filepath")
    field = dataset.get_field(ref_field)
    field.read_only = True
    field.info = {
        "source_dataset": src_dataset.name,
        "last_modified_at": datetime.utcnow(),
    }
    field.save()


def list_reference_fields(dataset):
    """Lists the reference fields on the given dataset."""
    return [
        path
        for path, field in dataset.get_field_schema().items()
        if "source_dataset" in (field.info or {})
    ]


def check_reference_field(dataset, ref_field):
    """Returns True/False whether the reference field needs updating."""
    field = dataset.get_field(ref_field)
    src_dataset = fo.load_dataset(field.info["source_dataset"])
    return field.info["last_modified_at"] < src_dataset.max("last_modified_at")


def update_reference_field(dataset, ref_field):
    """Updates the reference field on the dataset with the current values from
    the source dataset.
    """
    field = dataset.get_field(ref_field)
    field.read_only = False
    field.save()

    try:
        field = dataset.get_field(ref_field)
        src_dataset = fo.load_dataset(field.info["source_dataset"])
        values = dict(zip(*src_dataset.values(["filepath", ref_field])))
        dataset.set_values(ref_field, values, key_field="filepath")
        field.info["last_modified_at"] = datetime.utcnow()
    finally:
        field.read_only = True
        field.save()

Example usage:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset1 = foz.load_zoo_dataset("quickstart")

dataset2 = dataset1.select_fields().clone()

# Add a `ground_truth` reference field to `dataset2` linked to `dataset1`
create_reference_field(dataset2, dataset1, "ground_truth")

assert len(list_reference_fields(dataset1)) == 0
assert len(list_reference_fields(dataset2)) == 1
assert len(dataset2.count_values("ground_truth.detections.label")) > 1
assert not check_reference_field(dataset2, "ground_truth")

# Delete some labels
del_view = dataset1.filter_labels("ground_truth", F("label") != "person")
dataset1.delete_labels(fields="ground_truth", view=del_view)

assert check_reference_field(dataset2, "ground_truth")

# Sync the reference field
update_reference_field(dataset2, "ground_truth")

assert len(dataset2.count_values("ground_truth.detections.label")) == 1
assert not check_reference_field(dataset2, "ground_truth")
