
[FR] Samples which extend samples from other datasets #5275

Open
jnewb1 opened this issue Dec 13, 2024 · 3 comments
Labels
feature Work on a feature request

Comments

@jnewb1
Contributor

jnewb1 commented Dec 13, 2024

Proposal Summary

I would like to be able to have samples which extend other samples in another dataset. For instance, we have images which have different types of annotations or labels. Ideally, our base dataset contains all the raw images, and every other dataset simply references these entries. This would allow us to modify our base dataset and have those changes reflected in all of the other annotated datasets. Currently, we have to copy every sample to every dataset, including metadata, embeddings, etc., which means our database gets quite large.

What areas of FiftyOne does this feature affect?

  • App: FiftyOne application
  • Core: Core fiftyone Python library
  • Server: FiftyOne server

Details

I think there should be a sample type called fo.SampleReference, where you pass in an existing sample as well as any additional primitives/labels that are specific to the new sample. When you query the dataset, you get a sample with both the base labels and any labels stored on the SampleReference. Any fields from the base sample should be read-only.

For instance:
  • Dataset 1: Base - contains the raw images and ImageMetadata
  • Dataset 2: Labels - contains a reference to a sample in Base, plus detection labels from one model (say, YOLO)
  • Dataset 3: Labels2 - contains a reference to a sample in Base, plus detection labels from a different model

Now two parties can independently work on and improve the two datasets while sharing the same base dataset. If you add clip_vit_base32 embeddings to the base dataset, both datasets can use them for filtering, searching, etc. And if your Labels2 dataset is quite small (maybe 1/10th the size of Labels), you get much better query performance than if you had combined everything into a single dataset.
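To make the proposal concrete, a hypothetical API could look something like this (pseudocode; `fo.SampleReference` does not exist today, and all names here are illustrative):

```
import fiftyone as fo

base = fo.load_dataset("base")     # Dataset 1: raw images + ImageMetadata
labels = fo.Dataset("labels")      # Dataset 2: model detections

for sample in base:
    # Hypothetical: wrap the base sample and attach new, dataset-local fields
    ref = fo.SampleReference(
        sample,
        predictions=fo.Detections(...),  # labels specific to this dataset
    )
    labels.add_sample(ref)

# Querying `labels` would yield merged samples: base fields (read-only)
# plus the fields stored on the reference
sample = labels.first()
sample.metadata       # resolved from the base dataset
sample.predictions    # stored on the reference
```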

Willingness to contribute

The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently
  • Yes. I would be willing to contribute this feature with guidance from the FiftyOne community
  • No. I cannot contribute this feature at this time
@brimoor
Contributor

brimoor commented Dec 15, 2024

Hi @jnewb1 👋

The motivation behind this feature request definitely makes sense. You have a single source of truth for certain data and you want this to be available (and shared) on multiple other datasets without having to manually keep the downstream datasets in sync with the source dataset.

Suppose dataset2 inherits some fields from dataset1 by reference.

Some design questions:

  1. Should these fields be editable on dataset2?
    • If so, then this should be immediately reflected on dataset1
    • If not, then reference fields would need to be read-only on dataset2
  2. Should reference fields be deletable on dataset2? (ie dataset2.delete_sample_field(ref_field))
    • Presumably this would not delete the underlying field from dataset1

Adding fields to a dataset also carries some interface implications that must be satisfied in order for the dataset to work with the rest of the FiftyOne data model:

  1. Reference fields must be reflected in the dataset's schema (ie dataset2.get_field_schema() should include all reference fields)
  2. If a reference field is deleted from dataset1, it should be immediately deleted from dataset2.get_field_schema()
  3. If a dataset with reference fields is loaded in the App, the reference fields should appear in the sidebar
  4. Reference fields must be included in Dataset._pipeline() and DatasetView._pipeline() so that dataset2.iter_samples() and dataset2.view().iter_samples() will include the fields
  5. Reference fields must be visualizable in the sample grid
  6. Users must be able to create views (eg dataset2.filter_labels(ref_field, ...) and dataset2.to_patches(ref_field)) on reference fields. This more or less happens for free if Implication 4 is satisfied
  7. Users must be able to run aggregations (eg dataset2.bounds(ref_field) or dataset2.count_values(ref_field)) on reference fields. This also more or less happens for free if Implication 4 is satisfied

Some observations:

  • Implication 2 would take some effort, but is doable.
  • Implication 3 basically happens "for free" as long as Implication 1 is true.
  • Implications 5-7 basically happen "for free" as long as Implication 4 is true.

Implication 4 is the concerning one. The natural way to achieve it would be to prepend a $lookup stage to the pipelines to pull in reference fields. But $lookup is generally slow in MongoDB and may have negative performance implications, such as losing the ability (or requiring extra work) to leverage database indexes for fast querying.
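For reference, the prepended stage described above might look roughly like the following (a sketch, assuming the referenced values live in dataset1's backing sample collection and can be joined on `filepath`; the collection name is illustrative):

```python
# Sketch of a $lookup stage that could be prepended to dataset2's
# aggregation pipeline to pull a reference field in from dataset1's
# sample collection. Collection name and join key are assumptions.
lookup_stage = {
    "$lookup": {
        "from": "samples.dataset1",   # dataset1's backing collection
        "localField": "filepath",     # join key on dataset2
        "foreignField": "filepath",   # join key on dataset1
        "as": "_ref",                 # temporary array of matches
    }
}

# Follow-up stages to promote the joined field and drop the temp array
set_stage = {"$set": {"ground_truth": {"$first": "$_ref.ground_truth"}}}
unset_stage = {"$unset": "_ref"}

pipeline = [lookup_stage, set_stage, unset_stage]
```

Even in this simple form, the join happens per-sample at query time, which is why index usage becomes the key performance question.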

TLDR, this will be a complex feature to properly implement 🤓

@jnewb1
Contributor Author

jnewb1 commented Dec 15, 2024

Hi, thanks for the comment :)

I agree it will be somewhat complex. I made a basic PR over in #5277 which passes some simple tests and uses a reference field. I'd like to do more benchmarking to see what impact the $lookup has.
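A minimal harness for that kind of benchmarking could look like this (a sketch; `run_query` is a placeholder for whatever view or aggregation is being timed, e.g. a count_values call against the dataset with and without the $lookup):

```python
import statistics
import time

def benchmark(fn, repeats=5):
    """Times `fn` several times and returns (median, best) seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)

# Placeholder for the query under test, e.g.:
#   run_query = lambda: dataset2.count_values("ground_truth.detections.label")
run_query = lambda: sum(range(10_000))

median_s, best_s = benchmark(run_query)
print(f"median={median_s:.6f}s best={best_s:.6f}s")
```

Comparing the median across several repeats helps smooth out cache and connection-warmup effects when timing database-backed queries.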

  • I think any fields that are referenced should be read-only and not deletable. If you want to modify them, you should be forced to modify the base dataset.
  • I think the reference dataset should only store fields that are added to the SampleReference, and refer to the base dataset for any additional fields. That way the schema will always be the most up-to-date version.
  • Other than the fact that they are read-only, I think the fields should appear just the same as normal fields to any downstream functionality of FiftyOne, such as the App, aggregations, etc.

@brimoor
Contributor

brimoor commented Dec 15, 2024

Here's a functional version of read-only reference fields, implemented by copying the values rather than dynamically looking them up.


The benefit of this approach is that it's minimally complex to implement and as fast as possible to use. The downside, of course, is that you have to manually check + sync the reference field whenever the source dataset is updated.

from datetime import datetime

import fiftyone as fo

def create_reference_field(dataset, src_dataset, ref_field):
    """Adds a read-only `ref_field` to `dataset` whose values are sourced from
    the `ref_field` of `src_dataset`.
    """
    values = dict(zip(*src_dataset.values(["filepath", ref_field])))
    dataset.set_values(ref_field, values, key_field="filepath")
    field = dataset.get_field(ref_field)
    field.read_only = True
    field.info = {
        "source_dataset": src_dataset.name,
        "last_modified_at": datetime.utcnow(),
    }
    field.save()


def list_reference_fields(dataset):
    """Lists the reference fields on the given dataset."""
    return [
        path
        for path, field in dataset.get_field_schema().items()
        if "source_dataset" in (field.info or {})
    ]


def check_reference_field(dataset, ref_field):
    """Returns True/False whether the reference field needs updating."""
    field = dataset.get_field(ref_field)
    src_dataset = fo.load_dataset(field.info["source_dataset"])
    return field.info["last_modified_at"] < src_dataset.max("last_modified_at")


def update_reference_field(dataset, ref_field):
    """Updates the reference field on the dataset with the current values from
    the source dataset.
    """
    field = dataset.get_field(ref_field)
    field.read_only = False
    field.save()

    try:
        field = dataset.get_field(ref_field)
        src_dataset = fo.load_dataset(field.info["source_dataset"])
        values = dict(zip(*src_dataset.values(["filepath", ref_field])))
        dataset.set_values(ref_field, values, key_field="filepath")
        field.info["last_modified_at"] = datetime.utcnow()
    finally:
        field.read_only = True
        field.save()

Example usage:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset1 = foz.load_zoo_dataset("quickstart")

dataset2 = dataset1.select_fields().clone()

# Add a `ground_truth` reference field to `dataset2` linked to `dataset1`
create_reference_field(dataset2, dataset1, "ground_truth")

assert len(list_reference_fields(dataset1)) == 0
assert len(list_reference_fields(dataset2)) == 1
assert len(dataset2.count_values("ground_truth.detections.label")) > 1
assert not check_reference_field(dataset2, "ground_truth")

# Delete some labels
del_view = dataset1.filter_labels("ground_truth", F("label") != "person")
dataset1.delete_labels(fields="ground_truth", view=del_view)

assert check_reference_field(dataset2, "ground_truth")

# Sync the reference field
update_reference_field(dataset2, "ground_truth")

assert len(dataset2.count_values("ground_truth.detections.label")) == 1
assert not check_reference_field(dataset2, "ground_truth")
