Expand python APIs to support new data API concepts #7455

Closed
jleibs opened this issue Sep 19, 2024 · 4 comments · Fixed by #7680
Labels
enhancement New feature or request 🐍 Python API Python logging API ⛃ re_datastore affects the datastore itself

Comments


jleibs commented Sep 19, 2024

Updated Proposal:

Improved concept definitions

  • Recording -- The logical representation of Rerun-sourced data. Physically backed by a chunk store.
  • View -- A subset of a Recording, restricted to a specific timeline, set of entities, and/or archetypes/components.
    • Mentally aligned with Views as depicted in the viewer. The idea of "creating a table view from a space view" should not be a stretch.
    • The definition of a view fully defines the schema and columns contained in the dataset.
    • Prior to filtering, a View contains one row for every log call that produced data included in the view (or multiple rows for send_columns calls).
  • Filter -- A mechanism of restricting the rows that are contained in a view.
    • Filters never do data transformation. The schema is unchanged. Each individual row is included or excluded in its entirety.
    • .filter(TimeRange(start=..., end=...)) or maybe .filter_range(start=..., end=...)
  • Re-indexing -- A mechanism of defining the index values for an alternative set of rows
    • view.using_index_values(values: ArrayLike)
  • LatestAt -- A mechanism of populating missing data from a given row
  • Select -- A mechanism of choosing a specific subset of columns from the query result
    • Does not change how many rows are returned, but lets the user specify the exact subset (and order) of the columns they are interested in retrieving.
    • .select(Timeline(), "Translation3D")

Python APIs

# Raises if number of recordings != 1
recording: Recording = rr.data.load_recording("foo.rrd")

archive = rr.data.load_archive("foo.rrd")
recording = archive.recordings[0]

contents = {
  "world/points/**": [rr.Image],
  "metrics/**": [rr.Scalar]
}

contents = "world/**"

class Recording:
    def view(self, index: str, contents: ContentLike) -> RecordingView: ...

class RecordingView:
    def filter_range(self, min: float, max: float) -> RecordingView: ...
    def filter_times(self, times: ArrayLike) -> RecordingView: ...
    def filter_events_for(self, column: AnyColumn) -> RecordingView: ...

    def using_index_values(self, times: ArrayLike) -> RecordingView: ...

    def latest_at(self) -> RecordingView: ...

    def select(self, *columns: AnyColumn) -> Iterator[pa.RecordBatch]: ...

    # Do we need this?
    def select_all(self) -> Iterator[pa.RecordBatch]: ...


# Alternative
expr = QueryExpression(...)

recording = load_recording(...)

recording.query(expr).select(...)
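
For illustration, an end-to-end flow with the proposed names might look like this (the timeline name, entity paths, and column names are invented; a sketch of the proposal, not a final API):

```python
import rerun as rr

recording = rr.data.load_recording("foo.rrd")

# Build a view indexed on a hypothetical "frame" timeline,
# restricted to a subset of the entity tree.
view = recording.view(index="frame", contents="world/**")

# Filters only drop rows; the schema is unchanged.
view = view.filter_range(min=0, max=100)

# Select streams Arrow record batches for the requested columns.
for batch in view.select("frame", "world/points:Position3D"):
    print(batch.num_rows)
```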

Original Proposal (archive):

Notes from exploration:

  • EntityPathFilter expression is redundant with column-selection
  • Different types of column filtering are very useful for exploration
  • Select operators feel natural for incrementally filtering a dataset independently of the latest-at / range queries
    • Would be nice to just chain these in, or re-use a selection for multiple queries.
  • Implicit empty columns can be surprising. Some filter operations should probably return errors if the columns are not present.
  • Ultimately need a way to access deterministic/named columns in the context of the resulting Table
    • Allowing a user-set name in the ComponentSelector
  • Would be nice to group together timeline + range like we do in some other places in rust
  • POV component is confusing
    • Really need a sampled multi-latest-at
    • Could be more clearly expressed as something like:
      samples = dataset.logged_at(timeline, columns)
      dataframe = dataset.latest_at(samples, columns)
      
      This has the dual benefit that it allows users to provide their own samples, or to choose sample-times for multiple columns.

Proposals

Start with python refinement, and then back-propagate into rust if we like it.

Selections

The python Dataset object will internally track a set of columns that will be used for all queries along with an Arc<ChunkStore>.

Introduce new select_ variant APIs on the Dataset:

  • dataset.select_entities(expr: str) -> Dataset
    • NOTE: only mutates component-columns; no-op on control/time columns
  • dataset.select_components(components: Sequence[ComponentLike]) -> Dataset
    • NOTE: only mutates component-columns; no-op on control/time columns
  • dataset.select_columns(column_selectors: Sequence[ColumnSelector]) -> Dataset

Each of these can strictly filter/mutate the active set of descriptors relative to the previous step: the first selection draws from the complete set, and each incremental selection only selects from the remaining set.
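
As a sketch of that narrowing behavior (the entity path and component type are invented for illustration):

```python
dataset = rr.dataframe.load_recording("example.rrd")

# Each select_* call can only narrow the active column set further.
narrowed = (
    dataset
    .select_entities("world/**")                    # component columns under world/
    .select_components([rr.components.Position3D])  # of those, only Position3D
)
```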

LatestAtQuery and RangeQuery

Our TimeType ambiguity continues to torment us.

The most ergonomic is clearly an API that looks like:

  • LatestAtQuery(timeline: str, at: int | float)
  • RangeQuery(timeline: str, min: int | float, max: int | float)

The big challenge here is that sane-looking APIs are ambiguous without knowledge of the timeline.

Concretely:

  • LatestAtQuery(timeline, 2.0) needs to map to the TimeInt 2 if the timeline is a sequence, 2000000000 if the timeline is temporal and the user is thinking in seconds, and 2 if the timeline is temporal and the user is thinking in nanos.

TODO: Still not sure what the right answer is here.

If we follow precedent from TimeRangeBoundary, this ends up looking something like:

Choice A

latest_at = rr.LatestAt("frame", rr.TimeValue(seq=42))
latest_at = rr.LatestAt("time", rr.TimeValue(seconds=42.5))
range = rr.Range("frame", min=rr.TimeValue(seq=10), max=rr.TimeValue(seq=17))

Choice B, with some parameter-exploding, could be simplified down to:

latest_at = rr.LatestAt("frame", seq=42)
latest_at = rr.LatestAt("time", seconds=42.5)
range = rr.Range("frame", min_seq=10, max_seq=17)

# Not sure if we let the users do stupid things
range = rr.Range("time", min_seconds=10.0, max_nanos=17000000000)

Choice C, diverging from what we do in TimeRangeBoundary:

latest_at = rr.dataframe.LatestAt.sequence("frame", 42)
latest_at = rr.dataframe.LatestAt.seconds("time", 42.5)
range = rr.dataframe.Range.sequence("frame", min=10, max=17)
range = rr.dataframe.Range.seconds("time", min=10.0, max=17)

Queries

Since the selection is now carried with the Dataset, you can now execute a query directly without providing columns.

  • dataset.latest_at_query(latest_at: LatestAt)
  • dataset.range_query(range: Range, pov: ComponentSelector)

This means you can write a query like:

opf = rr.dataframe.load_recording("datasets/open_photogrammetry_format.rrd")

range = rr.dataframe.Range.sequence("image", min=10, max=17)

# This part is still annoying
pov = rr.dataframe.ColumnSelector("world/cameras/rgb", rr.components.Blob)

df = opf.select_entities("world/cameras/**").range_query(range, pov)

Column Naming

Selectors/Descriptors will be given a name.

This name will default to one of:

  • Control:RowId
  • Timeline:<timeline_name>
  • <entity_path>:<component_short_name>

When specifying a component selector, users have the option to call .rename_as() to change the name of the component.

These names are also valid INPUT to a ColumnSelector.

For example:

image = rr.dataframe.ColumnSelector("/world/camera/rgb:Blob").rename_as("image")

df = opf.select_columns(image).latest_at_query(query)

df = opf.select_columns(image).range_query(query, pov=image)

nikolausWest commented Sep 19, 2024

Awesome writeup!

The big challenge here is that sane-looking APIs are ambiguous without knowledge of the timeline.

One other possible direction to go here: What if you actually express the range or latest at query as an operation on the TimeColumnDescriptor? At that point you know what type it has in the store and can make it ergonomic. It's also symmetric in some way with how we handle other columns. It also avoids mistakes like doing rr.dataframe.Range.seconds("time", min=10.0, max=17) when "time" was in fact a poorly named sequence timeline.

(this is kind of a half baked idea so likely mega annoying in some obvious way)
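
A rough sketch of that direction (the time_column accessor and the methods on the returned descriptor are invented here, purely to illustrate the idea):

```python
# Hypothetical: the descriptor knows the column's TimeType,
# so value types and units are unambiguous at call time.
frame_col = dataset.time_column("frame")  # a sequence timeline
time_col = dataset.time_column("time")    # a temporal timeline

latest_at = frame_col.latest_at(42)             # plain int: a sequence index
range_q = time_col.range(seconds=(10.0, 17.0))  # units explicit per call
```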


jleibs commented Sep 19, 2024

At that point you know what type it has in the store and can make it ergonomic.

I think it half-solves the problem, but it still doesn't actually handle seconds vs nanoseconds on a time-typed column, which is still its own problem. I think to do that we would need to introduce some kind of "natural units" metadata on the column, but that's also awkward and error prone.

I'm hesitant to pull in something like https://pypi.org/project/custom-literals/, but that's of course the kind of behavior that would really be nice.

nikolausWest commented

Something I'm wondering about is how we handle multiple recording IDs here.

Multiple rrds could all have the same recording ID, so something like this makes sense:

recording = rr.data.load_recording("first.rrd", "second.rrd")

However, we can't know up front if those files contain one or two recording IDs. How do we handle that?

  • rr.data.load_recording returns a list
  • rr.data.load_recording raises if there are multiple recordings
  • rr.data.load_recording only takes a single file and you can merge them in the sdk instead.
  • rr.data.load_recording adds a dictionary-encoded column with the recording ID

The same goes for application IDs and any future user-defined IDs.
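
For comparison, a hypothetical sketch of the first option (the recording_id accessor is invented):

```python
# If load_recording returned a list, one entry per distinct recording ID:
recordings = rr.data.load_recording("first.rrd", "second.rrd")
by_id = {rec.recording_id(): rec for rec in recordings}
```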


jleibs commented Sep 26, 2024

Reminder: we still need an API for filtering out all-empty columns, e.g. unused transform components, indicator components, etc., in the select_all() context.
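
As a stopgap, something like this can already be done client-side on a query result with plain pyarrow (not a proposed API, just an illustration):

```python
import pyarrow as pa

def drop_empty_columns(table: pa.Table) -> pa.Table:
    # Keep only columns that contain at least one non-null value.
    keep = [
        name
        for name, col in zip(table.column_names, table.columns)
        if col.null_count < len(col)
    ]
    return table.select(keep)
```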

abey79 added a commit that referenced this issue Oct 3, 2024
…he new query property (#7516)

### What

This PR introduces a new `DataframeQueryV2` view property archetype
which models the query according to the new dataframe API design
(#7455) and the feature we
actually want to support in the dataframe view
(#7497).

At this point, the new archetype is **NOT** used yet. It just lives
alongside the previous iteration, which is still used by the actual
view. The swap will occur later.

<hr>

Part of a series to address #6896 and #7498.

All PRs:
- #7515
- #7516
- #7527 
- #7545
- #7551
- #7572
- #7573

jleibs added a commit that referenced this issue Oct 4, 2024
### What
- First pass at implementing APIs for:
#7455
- Introduces a new mechanism for directly exposing rust types into the
python bridge via a .pyi definition

Example notebook for testing
```
pixi run py-build-examples
pixi run -e examples jupyter notebook tests/python/dataframe/examples.ipynb
```

### Future work:
- More docs / help strings
- Remaining API features

@jleibs jleibs closed this as completed in b69be17 Oct 11, 2024