Tagged components milestone 1: end-to-end tagging #7948

teh-cmc · 2024-10-30T15:34:13Z

Part of Tagged components #6889

The first milestone for tagged components will be for the tags to be carried through end-to-end.
That is:

You log some data to a recording, using any of our SDKs.
- The proper tags get automagically injected into the data.
You then query that recording via the dataframe API.
- You get the automagically injected tags back.

This implies, among other things, that codegen is properly set up and that all the data path is just smart enough to be able to carry all that information across the entire pipeline.

This does not cover:

Being able to log the same component multiple times to the same entity, even with different tags.
Being able to run tag-based queries.
Using the tags in place of indicators.
Interned ComponentDescriptors

Each of the following limitations are upcoming miltestones of their own.

The tags we're talking about are the following:

rerun.component_name aka sorbet.semantic_type
rerun.archetype_name aka sorbet.semantic_family
rerun.archetype_field_name aka sorbet.logical_type

The text was updated successfully, but these errors were encountered:

I had to give up on the idea of splitting this thing into neat little PRs -- the enormous amount of extra work needed in this case is just not worth it, it's not even close (turns out changing the definition of `Component` has cascading consequences 😶). I'll add a thorough description of what's going on to compensate, and can walk someone through this if needed. --- ### Goals and non-goals The goal of this PR is to get component tags in, store them, and then get them out. The goal of this PR is _not_ to port every single bit of component-name based logic to component-descriptor based logic (including but certainly not limited to datastore queries). That will be the next step: #8293. ### Types and traits First and foremost, this ofc introduces the new `ComponentDescriptor` type: ```rust /// A [`ComponentDescriptor`] fully describes the semantics of a column of data. /// /// Every component is uniquely identified by its [`ComponentDescriptor`]. #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)] pub struct ComponentDescriptor { /// Optional name of the `Archetype` associated with this data. /// /// `None` if the data wasn't logged through an archetype. /// /// Example: `rerun.archetypes.Points3D`. pub archetype_name: Option<ArchetypeName>, /// Optional name of the field within `Archetype` associated with this data. /// /// `None` if the data wasn't logged through an archetype. /// /// Example: `positions`. pub archetype_field_name: Option<ArchetypeFieldName>, /// Semantic name associated with this data. /// /// This is fully implied by `archetype_name` and `archetype_field`, but /// included for semantic convenience. /// /// Example: `rerun.components.Position3D`. pub component_name: ComponentName, } ``` Note that this is a _Rerun_ type, not a _Sorbet_ type: i.e. it uses Rerun terminology (archetypes, fields, etc), not Sorbet terminology. As is now tradition, this terminology gets translated into its Sorbet equivalent when leaving the land of internal Chunks for the land of external RecordBatches and Dataframes. `Component`s are now uniquely identified by a `ComponentDescriptor` rather than a `ComponentName`: ```rust /// A [`Component`] describes semantic data that can be used by any number of [`Archetype`]s. /// /// Implementing the [`Component`] trait automatically derives the [`ComponentBatch`] implementation, /// which makes it possible to work with lists' worth of data in a generic fashion. pub trait Component: Loggable { /// Returns the complete [`ComponentDescriptor`] for this [`Component`]. /// /// Every component is uniquely identified by its [`ComponentDescriptor`]. // // NOTE: Builtin Rerun components don't (yet) have anything but a `ComponentName` attached to // them (other tags are injected at the Archetype level), therefore having a full // `ComponentDescriptor` might seem overkill. // It's not: // * Users might still want to register Components with specific tags. // * In the future, `ComponentDescriptor`s will very likely cover than Archetype-related tags // (e.g. generics, metric units, etc). fn descriptor() -> ComponentDescriptor; /// The fully-qualified name of this component, e.g. `rerun.components.Position2D`. /// /// This is a trivial but useful helper for `Self::descriptor().component_name`. /// /// The default implementation already does the right thing: do not override unless you know /// what you're doing. /// `Self::name()` must exactly match the value returned by `Self::descriptor().component_name`, /// or undefined behavior ensues. // // TODO(cmc): The only reason we keep this around is for convenience, and the only reason we need this // convenience is because we're still in this weird half-way in-between state where some things // are still indexed by name. Remove this entirely once we've ported everything to descriptors. #[inline] fn name() -> ComponentName { Self::descriptor().component_name } } ``` `Component::name` still exists for now, as a convenience during the interim (that is, until we propagate `ComponentDescriptor` to every last corner of the app). `MaybeOwnedComponentBatch` now has the possibility to augment and/or fully-override the `ComponentDescriptor` of the data within: ```rust /// Some [`ComponentBatch`], optionally with an overridden [`ComponentDescriptor`]. /// /// Used by implementers of [`crate::AsComponents`] to both efficiently expose their component data /// and assign the right tags given the surrounding context. pub struct MaybeOwnedComponentBatch<'a> { /// The component data. pub batch: ComponentBatchCow<'a>, /// If set, will override the [`ComponentBatch`]'s [`ComponentDescriptor`]. pub descriptor_override: Option<ComponentDescriptor>, } ``` This is a crucial part of the story, as this is how e.g. archetypes inject their own tags when component data gets logged on their behalf. ### Override model The override model is simple: * Every `Component` has an associated `ComponentDescriptor`. * Every `ComponentBatch` inherits from its underlying `Component`'s `ComponentDescriptor`. * `AsComponents` has an opportunity to override each `ComponentBatch`'s `ComponentDescriptor` (by means of `MaybeOwnedComponentBatch`. The goal is to try and carry those semantics over the two other SDKs (Python, C++), while somehow keeping changes to a minimum. ### Undefined behavior Logging the same component multiple times on a single entity (e.g. by logging different archetypes that share parts of their definitions) has always been, for all intents and purposes, UB. This PR propagates descriptors just enough to get things up and running, no more no less. By which I mean that it is possible to get component tags in and out of the system, but many things still assume that `Component`s are uniquely identified by their names. This means that some part of the codebase are still indexing things by name, while others index by descriptor. Where these parts meet, what was UB before is even more UB now, as we generally just pick one random component among the ones available. You'll see a lot of `get_first_component` in the code: every single one of those is UB if there are multiple components under the same name (for now!). Debug builds assert for duplicated components, until we properly use descriptors everywhere (remember: nothing should ever be indexed by `ComponentName` in the future). ### Fully-qualified component names & column paths `ComponentDescriptor` defines its fully-qualified name as such: ```rust match (archetype_name, component_name, archetype_field_name) { (None, component_name, None) => component_name.to_owned(), (Some(archetype_name), component_name, None) => { format!("{archetype_name}:{component_name}") } (None, component_name, Some(archetype_field_name)) => { format!("{component_name}#{archetype_field_name}") } (Some(archetype_name), component_name, Some(archetype_field_name)) => { format!("{archetype_name}:{component_name}#{archetype_field_name}") } } ``` which yields e.g. `rerun.archetypes.Points3D:rerun.components.Position3D#positions`, which is generally shortened to `Points3D:Position3D#positions` when there is no ambiguity. In the dataframe API, a fully-qualified column path now becomes `{entity_path}@{archetype_name}:{component_name}#{archetype_field_name}`, e.g. `/my/[email protected]:rerun.components.Position3D#positions` or `/my/points@Points3D:Position3D#positions`. This syntax needs to be debated. I have intentionally disabled the syntax in the dataframe APIs so as not to break anything external-facing. ### Transport and metadata `ArchetypeName` and `ArchetypeFieldName` are now exposed as `rerun.archetype_name` and `rerun.archetype_field_name` in `TransportChunk`'s arrow metadata. I really cannot wait for a better metadata system. ### Performance `ComponentDescriptor`s add an extra layer of mappings everywhere: we used to have `IntMap<ComponentName, T>` all over the place, now we have `IntMap<ComponentName, IntMap<ComponentDescriptor, T>>`. The extra `ComponentName` layer is needed because it is very common to want to look for anything matching a `ComponentName`, without any further tags specified. Like before, these are NoHash maps, so performance impact should be minimal (`ComponentDescriptor` implements NoHash by xor'ing everything). ### Examples / testing / roundtrips See: * docs/snippets/all/descriptors/descr_builtin_archetype.rs * docs/snippets/all/descriptors/descr_builtin_component.rs * docs/snippets/all/descriptors/descr_custom_archetype.rs * docs/snippets/all/descriptors/descr_custom_component.rs These snippets play all roles at once, as usual. In particular they make sure that all languages (well, only Rust for now, Python and C++ coming soon) carry all the right tags in all the right situations. --- * Part of #7948 --------- Co-authored-by: Andreas Reich <[email protected]>

@Wumpf

Implemented with the help of @Wumpf. This semantically mimics very closely the way things are done in Rust, minus all technical differences due to the differences between both the languages and the SDKs. For that reason, everything stated in #8304 (comment) basically applies as-is. Pretty happy about it, I must say. * DNM: requires #8304 * Part of #7948 --------- Co-authored-by: Andreas Reich <[email protected]>

This semantically mimics very closely the way things are done in Rust, minus all technical differences due to the differences between both the languages and the SDKs. For that reason, everything stated in #8304 (comment) basically applies as-is. Pretty happy about it, I must say. * DNM: requires #8316 * Part of #7948

teh-cmc added enhancement New feature or request ⛃ re_datastore affects the datastore itself 🔍 re_query affects re_query itself 🔩 data model labels Oct 30, 2024

teh-cmc self-assigned this Oct 30, 2024

teh-cmc mentioned this issue Nov 13, 2024

Tagged components milestone 2: no more indicators #8129

Open

This was referenced Dec 3, 2024

Tagged components milestone 3: descriptor-driven and partial queries #8293

Open

End-to-end tagging: Rust #8304

Merged

End-to-end tagging: C++ #8316

Merged

End-to-end tagging: Python #8298

Merged

teh-cmc closed this as completed Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tagged components milestone 1: end-to-end tagging #7948

Tagged components milestone 1: end-to-end tagging #7948

teh-cmc commented Oct 30, 2024 •

edited by emilk

Loading

Tagged components milestone 1: end-to-end tagging #7948

Tagged components milestone 1: end-to-end tagging #7948

Comments

teh-cmc commented Oct 30, 2024 • edited by emilk Loading

teh-cmc commented Oct 30, 2024 •

edited by emilk

Loading