End-to-end tagging: Rust (#8304)
I had to give up on the idea of splitting this thing into neat little
PRs -- the enormous amount of extra work needed in this case is just not
worth it, it's not even close (turns out changing the definition of
`Component` has cascading consequences 😶).

I'll add a thorough description of what's going on to compensate, and
can walk someone through this if needed.

---

### Goals and non-goals

The goal of this PR is to get component tags in, store them, and then
get them out.

The goal of this PR is _not_ to port every single bit of component-name
based logic to component-descriptor based logic (including but certainly
not limited to datastore queries).
That will be the next step:
#8293.


### Types and traits

First and foremost, this ofc introduces the new `ComponentDescriptor`
type:
```rust
/// A [`ComponentDescriptor`] fully describes the semantics of a column of data.
///
/// Every component is uniquely identified by its [`ComponentDescriptor`].
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ComponentDescriptor {
    /// Optional name of the `Archetype` associated with this data.
    ///
    /// `None` if the data wasn't logged through an archetype.
    ///
    /// Example: `rerun.archetypes.Points3D`.
    pub archetype_name: Option<ArchetypeName>,

    /// Optional name of the field within `Archetype` associated with this data.
    ///
    /// `None` if the data wasn't logged through an archetype.
    ///
    /// Example: `positions`.
    pub archetype_field_name: Option<ArchetypeFieldName>,

    /// Semantic name associated with this data.
    ///
    /// This is fully implied by `archetype_name` and `archetype_field_name`, but
    /// included for semantic convenience.
    ///
    /// Example: `rerun.components.Position3D`.
    pub component_name: ComponentName,
}
```

Note that this is a _Rerun_ type, not a _Sorbet_ type: i.e. it uses
Rerun terminology (archetypes, fields, etc), not Sorbet terminology.
As is now tradition, this terminology gets translated into its Sorbet
equivalent when leaving the land of internal Chunks for the land of
external RecordBatches and Dataframes.

`Component`s are now uniquely identified by a `ComponentDescriptor`
rather than a `ComponentName`:
```rust
/// A [`Component`] describes semantic data that can be used by any number of [`Archetype`]s.
///
/// Implementing the [`Component`] trait automatically derives the [`ComponentBatch`] implementation,
/// which makes it possible to work with lists' worth of data in a generic fashion.
pub trait Component: Loggable {
    /// Returns the complete [`ComponentDescriptor`] for this [`Component`].
    ///
    /// Every component is uniquely identified by its [`ComponentDescriptor`].
    //
    // NOTE: Builtin Rerun components don't (yet) have anything but a `ComponentName` attached to
    // them (other tags are injected at the Archetype level), therefore having a full
    // `ComponentDescriptor` might seem overkill.
    // It's not:
    // * Users might still want to register Components with specific tags.
    // * In the future, `ComponentDescriptor`s will very likely cover more than Archetype-related tags
    //   (e.g. generics, metric units, etc).
    fn descriptor() -> ComponentDescriptor;

    /// The fully-qualified name of this component, e.g. `rerun.components.Position2D`.
    ///
    /// This is a trivial but useful helper for `Self::descriptor().component_name`.
    ///
    /// The default implementation already does the right thing: do not override unless you know
    /// what you're doing.
    /// `Self::name()` must exactly match the value returned by `Self::descriptor().component_name`,
    /// or undefined behavior ensues.
    //
    // TODO(cmc): The only reason we keep this around is for convenience, and the only reason we need this
    // convenience is because we're still in this weird half-way in-between state where some things
    // are still indexed by name. Remove this entirely once we've ported everything to descriptors.
    #[inline]
    fn name() -> ComponentName {
        Self::descriptor().component_name
    }
}
```
`Component::name` still exists for now, as a convenience during the
interim (that is, until we propagate `ComponentDescriptor` to every last
corner of the app).


`MaybeOwnedComponentBatch` can now augment and/or fully override the
`ComponentDescriptor` of the data within:
```rust
/// Some [`ComponentBatch`], optionally with an overridden [`ComponentDescriptor`].
///
/// Used by implementers of [`crate::AsComponents`] to both efficiently expose their component data
/// and assign the right tags given the surrounding context.
pub struct MaybeOwnedComponentBatch<'a> {
    /// The component data.
    pub batch: ComponentBatchCow<'a>,

    /// If set, will override the [`ComponentBatch`]'s [`ComponentDescriptor`].
    pub descriptor_override: Option<ComponentDescriptor>,
}
```
This is a crucial part of the story, as this is how e.g. archetypes
inject their own tags when component data gets logged on their behalf.

### Override model

The override model is simple:
* Every `Component` has an associated `ComponentDescriptor`.
* Every `ComponentBatch` inherits from its underlying `Component`'s
`ComponentDescriptor`.
* `AsComponents` has an opportunity to override each `ComponentBatch`'s
`ComponentDescriptor` (by means of `MaybeOwnedComponentBatch`).
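
To make the three rules concrete, here is a minimal sketch of the override chain. It is not the actual `re_types_core` code: `Descriptor`, `BatchWithDescriptor` and `points3d_positions` are simplified, hypothetical stand-ins that use plain string tags.

```rust
/// Simplified stand-in for `ComponentDescriptor` (plain string tags).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Descriptor {
    pub archetype_name: Option<&'static str>,
    pub archetype_field_name: Option<&'static str>,
    pub component_name: &'static str,
}

/// Rule 1: every component type has an associated descriptor.
pub trait Component {
    fn descriptor() -> Descriptor;
}

pub struct Position3D(pub [f32; 3]);

impl Component for Position3D {
    fn descriptor() -> Descriptor {
        Descriptor {
            archetype_name: None,
            archetype_field_name: None,
            component_name: "rerun.components.Position3D",
        }
    }
}

/// Hypothetical batch type: component data plus an optional descriptor override.
pub struct BatchWithDescriptor<C: Component> {
    pub data: Vec<C>,
    pub descriptor_override: Option<Descriptor>,
}

impl<C: Component> BatchWithDescriptor<C> {
    /// Rules 2 and 3: inherit the component's descriptor unless an override is set.
    pub fn descriptor(&self) -> Descriptor {
        self.descriptor_override
            .clone()
            .unwrap_or_else(C::descriptor)
    }
}

/// What an archetype would do when logging its `positions` field: inject its own tags.
pub fn points3d_positions(data: Vec<Position3D>) -> BatchWithDescriptor<Position3D> {
    BatchWithDescriptor {
        data,
        descriptor_override: Some(Descriptor {
            archetype_name: Some("rerun.archetypes.Points3D"),
            archetype_field_name: Some("positions"),
            ..Position3D::descriptor()
        }),
    }
}
```

A batch built without an override reports the bare component descriptor, while the one returned by `points3d_positions` carries the archetype tags on top of the same component name.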

The goal is to try and carry those semantics over to the two other SDKs
(Python, C++), while somehow keeping changes to a minimum.


### Undefined behavior

Logging the same component multiple times on a single entity (e.g. by
logging different archetypes that share parts of their definitions) has
always been, for all intents and purposes, UB.

This PR propagates descriptors just enough to get things up and running,
no more no less. By which I mean that it is possible to get component
tags in and out of the system, but many things still assume that
`Component`s are uniquely identified by their names.
This means that some parts of the codebase are still indexing things by
name, while others index by descriptor. Where these parts meet, what was
UB before is even more UB now, as we generally just pick one random
component among the ones available.
You'll see a lot of `get_first_component` in the code: every single one
of those is UB if there are multiple components under the same name (for
now!).

Debug builds assert for duplicated components, until we properly use
descriptors everywhere (remember: nothing should ever be indexed by
`ComponentName` in the future).
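
As an illustration of where the ambiguity comes from, here is a rough sketch of a name-based lookup guarded by a debug assertion. `first_by_name` and this `Descriptor` are hypothetical; they are not the signatures of `get_first_component` or of the real descriptor type.

```rust
/// Minimal stand-in for `ComponentDescriptor` (string tags only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Descriptor {
    pub archetype_name: Option<&'static str>,
    pub component_name: &'static str,
}

/// Hypothetical name-based lookup: returns the first column whose descriptor
/// matches `component_name`, ignoring all other tags.
///
/// Debug builds panic if more than one column matches, since picking "the first"
/// is arbitrary in that case. Release builds silently return the first match.
pub fn first_by_name<'a, T>(
    columns: &'a [(Descriptor, T)],
    component_name: &str,
) -> Option<&'a T> {
    let mut matches = columns
        .iter()
        .filter(|(descriptor, _)| descriptor.component_name == component_name);

    let first = matches.next();
    debug_assert!(
        matches.next().is_none(),
        "multiple components named {component_name}: name-based lookup is ambiguous"
    );

    first.map(|(_, value)| value)
}
```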


### Fully-qualified component names & column paths

`ComponentDescriptor` defines its fully-qualified name as such:
```rust
match (archetype_name, component_name, archetype_field_name) {
    (None, component_name, None) => component_name.to_owned(),
    (Some(archetype_name), component_name, None) => {
        format!("{archetype_name}:{component_name}")
    }
    (None, component_name, Some(archetype_field_name)) => {
        format!("{component_name}#{archetype_field_name}")
    }
    (Some(archetype_name), component_name, Some(archetype_field_name)) => {
        format!("{archetype_name}:{component_name}#{archetype_field_name}")
    }
}
```
which yields e.g.
`rerun.archetypes.Points3D:rerun.components.Position3D#positions`, which
is generally shortened to `Points3D:Position3D#positions` when there is
no ambiguity.
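
For reference, a self-contained sketch of the same naming scheme, using plain `String`s rather than the real name types. `SimpleDescriptor`, `last_segment` and the shortening rule in `short_name` are assumptions made for illustration; the PR does not spell out how shortening is implemented.

```rust
/// Simplified stand-in for `ComponentDescriptor`: plain `String`s instead of the
/// real `ArchetypeName` / `ArchetypeFieldName` / `ComponentName` types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SimpleDescriptor {
    pub archetype_name: Option<String>,
    pub archetype_field_name: Option<String>,
    pub component_name: String,
}

/// Hypothetical helper: keeps only what follows the last `.` in a name.
fn last_segment(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

impl SimpleDescriptor {
    /// Same four cases as the `match` above.
    pub fn full_name(&self) -> String {
        let component_name = &self.component_name;
        match (&self.archetype_name, &self.archetype_field_name) {
            (None, None) => component_name.clone(),
            (Some(archetype_name), None) => format!("{archetype_name}:{component_name}"),
            (None, Some(archetype_field_name)) => {
                format!("{component_name}#{archetype_field_name}")
            }
            (Some(archetype_name), Some(archetype_field_name)) => {
                format!("{archetype_name}:{component_name}#{archetype_field_name}")
            }
        }
    }

    /// Assumed shortening rule: drop the namespace prefixes, keep the field name as-is.
    pub fn short_name(&self) -> String {
        Self {
            archetype_name: self
                .archetype_name
                .as_deref()
                .map(|name| last_segment(name).to_owned()),
            archetype_field_name: self.archetype_field_name.clone(),
            component_name: last_segment(&self.component_name).to_owned(),
        }
        .full_name()
    }
}
```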

In the dataframe API, a fully-qualified column path now becomes
`{entity_path}@{archetype_name}:{component_name}#{archetype_field_name}`,
e.g.
`/my/points@rerun.archetypes.Points3D:rerun.components.Position3D#positions`
or `/my/points@Points3D:Position3D#positions`.

This syntax needs to be debated. I have intentionally disabled the
syntax in the dataframe APIs so as not to break anything
external-facing.
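
To help the discussion, here is a small sketch of how such a column path could be split back into its parts. `ColumnPath` and `parse_column_path` are hypothetical and not part of the dataframe API; the sketch also assumes that `@`, `:` and `#` never appear inside the names themselves.

```rust
/// Hypothetical parsed form of a fully-qualified column path.
#[derive(Debug, PartialEq, Eq)]
pub struct ColumnPath<'a> {
    pub entity_path: &'a str,
    pub archetype_name: Option<&'a str>,
    pub component_name: &'a str,
    pub archetype_field_name: Option<&'a str>,
}

/// Parses `{entity_path}@{archetype_name}:{component_name}#{archetype_field_name}`,
/// where the `{archetype_name}:` prefix and the `#{archetype_field_name}` suffix
/// are both optional. Returns `None` if there is no `@` or a mandatory part is empty.
pub fn parse_column_path(path: &str) -> Option<ColumnPath<'_>> {
    // The entity path is everything before the last `@`.
    let (entity_path, rest) = path.rsplit_once('@')?;

    // Optional `#field` suffix.
    let (rest, archetype_field_name) = match rest.rsplit_once('#') {
        Some((head, field)) => (head, Some(field)),
        None => (rest, None),
    };

    // Optional `archetype:` prefix.
    let (archetype_name, component_name) = match rest.split_once(':') {
        Some((archetype, component)) => (Some(archetype), component),
        None => (None, rest),
    };

    if entity_path.is_empty() || component_name.is_empty() {
        return None;
    }

    Some(ColumnPath {
        entity_path,
        archetype_name,
        component_name,
        archetype_field_name,
    })
}
```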


### Transport and metadata

`ArchetypeName` and `ArchetypeFieldName` are now exposed as
`rerun.archetype_name` and `rerun.archetype_field_name` in
`TransportChunk`'s arrow metadata.
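
Roughly, the per-column metadata ends up looking like the output of the sketch below. Only the two key names come from this PR; `descriptor_metadata` is a hypothetical helper, a `BTreeMap` stands in for the arrow metadata map, and omitting absent tags is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: builds the string metadata for one column.
///
/// Assumption: tags that are `None` are simply left out of the map.
pub fn descriptor_metadata(
    archetype_name: Option<&str>,
    archetype_field_name: Option<&str>,
) -> BTreeMap<String, String> {
    let mut metadata = BTreeMap::new();

    if let Some(archetype_name) = archetype_name {
        metadata.insert("rerun.archetype_name".to_owned(), archetype_name.to_owned());
    }
    if let Some(archetype_field_name) = archetype_field_name {
        metadata.insert(
            "rerun.archetype_field_name".to_owned(),
            archetype_field_name.to_owned(),
        );
    }

    metadata
}
```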

I really cannot wait for a better metadata system.


### Performance

`ComponentDescriptor`s add an extra layer of mappings everywhere: we
used to have `IntMap<ComponentName, T>` all over the place, now we have
`IntMap<ComponentName, IntMap<ComponentDescriptor, T>>`.
The extra `ComponentName` layer is needed because it is very common to
want to look for anything matching a `ComponentName`, without any
further tags specified.

Like before, these are NoHash maps, so performance impact should be
minimal (`ComponentDescriptor` implements NoHash by xor'ing everything).
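
A sketch of the two-level layout, to show why the outer `ComponentName` layer is useful. This uses std `HashMap`/`BTreeMap` and string tags instead of `IntMap` and the real types, and `PerComponent` is a hypothetical wrapper, so it says nothing about actual performance.

```rust
use std::collections::{BTreeMap, HashMap};

/// Simplified stand-in for `ComponentDescriptor` (plain string tags).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Descriptor {
    pub archetype_name: Option<&'static str>,
    pub archetype_field_name: Option<&'static str>,
    pub component_name: &'static str,
}

/// Hypothetical two-level index: component name first, full descriptor second.
pub struct PerComponent<T> {
    inner: HashMap<&'static str, BTreeMap<Descriptor, T>>,
}

impl<T> PerComponent<T> {
    pub fn new() -> Self {
        Self {
            inner: HashMap::new(),
        }
    }

    /// Inserts a value under its full descriptor, returning the previous value if any.
    pub fn insert(&mut self, descriptor: Descriptor, value: T) -> Option<T> {
        self.inner
            .entry(descriptor.component_name)
            .or_default()
            .insert(descriptor, value)
    }

    /// Exact lookup: every tag has to match.
    pub fn get(&self, descriptor: &Descriptor) -> Option<&T> {
        self.inner.get(descriptor.component_name)?.get(descriptor)
    }

    /// Name-only lookup: everything stored under this component name, whatever the tags.
    pub fn get_by_name(&self, component_name: &str) -> Vec<(&Descriptor, &T)> {
        self.inner
            .get(component_name)
            .into_iter()
            .flat_map(|per_descriptor| per_descriptor.iter())
            .collect()
    }
}
```

With a flat `Map<Descriptor, T>`, the name-only query would have to scan every entry; the outer layer turns it into a single lookup.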


### Examples / testing / roundtrips

See:
* docs/snippets/all/descriptors/descr_builtin_archetype.rs
* docs/snippets/all/descriptors/descr_builtin_component.rs
* docs/snippets/all/descriptors/descr_custom_archetype.rs
* docs/snippets/all/descriptors/descr_custom_component.rs

These snippets play all roles at once, as usual. In particular they make
sure that all languages (well, only Rust for now, Python and C++ coming
soon) carry all the right tags in all the right situations.

---

* Part of #7948

---------

Co-authored-by: Andreas Reich <[email protected]>
teh-cmc and Wumpf authored Dec 9, 2024
1 parent b3e35a3 commit 67a3cac
Showing 345 changed files with 18,599 additions and 11,664 deletions.
1 change: 1 addition & 0 deletions Cargo.lock
@@ -8024,6 +8024,7 @@ dependencies = [
"re_build_tools",
"rerun",
"rust-format",
"similar-asserts",
]

[[package]]
179 changes: 114 additions & 65 deletions crates/build/re_types_builder/src/codegen/rust/api.rs
@@ -16,7 +16,7 @@ use crate::{
should_optimize_buffer_slice_deserialize,
},
serializer::quote_arrow_serializer,
util::{is_tuple_struct_from_obj, iter_archetype_components, quote_doc_line},
util::{is_tuple_struct_from_obj, quote_doc_line},
},
Target,
},
@@ -184,8 +184,8 @@ fn generate_object_file(
code.push_str("use ::re_types_core::external::arrow2;\n");
code.push_str("use ::re_types_core::SerializationResult;\n");
code.push_str("use ::re_types_core::{DeserializationResult, DeserializationError};\n");
code.push_str("use ::re_types_core::ComponentName;\n");
code.push_str("use ::re_types_core::{ComponentBatch, MaybeOwnedComponentBatch};\n");
code.push_str("use ::re_types_core::{ComponentDescriptor, ComponentName};\n");
code.push_str("use ::re_types_core::{ComponentBatch, ComponentBatchCowWithDescriptor};\n");

// NOTE: `TokenStream`s discard whitespacing information by definition, so we need to
// inject some of our own when writing to file… while making sure that don't inject
@@ -354,13 +354,13 @@ fn quote_struct(
#quoted_deprecation_notice
#quoted_struct

#quoted_heap_size_bytes
#quoted_trait_impls

#quoted_from_impl

#quoted_trait_impls

#quoted_builder

#quoted_heap_size_bytes
};

tokens
@@ -470,9 +470,9 @@ fn quote_union(
#(#quoted_fields,)*
}

#quoted_heap_size_bytes

#quoted_trait_impls

#quoted_heap_size_bytes
};

tokens
@@ -603,6 +603,18 @@ fn quote_enum(
#(#quoted_fields,)*
}

#quoted_trait_impls

// We implement `Display` to match the `PascalCase` name so that
// the enum variants are displayed in the UI exactly how they are displayed in code.
impl std::fmt::Display for #name {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
#(#display_match_arms,)*
}
}
}

impl ::re_types_core::reflection::Enum for #name {

#[inline]
@@ -629,18 +641,6 @@ fn quote_enum(
true
}
}

// We implement `Display` to match the `PascalCase` name so that
// the enum variants are displayed in the UI exactly how they are displayed in code.
impl std::fmt::Display for #name {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
#(#display_match_arms,)*
}
}
}

#quoted_trait_impls
};

tokens
@@ -1006,14 +1006,16 @@ fn quote_trait_impls_for_datatype_or_component(
quote! {
impl ::re_types_core::Component for #name {
#[inline]
fn name() -> ComponentName {
#fqname.into()
fn descriptor() -> ComponentDescriptor {
ComponentDescriptor::new(#fqname)
}
}
}
});

quote! {
#quoted_impl_component

::re_types_core::macros::impl_into_cow!(#name);

impl ::re_types_core::Loggable for #name {
@@ -1033,8 +1035,6 @@ fn quote_trait_impls_for_datatype_or_component(

#quoted_from_arrow2
}

#quoted_impl_component
}
}

@@ -1048,40 +1048,70 @@ fn quote_trait_impls_for_archetype(obj: &Object) -> TokenStream {
assert_eq!(kind, &ObjectKind::Archetype);

let display_name = re_case::to_human_case(name);
let archetype_name = &obj.fqname;
let name = format_ident!("{name}");

fn compute_components(
fn compute_component_descriptors(
obj: &Object,
attr: &'static str,
extras: impl IntoIterator<Item = String>,
requirement_attr_value: &'static str,
) -> (usize, TokenStream) {
let components = iter_archetype_components(obj, attr)
.chain(extras)
// Do *not* sort again, we want to preserve the order given by the datatype definition
.collect::<Vec<_>>();
let descriptors = obj
.fields
.iter()
.filter_map(move |field| {
field
.try_get_attr::<String>(requirement_attr_value)
.map(|_| {
let Some(component_name) = field.typ.fqname() else {
panic!("Archetype field must be an object/union or an array/vector of such")
};

let archetype_name = &obj.fqname;
let archetype_field_name = field.snake_case_name();

quote!(ComponentDescriptor {
archetype_name: Some(#archetype_name.into()),
component_name: #component_name.into(),
archetype_field_name: Some(#archetype_field_name.into()),
})
})
})
.collect_vec();

let num_components = components.len();
let quoted_components = quote!(#(#components.into(),)*);
let num_descriptors = descriptors.len();
let quoted_descriptors = quote!(#(#descriptors,)*);

(num_components, quoted_components)
(num_descriptors, quoted_descriptors)
}

let indicator_name = format!("{}Indicator", obj.name);
let indicator_fqname = format!("{}Indicator", obj.fqname).replace("archetypes", "components");

let quoted_indicator_name = format_ident!("{indicator_name}");
let quoted_indicator_doc =
format!("Indicator component for the [`{name}`] [`::re_types_core::Archetype`]");

let (num_required, required) = compute_components(obj, ATTR_RERUN_COMPONENT_REQUIRED, []);
let (num_recommended, recommended) =
compute_components(obj, ATTR_RERUN_COMPONENT_RECOMMENDED, [indicator_fqname]);
let (num_optional, optional) = compute_components(obj, ATTR_RERUN_COMPONENT_OPTIONAL, []);
let (num_required_descriptors, required_descriptors) =
compute_component_descriptors(obj, ATTR_RERUN_COMPONENT_REQUIRED);
let (mut num_recommended_descriptors, mut recommended_descriptors) =
compute_component_descriptors(obj, ATTR_RERUN_COMPONENT_RECOMMENDED);
let (num_optional_descriptors, optional_descriptors) =
compute_component_descriptors(obj, ATTR_RERUN_COMPONENT_OPTIONAL);

num_recommended_descriptors += 1;
recommended_descriptors = quote! {
#recommended_descriptors
ComponentDescriptor {
archetype_name: Some(#archetype_name.into()),
component_name: #indicator_name.into(),
archetype_field_name: None,
},
};

let num_components_docstring = quote_doc_line(&format!(
"The total number of components in the archetype: {num_required} required, {num_recommended} recommended, {num_optional} optional"
"The total number of components in the archetype: {num_required_descriptors} required, {num_recommended_descriptors} recommended, {num_optional_descriptors} optional"
));
let num_all = num_required + num_recommended + num_optional;
let num_all_descriptors =
num_required_descriptors + num_recommended_descriptors + num_optional_descriptors;

let quoted_field_names = obj
.fields
@@ -1099,13 +1129,13 @@ fn quote_trait_impls_for_archetype(obj: &Object) -> TokenStream {

// NOTE: The nullability we're dealing with here is the nullability of an entire array of components,
// not the nullability of individual elements (i.e. instances)!
if is_nullable {
let batch = if is_nullable {
if obj.attrs.has(ATTR_RERUN_LOG_MISSING_AS_EMPTY) {
if is_plural {
// Always log Option<Vec<C>> as Vec<V>, mapping None to empty batch
let component_type = quote_field_type_from_typ(&obj_field.typ, false).0;
quote! {
Some((
Some(
if let Some(comp_batch) = &self.#field_name {
(comp_batch as &dyn ComponentBatch)
} else {
@@ -1114,24 +1144,44 @@ fn quote_trait_impls_for_archetype(obj: &Object) -> TokenStream {
let empty_batch: &#component_type = EMPTY_BATCH.get_or_init(|| Vec::new());
(empty_batch as &dyn ComponentBatch)
}
).into())
)
}
} else {
// Always log Option<C>, mapping None to empty batch
quote! { Some((&self.#field_name as &dyn ComponentBatch).into()) }
quote!{ Some(&self.#field_name as &dyn ComponentBatch) }
}
} else {
if is_plural {
// Maybe logging an Option<Vec<C>>
quote! { self.#field_name.as_ref().map(|comp_batch| (comp_batch as &dyn ComponentBatch).into()) }
quote!{ self.#field_name.as_ref().map(|comp_batch| (comp_batch as &dyn ComponentBatch)) }
} else {
// Maybe logging an Option<C>
quote! { self.#field_name.as_ref().map(|comp| (comp as &dyn ComponentBatch).into()) }
quote!{ self.#field_name.as_ref().map(|comp| (comp as &dyn ComponentBatch)) }
}
}
} else {
// Always logging a Vec<C> or C
quote! { Some((&self.#field_name as &dyn ComponentBatch).into()) }
quote!{ Some(&self.#field_name as &dyn ComponentBatch) }
};

let Some(component_name) = obj_field.typ.fqname() else {
panic!("Archetype field must be an object/union or an array/vector of such")
};
let archetype_name = &obj.fqname;
let archetype_field_name = obj_field.snake_case_name();

quote! {
(#batch).map(|batch| {
::re_types_core::ComponentBatchCowWithDescriptor {
batch: batch.into(),
descriptor_override: Some(ComponentDescriptor {
archetype_name: Some(#archetype_name.into()),
archetype_field_name: Some((#archetype_field_name).into()),
component_name: (#component_name).into(),
}),
}
})

}
}))
};
@@ -1215,21 +1265,21 @@ fn quote_trait_impls_for_archetype(obj: &Object) -> TokenStream {
};

quote! {
static REQUIRED_COMPONENTS: once_cell::sync::Lazy<[ComponentName; #num_required]> =
once_cell::sync::Lazy::new(|| {[#required]});
static REQUIRED_COMPONENTS: once_cell::sync::Lazy<[ComponentDescriptor; #num_required_descriptors]> =
once_cell::sync::Lazy::new(|| {[#required_descriptors]});

static RECOMMENDED_COMPONENTS: once_cell::sync::Lazy<[ComponentName; #num_recommended]> =
once_cell::sync::Lazy::new(|| {[#recommended]});
static RECOMMENDED_COMPONENTS: once_cell::sync::Lazy<[ComponentDescriptor; #num_recommended_descriptors]> =
once_cell::sync::Lazy::new(|| {[#recommended_descriptors]});

static OPTIONAL_COMPONENTS: once_cell::sync::Lazy<[ComponentName; #num_optional]> =
once_cell::sync::Lazy::new(|| {[#optional]});
static OPTIONAL_COMPONENTS: once_cell::sync::Lazy<[ComponentDescriptor; #num_optional_descriptors]> =
once_cell::sync::Lazy::new(|| {[#optional_descriptors]});

static ALL_COMPONENTS: once_cell::sync::Lazy<[ComponentName; #num_all]> =
once_cell::sync::Lazy::new(|| {[#required #recommended #optional]});
static ALL_COMPONENTS: once_cell::sync::Lazy<[ComponentDescriptor; #num_all_descriptors]> =
once_cell::sync::Lazy::new(|| {[#required_descriptors #recommended_descriptors #optional_descriptors]});

impl #name {
#num_components_docstring
pub const NUM_COMPONENTS: usize = #num_all;
pub const NUM_COMPONENTS: usize = #num_all_descriptors;
}

#[doc = #quoted_indicator_doc]
@@ -1249,29 +1299,29 @@ fn quote_trait_impls_for_archetype(obj: &Object) -> TokenStream {
}

#[inline]
fn indicator() -> MaybeOwnedComponentBatch<'static> {
fn indicator() -> ComponentBatchCowWithDescriptor<'static> {
static INDICATOR: #quoted_indicator_name = #quoted_indicator_name::DEFAULT;
MaybeOwnedComponentBatch::Ref(&INDICATOR)
ComponentBatchCowWithDescriptor::new(&INDICATOR as &dyn ::re_types_core::ComponentBatch)
}

#[inline]
fn required_components() -> ::std::borrow::Cow<'static, [ComponentName]> {
fn required_components() -> ::std::borrow::Cow<'static, [ComponentDescriptor]> {
REQUIRED_COMPONENTS.as_slice().into()
}

#[inline]
fn recommended_components() -> ::std::borrow::Cow<'static, [ComponentName]> {
fn recommended_components() -> ::std::borrow::Cow<'static, [ComponentDescriptor]> {
RECOMMENDED_COMPONENTS.as_slice().into()
}

#[inline]
fn optional_components() -> ::std::borrow::Cow<'static, [ComponentName]> {
fn optional_components() -> ::std::borrow::Cow<'static, [ComponentDescriptor]> {
OPTIONAL_COMPONENTS.as_slice().into()
}

// NOTE: Don't rely on default implementation so that we can keep everything static.
#[inline]
fn all_components() -> ::std::borrow::Cow<'static, [ComponentName]> {
fn all_components() -> ::std::borrow::Cow<'static, [ComponentDescriptor]> {
ALL_COMPONENTS.as_slice().into()
}

@@ -1302,11 +1352,10 @@ fn quote_trait_impls_for_archetype(obj: &Object) -> TokenStream {
}

impl ::re_types_core::AsComponents for #name {
fn as_component_batches(&self) -> Vec<MaybeOwnedComponentBatch<'_>> {
fn as_component_batches(&self) -> Vec<ComponentBatchCowWithDescriptor<'_>> {
re_tracing::profile_function!();

use ::re_types_core::Archetype as _;

[#(#all_component_batches,)*].into_iter().flatten().collect()
}
}
19 changes: 0 additions & 19 deletions crates/build/re_types_builder/src/codegen/rust/util.rs
@@ -50,25 +50,6 @@ pub fn is_tuple_struct_from_obj(obj: &Object) -> bool {
is_tuple_struct
}

pub fn iter_archetype_components<'a>(
obj: &'a Object,
requirement_attr_value: &'static str,
) -> impl Iterator<Item = String> + 'a {
assert_eq!(ObjectKind::Archetype, obj.kind);

obj.fields.iter().filter_map(move |field| {
field
.try_get_attr::<String>(requirement_attr_value)
.map(|_| {
if let Some(fqname) = field.typ.fqname() {
fqname.to_owned()
} else {
panic!("Archetype field must be an object/union or an array/vector of such")
}
})
})
}

pub fn string_from_quoted(
reporter: &Reporter,
acc: &TokenStream,