Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement support for nullable unions using a "virtual arm" approach #2684

Closed
teh-cmc opened this issue Jul 12, 2023 · 1 comment · Fixed by #2708
Closed

Implement support for nullable unions using a "virtual arm" approach #2684

teh-cmc opened this issue Jul 12, 2023 · 1 comment · Fixed by #2708
Assignees
Labels
🏹 arrow concerning arrow codegen/idl 🦀 Rust API Rust logging API

Comments

@teh-cmc
Copy link
Member

teh-cmc commented Jul 12, 2023

Recap

For some reason unknown to me, contrary to every other Arrow types, Arrow unions don't have top-level bitmaps.
This means that they cannot express their own nullability, as opposed to the respective nullabilities of their children fields.

I.e. you can express this:

enum MyEnum {
    OptionalInt(Option<u32>),
    MandatoryBoolean(bool),
}

struct MyComponent {
    value: MyEnum,
}

but not this:

struct MyComponent {
    value: Option<MyEnum>,
}

To work around the issue (again, I have no clue as to why it has to be an issue to begin with), the spec suggest piggybacking on the nullability on an arbitrary picked children to determine the nullability of the enum as a whole.

See e.g. this example taken from the spec:
image

Not only is this inconsistent and really annoying to deal with on the (de)serialization paths, this is also ambiguous (how do I differentiate between a null enum vs. a null float in the first array in the example above?).

Worse: how could you piggyback on one of your children's nullability if none of your children are nullable in the first place? A situation that arises all over the place in our component definitions.

arrow2-convert's answer to the issue is, as far as I can tell, to put nullable data into something that is otherwise advertised as non-nullable and patch things up as they go in and out.
E.g. here's how a nullable Rotation3D is serialized today:

[crates/re_components/src/transform3d.rs:595] array.fields()[0].as_any().downcast_ref::<FixedSizeListArray>().unwrap() = FixedSizeListArray[None]
[crates/re_components/src/transform3d.rs:599] array.data_type() = FixedSizeList(
    Field {
        name: "item",
        data_type: Float32,
        is_nullable: false,
        metadata: {},
    },
    4,
)
[crates/re_components/src/transform3d.rs:600] array.validity() = Some(
    [0b_______0],
)

Here we can see a Quaternion, which is really a non-nullable FixedSizeList(f32, 4), being serialized as a null value anyhow so the outer union can be tracked as null in and of itself.

This is very much related to #795.

Solution

Emil proposed another approach: use a virtual branch instead in contexts where the union is supposed to be nullable.

That virtual branch would only be used (and visible) when serializing data in and out, and its values would always represent nulls.

@teh-cmc teh-cmc added 🏹 arrow concerning arrow 🦀 Rust API Rust logging API codegen/idl labels Jul 12, 2023
@teh-cmc teh-cmc self-assigned this Jul 12, 2023
@teh-cmc teh-cmc changed the title Implement support for nullable unions using a "virtual branch" approach Implement support for nullable unions using a "virtual arm" approach Jul 17, 2023
@emilk
Copy link
Member

emilk commented Jul 18, 2023

The C++ implementation of unions (added in #2707) already has an EMPTY state which is used in order to implement move-semantics (C++ move semantics require objects to be able to have some sort of "null" state), so this goes well with that. See for instance: https://github.com/rerun-io/rerun/pull/2707/files#diff-d6f57c55fc2a94ed41f28bc249f9e21cd45f66c3fc1c4632127e8e4ee630fd70

teh-cmc added a commit that referenced this issue Jul 19, 2023
**Commit by commit!**

This PR implements support for top-level nullable Arrow unions using the
"virtual arm" strategy.

Specifically, all unions derived from Flatbuffers definitions now have
an auto-generated arm at type index `0` of type `DataType::Null`.
The data values in this arm are irrelevant: the deserialization code
will treat the union as a whole as `null` any time the type buffer
indicates a value of `0`.

I've decided to make the auto-generated arm visible in the
auto-generated `DataType` definition because... why wouldn't we? It
makes it all much less magical and acts as a reminder that this thing is
indeed here.

The choice of using type index `0` rather than e.g. `nb_arms` is to make
things A) more stable and B) forwards compatible.
It cannot surprise end users in any way since user-defined types defined
in Flatbuffers would just do the right thing, while user-defined types
in raw Arrow don't even go through our (de)serialization logic but
rather through theirs.

In addition, this PR greatly improves error handling in the
auto-generated code, giving the end-user very specific context and a
context-correct backtrace when something goes wrong.


Fixes #2684

### What

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/2708) (if
applicable)

- [PR Build Summary](https://build.rerun.io/pr/2708)
- [Docs
preview](https://rerun.io/preview/pr%3Acmc%2Frust_virtual_union_arms/docs)
- [Examples
preview](https://rerun.io/preview/pr%3Acmc%2Frust_virtual_union_arms/examples)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏹 arrow concerning arrow codegen/idl 🦀 Rust API Rust logging API
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants