Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in size estimation of array buffers #2991

Merged
merged 15 commits into from
Aug 16, 2023
Merged

Conversation

emilk
Copy link
Member

@emilk emilk commented Aug 15, 2023

An off-by-one bug in estimated_bytes_size caused the size estimations of Tensors to completely ignore the actual payload, causing the memory consumption of images to be vastly underestimated.

I decided to rewrite the code to be more clear for future readers.

This discovered a difference in how validity bitmaps are written by Rust and Python+C++, fixed in a828441.

To help find this problem a lot of work was also put into improving the output of scripts/ci/run_e2e_roundtrip_tests.py.

Checklist

@emilk emilk added 🪳 bug Something isn't working 🏹 arrow concerning arrow labels Aug 15, 2023
@Wumpf Wumpf self-requested a review August 15, 2023 16:52
An off-by-one bug in `estimated_bytes_size` caused the size estimations
of Tensors to completely ignore the actual payload, causing the memory
consumption of images to be vastly underestimated.

I decided to rewrite the code to be more clear for future readers.
@emilk emilk force-pushed the emilk/fix-size-estimation-bug branch from b0f91ba to cae4e4e Compare August 15, 2023 16:53
Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@jleibs jleibs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch

@emilk
Copy link
Member Author

emilk commented Aug 15, 2023

This seems to be catching a bug in the e2e tests now! That's good, I guess :) But needs debugging…

I should probably a dd a unit-test also to make sure the new code is correct, and won't regress.

@emilk
Copy link
Member Author

emilk commented Aug 15, 2023

It is translation_and_mat3x3/identity that fails:

image

and with a slightly different diff each time:

Screenshot 2023-08-15 at 20 56 41

Co-authored-by: Jeremy Leibs <[email protected]>
@emilk
Copy link
Member Author

emilk commented Aug 15, 2023

So I've found the difference: the python version writes a validity bitmaps where the rust version doesn't. Maybe the rust arrow library omits them when they are all set or something. I will dig deeper. Still, this is a real difference in the data that rerun compare didn't catch until now. Perhaps though it is a difference that should be ignored.

@emilk
Copy link
Member Author

emilk commented Aug 15, 2023

So on the Rust size, TranslationAndMat3x3 has this datatype (crates/re_types/src/datatypes/translation_and_mat3x3.rs):

DataType::Struct(vec![
    Field {
        name: "translation".to_owned(),
        data_type: <crate::datatypes::Vec3D>::to_arrow_datatype(),
        is_nullable: true,
        metadata: [].into(),
    },
    Field {
        name: "matrix".to_owned(),
        data_type: <crate::datatypes::Mat3x3>::to_arrow_datatype(),
        is_nullable: true,
        metadata: [].into(),
    },
    Field {
        name: "from_parent".to_owned(),
        data_type: DataType::Boolean,
        is_nullable: false,
        metadata: [].into(),
    },
])

while Python has this (rerun_py/rerun_sdk/rerun/_rerun2/datatypes/transform3d.py):

pa.struct(
    [
        pa.field(
            "translation",
            pa.list_(pa.field("item", pa.float32(), nullable=False, metadata={}), 3),
            nullable=True,
            metadata={},
        ),
        pa.field(
            "matrix",
            pa.list_(pa.field("item", pa.float32(), nullable=False, metadata={}), 9),
            nullable=True,
            metadata={},
        ),
        pa.field("from_parent", pa.bool_(), nullable=False, metadata={}),
    ]
),

EDIT: @jleibs pointed out this is not wrong

@jleibs
Copy link
Member

jleibs commented Aug 15, 2023

Notice how Rust sets the first fields as nullable, but Python doesn't!

I think you're misreading the python. The field that's set non-nullible there is the inner nested type, not the outer type.

pa.field("matrix", pa.list_(pa.field("item", pa.float32(), False, {}), 9), True, {})

For clarity it would be nice if this were re-implemented recursively as:

pa.field("matrix", Vec3D.data_type(), True, {})

But unless the inner types don't match those look like they are all doing the same thing.

@emilk
Copy link
Member Author

emilk commented Aug 15, 2023

Oh, you're right… I got confused by how Rust and C++ calls out to other functions for the inner datatype, while Python repeats/inlines it. Nevermind then.

@emilk
Copy link
Member Author

emilk commented Aug 15, 2023

The Python encoder is outputting validities with all zeroes, but the Rust and C++ arrow encoder omits the validity.

Maybe it is an optimization in Rust and C++ to only output the validity if it is non-zero? That works if the encoder interprets a missing validity as "all nulls".

From https://arrow.apache.org/docs/format/Columnar.html#sparse-union:

Arrays having a 0 null count may choose to not allocate the validity bitmap; how this is represented depends on the implementation (for example, a C++ implementation may represent such an “absent” validity bitmap using a NULL pointer). Implementations may choose to always allocate a validity bitmap anyway as a matter of convenience. Consumers of Arrow arrays should be ready to handle those two possibilities.

This sounds to me like an omitted validity map means "all set = no nulls"… I'm giving up for today.

@jleibs
Copy link
Member

jleibs commented Aug 15, 2023

The Python encoder is outputting validities with all zeroes, but the Rust and C++ arrow encoder omits the validity.

This doesn't match what I'm seeing. For me, Python & C++ both output the same encoding and Rust is the one that's different.

# Conflicts:
#	crates/re_types/source_hash.txt
#	crates/re_types_builder/src/codegen/rust/serializer.rs
@emilk emilk merged commit 04d5491 into main Aug 16, 2023
@emilk emilk deleted the emilk/fix-size-estimation-bug branch August 16, 2023 07:18
@Wumpf Wumpf changed the title Fix big bug in size estimation of array buffers Fix bug in size estimation of array buffers Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏹 arrow concerning arrow 🪳 bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants