Fix bug in size estimation of array buffers #2991
Conversation
An off-by-one bug in `estimated_bytes_size` caused the size estimations of Tensors to completely ignore the actual payload, causing the memory consumption of images to be vastly underestimated. I decided to rewrite the code to be more clear for future readers.
Force-pushed from b0f91ba to cae4e4e.
👍
Nice catch
This seems to be catching a bug in the e2e tests now! That's good, I guess :) But it needs debugging… I should probably also add a unit test to make sure the new code is correct and won't regress.
Co-authored-by: Jeremy Leibs <[email protected]>
So I've found the difference: the Python version writes validity bitmaps where the Rust version doesn't. Maybe the Rust arrow library omits them when they are all set, or something. I will dig deeper. Still, this is a real difference in the data.
So on the Rust side:

```rust
DataType::Struct(vec![
    Field {
        name: "translation".to_owned(),
        data_type: <crate::datatypes::Vec3D>::to_arrow_datatype(),
        is_nullable: true,
        metadata: [].into(),
    },
    Field {
        name: "matrix".to_owned(),
        data_type: <crate::datatypes::Mat3x3>::to_arrow_datatype(),
        is_nullable: true,
        metadata: [].into(),
    },
    Field {
        name: "from_parent".to_owned(),
        data_type: DataType::Boolean,
        is_nullable: false,
        metadata: [].into(),
    },
])
```

while Python has this:

```python
pa.struct(
    [
        pa.field(
            "translation",
            pa.list_(pa.field("item", pa.float32(), nullable=False, metadata={}), 3),
            nullable=True,
            metadata={},
        ),
        pa.field(
            "matrix",
            pa.list_(pa.field("item", pa.float32(), nullable=False, metadata={}), 9),
            nullable=True,
            metadata={},
        ),
        pa.field("from_parent", pa.bool_(), nullable=False, metadata={}),
    ]
)
```

EDIT: @jleibs pointed out this is not wrong.
I think you're misreading the Python. The field that's set non-nullable there is the inner nested type, not the outer type.
For clarity it would be nice if this were re-implemented recursively. But unless the inner types don't match, those look like they're all doing the same thing.
Oh, you're right… I got confused by how Rust and C++ call out to other functions for the inner datatype, while Python repeats/inlines it. Never mind, then.
The Python encoder is outputting validities with all zeroes, but the Rust and C++ arrow encoders omit the validity. Maybe it is an optimization in Rust and C++ to only output the validity if it is non-zero? That works if the decoder interprets a missing validity as "no nulls". From https://arrow.apache.org/docs/format/Columnar.html#sparse-union:

This sounds to me like an omitted validity map means "all set = no nulls"… I'm giving up for today.
This doesn't match what I'm seeing. For me, Python and C++ both output the same encoding, and Rust is the one that's different.
```
# Conflicts:
#	crates/re_types/source_hash.txt
#	crates/re_types_builder/src/codegen/rust/serializer.rs
```
An off-by-one bug in `estimated_bytes_size` caused the size estimations of Tensors to completely ignore the actual payload, causing the memory consumption of images to be vastly underestimated. I decided to rewrite the code to be more clear for future readers.

This discovered a difference in how validity bitmaps are written by Rust and Python+C++, fixed in a828441.

To help find this problem, a lot of work was also put into improving the output of `scripts/ci/run_e2e_roundtrip_tests.py`.

Checklist