Skip to content

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Jul 9, 2025

Which issue does this PR close?

Rationale for this change

Variants naturally work with u32 offsets, field ids, etc. Widening them artificially to usize on 64-bit architectures causes several problems:

  1. A majority of developers will be using 64-bit architectures, and are unlikely to think about integer overflow issues when working with usize. But it's actually quite easy for malicious data or buggy code to overflow the u32 values that variant actually relies on. Worse, it becomes difficult, if not impossible, to validate the code's resistance to 32-bit integer overflow, when manipulating usize values on 64-bit hardware.
  2. Related to 1/, casting from usize to u32 can clip the value on 64-bit hardware, which makes it harder to reason about the code's correctness (always wondering whether the value might be larger than 32-bits can hold). In contrast, casting from u32 to usize is safe in spite of being fallible in rust (assumes we do not need to support 16-bit architectures).
  3. The variant-related data structures occupy significantly more space than they need to, when storing (64-bit) usize offsets instead of u32.

What changes are included in this PR?

Store all variant-related offsets as u32 instead of usize. The VariantMetadata, VariantObject and VariantList structs shrink to 32/64/64 bytes (previously 40/88/80 bytes).

Also, rename OffsetSizeBytes::unpack_usize[_at_offset] methods to unpack_u32[_at_offset], to more accurately reflect what they actually do now.

Are these changes tested?

Existing unit tests cover the use of these values; new static assertions will catch any future size changes.

Are there any user-facing changes?

VariantMetadata is no longer Copy, reflecting the fact that this PR still leaves it 2x larger than a fat pointer.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 9, 2025
header
.num_elements_size
.unpack_usize_at_offset(value, NUM_HEADER_BYTES, 0)?;
.unpack_u32_at_offset(value, NUM_HEADER_BYTES as _, 0)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an annoying side effect of using a named constant... the literal 1 would "just work" for both u32 and usize.

// Use the last offset to upper-bound the value buffer
let last_offset = new_self
.get_offset(num_elements)?
.get_offset(num_elements as _)?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than do a bunch of try_into().map_err(...) calls, just admit that converting u32 to usize is infallible for all practical purposes -- I seriously doubt arrow-rs can run on 16-bit hardware where usize might be only 16 bits.

(I don't love blind as _ casting in general -- too easy to cast to something unexpected or ignore the implications of the cast -- but it seems ok in this specific set of cases)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is fine

.checked_add(first_value_byte)
.ok_or_else(|| overflow_error("variant array size"))?;
new_self.value = slice_from_slice(value, ..last_offset)?;
new_self.value = slice_from_slice(value, ..last_offset as _)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the SliceIndex trait only works for usize, so we have to widen the u32 values whenever we create one.

slice_from_slice_at_offset(self.value, self.first_value_byte, byte_range)?;
Variant::try_new_with_metadata_and_shallow_validation(self.metadata, value_bytes)
slice_from_slice_at_offset(self.value, self.first_value_byte as _, byte_range)?;
Variant::try_new_with_metadata_and_shallow_validation(self.metadata.clone(), value_bytes)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VariantMetadata is no longer Copy

Comment on lines 140 to 143
#[cfg(test)]
const _: () = if std::mem::size_of::<VariantMetadata>() != 32 {
panic!("VariantMetadata changed size, which will impact VariantList and VariantObject");
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This snippet will produce a compilation failure if the struct size changes.
Seemed more robust/immediate than a doc test or unit test.

fn iter_try_with_shallow_validation(
&self,
) -> impl Iterator<Item = Result<Variant<'m, 'v>, ArrowError>> + '_ {
(0..self.len()).map(move |i| self.try_get_with_shallow_validation(i))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing to move here...

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich -- this makes sense to me and I think is a nice change

// Use the last offset to upper-bound the value buffer
let last_offset = new_self
.get_offset(num_elements)?
.get_offset(num_elements as _)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is fine

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

FYI @friendlymatthew

Comment on lines 42 to 51
impl VariantMetadataHeader {
// Hide the cast
const fn offset_size(&self) -> usize {
self.offset_size as usize
const fn offset_size(&self) -> u32 {
self.offset_size as u32
}

// Avoid materializing this offset, since it's cheaply and safely computable
const fn first_offset_byte(&self) -> usize {
const fn first_offset_byte(&self) -> u32 {
NUM_HEADER_BYTES + self.offset_size()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi VariantMetadataHeader is currently a u8 that encodes 3 pieces of information. I'm wondering if, instead of storing each piece separately as fields, we could store just the u8 itself and extract the individual components using bitmasking when needed.

If we are aiming to minimize the byte footprint, it's a bit unfortunate that we're storing 3 times more bytes than necessary fro this data. Plus, deriving the values from the byte is not computationally expensive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will change anything to shrink the header from 3 bytes to 1 byte?
VariantMetadata currently has 4 bytes padding due to its 8-byte alignment, and decreasing the header size would just change that to 6 bytes padding.

}

// We don't want this to grow because it could hurt performance of a frequently-created type.
const _: () = crate::utils::expect_size_of::<Variant>(80);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New addition -- I encapsulated the size check into a const helper function, whose compilation failure includes the object's actual size (otherwise, have to guess what the size was):

error[E0080]: evaluation of constant value failed
   --> parquet-variant/src/utils.rs:139:17
    |
139 |         let _ = ["";0][size];
    |                 ^^^^^^^^^^^^ index out of bounds: the length is 0 but the index is 80
    |
note: inside `utils::expect_size_of::<variant::Variant<'_, '_>>`
   --> parquet-variant/src/utils.rs:139:17
    |
139 |         let _ = ["";0][size];
    |                 ^^^^^^^^^^^^
note: inside `variant::_`
   --> parquet-variant/src/variant.rs:260:15
    |
260 | const _: () = crate::utils::expect_size_of::<Variant>(64);
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As can be seen, a check for Variant itself exposes the fact that it's 80 bytes. I can't figure out why -- VariantObject and VariantList are the only big enum variants (the next-biggest is 32 bytes)??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming 8-byte alignment, and observing that we use explicit structs for enum variant payloads which will prevent layout optimizations, I would expect the (one-byte) discriminator for Variant to push the size from 64 bytes (biggest enum variant payload) to 72 bytes (next alignment boundary). Where did the other 8 bytes come from??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch... figured it out -- i128 has 16-byte alignment, which poisons VariantDecimal16 (whose u8 scale pushes the size from 16 to 17 to 32 bytes) and Variant (whose one-byte discriminator pushes the size from 64 to 65 to 80 bytes).

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

Looks like this PR has some clippy failures too -- once that is clean I'll plan to merge

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

I merged up from main to resolve a logical conflcit

@scovich
Copy link
Contributor Author

scovich commented Jul 11, 2025

Looks like docs job is broken upstream?

error: unnecessary parentheses around closure body
   --> arrow-ord/src/cmp.rs:276:43
    |
276 |                     let c = |((l, r), n)| ((l ^ r) | (l & r & n));
    |                                           ^                     ^
    |
    = note: `-D unused-parens` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(unused_parens)]`
help: remove these parentheses
    |
276 -                     let c = |((l, r), n)| ((l ^ r) | (l & r & n));
276 +                     let c = |((l, r), n)| (l ^ r) | (l & r & n);
    |

 Documenting arrow-cast v55.2.0 (/__w/arrow-rs/arrow-rs/arrow-cast)
error: could not document `arrow-ord`

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

Looks like docs job is broken upstream?

@viirya fixed it in #7898

@scovich
Copy link
Contributor Author

scovich commented Jul 11, 2025

Looks like a bunch more logical conflicts... also I'm seeing this:

warning: unknown lint: `unnecessary_transmutes`
   --> arrow-array/src/arithmetic.rs:424:17
    |
424 |         #[allow(unnecessary_transmutes)]
    |                 ^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(unknown_lints)]` on by default

warning: unknown lint: `unnecessary_transmutes`
   --> arrow-array/src/arithmetic.rs:430:17
    |
430 |         #[allow(unnecessary_transmutes)]
    |                 ^^^^^^^^^^^^^^^^^^^^^^

warning: unknown lint: `unnecessary_transmutes`
   --> arrow-array/src/arithmetic.rs:441:17
    |
441 |         #[allow(unnecessary_transmutes)]
    |                 ^^^^^^^^^^^^^^^^^^^^^^

warning: unknown lint: `unnecessary_transmutes`
   --> arrow-array/src/arithmetic.rs:447:17
    |
447 |         #[allow(unnecessary_transmutes)]
    |                 ^^^^^^^^^^^^^^^^^^^^^^

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

gogogogo

@alamb alamb merged commit 42b6c17 into apache:main Jul 11, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

Thanks again @scovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] VariantMetadata, VariantList and VariantObject are too big for Copy

3 participants