Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordBatch get_array_memory_size returns incorrect size if underlying buffers are shared #5969

Open
HammadB opened this issue Jun 27, 2024 · 1 comment
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@HammadB
Copy link

HammadB commented Jun 27, 2024

Describe the bug

The implementation of get_array_memory_size is incorrect according to its documentation which states that it "Returns the total number of bytes of memory occupied physically by this batch." If the underlying buffers are shared in the record batch, this function will overreport the size. This can happen for example if you write to the Arrow IPC format, as when your read back, as all data is continuous in one buffer.

https://docs.rs/arrow-array/52.0.0/src/arrow_array/record_batch.rs.html#472

To Reproduce

  1. Create a record batch
  2. Write to Arrow IPC
  3. Load it, and call get_array_memory_size -> size will be off by potentially many multiples

Expected behavior

I'd expect the sizing to be the actual total size across the unique buffers in the record batch.

Additional context

@HammadB HammadB added the bug label Jun 27, 2024
@tustvold
Copy link
Contributor

There is https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size that might be what you're looking for? If you create a RecordBatch from some portion of an IPC file buffer, it is unclear what is the correct value for this API to return.

We should probably better document that this is only ever going to be a best effort approximation and people should manage allocations themselves if they need accurate accounting

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Jun 28, 2024
HammadB added a commit to chroma-core/chroma that referenced this issue Jul 2, 2024
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- Arrow will overreport the size of a buffer if the underlying buffers
are shared (apache/arrow-rs#5969).
- apache/arrow-rs#5554 Exposes the ability to
enforce alignment at write time. This PR enables this option explicitly
and upgrades arrow to take advantage of it. We don't change from the
default alignment but this is defensive.
- See the comments in get_size() for further understanding of this PR
(https://github.com/chroma-core/chroma/pull/2426/files#diff-03bcd4f01acfa68c46fcc974d3722fa621056b4e0f908d708a2d15028b0e99b1R410)
- Cleans up various error handling and documentation - adding explicit
errors and removing panics as needed
 - New functionality
	 - None

## Test plan
*How are these changes tested?*
Updated all tests in delta.rs to save, load and check the sizes match
- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust

## Documentation Changes
None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants