
Support sending a DataCell's size (& other metadata) over the wire #1760

Closed
Tracked by #1899
teh-cmc opened this issue Apr 4, 2023 · 2 comments · Fixed by #6439
Labels
🏹 arrow concerning arrow 📉 performance Optimization, memory use, etc

Comments


teh-cmc commented Apr 4, 2023

This would allow us to compute the size of DataCells (a very costly operation) on the clients, and therefore:

  • distribute the load to the clients rather than swamping the server
  • do the computation on the batching cold path, where we have all the time in the world
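To illustrate the idea, here is a minimal Rust sketch of client-side size computation. The `DataCell` struct, the `heap_size_bytes` method, and the `rerun.size_bytes` metadata key are all hypothetical stand-ins for Rerun's actual types, not the real API:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cell's underlying Arrow buffers.
struct DataCell {
    buffers: Vec<Vec<u8>>,
}

impl DataCell {
    /// Walking the buffers is the costly part; doing it once on the
    /// client's batching cold path means the server never has to.
    fn heap_size_bytes(&self) -> u64 {
        self.buffers.iter().map(|b| b.len() as u64).sum()
    }
}

/// Attach the precomputed size as untyped string metadata, which is
/// the kind of key-value pairs Arrow metadata supports.
fn attach_size_metadata(cell: &DataCell, metadata: &mut HashMap<String, String>) {
    metadata.insert(
        "rerun.size_bytes".to_owned(),
        cell.heap_size_bytes().to_string(),
    );
}

fn main() {
    let cell = DataCell {
        buffers: vec![vec![0u8; 64], vec![0u8; 128]],
    };
    let mut metadata = HashMap::new();
    attach_size_metadata(&cell, &mut metadata);
    println!("{}", metadata["rerun.size_bytes"]); // prints 192
}
```

The server then only has to parse one small string instead of walking every buffer of every incoming cell.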
@teh-cmc teh-cmc added 🏹 arrow concerning arrow 📉 performance Optimization, memory use, etc labels Apr 4, 2023
@teh-cmc teh-cmc changed the title Support sending a DataCell's size over the wire Support sending a DataCell's size (& other metadata) over the wire Apr 4, 2023

teh-cmc commented Apr 7, 2023

This would also mean that cell sizes get serialized to disk when saving the store to an rrd file, so reloading it later will be much faster.


teh-cmc commented Apr 18, 2023

The size computation now happens on the clients no matter what (we need the value for the `size_bytes` trigger of the batching system), so not sending it over the wire is a literal waste of compute resources.

@teh-cmc teh-cmc self-assigned this May 16, 2024
teh-cmc added a commit that referenced this issue May 31, 2024
A `TransportChunk` is a `Chunk` that is ready for transport and/or
storage.
It is very cheap to go from `Chunk` to a `TransportChunk` and
vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a
stable ABI and can be cheaply sent across process boundaries.
`arrow2` has no `RecordBatch` type; we will get one once we migrate to
`arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_
metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate
Rerun-specific semantics over the wire, e.g. whether some columns are
already properly sorted.

The Arrow metadata system is fairly limited (it's all untyped strings),
but for now that seems good enough. It will be trivial to switch to
something else later, if need be.
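To make the "untyped strings" limitation concrete, here is a small Rust sketch of how a Rerun-specific flag would have to be round-tripped through Arrow-style string metadata. The `rerun.is_sorted` key and the `is_sorted` helper are illustrative assumptions, not the actual keys or API used by the codebase:

```rust
use std::collections::HashMap;

/// Arrow metadata is just string key-value pairs, so any typed
/// semantics (like a "this column is sorted" boolean) must be
/// encoded and decoded by convention on both ends of the wire.
fn is_sorted(metadata: &HashMap<String, String>) -> bool {
    metadata
        .get("rerun.is_sorted") // hypothetical key, for illustration
        .map(|v| v == "true")
        .unwrap_or(false)
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert("rerun.is_sorted".to_owned(), "true".to_owned());
    println!("sorted: {}", is_sorted(&metadata));
}
```

Since both sides agree on the key names and string encodings, swapping in a richer metadata scheme later only means changing these conventions in one place.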

- Fixes #1760
- Fixes #1692
- Fixes #3360 
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441