
Support sending a DataCell's size (& other metadata) over the wire #1760

Closed
Tracked by #1899
teh-cmc opened this issue Apr 4, 2023 · 2 comments · Fixed by #6439
Labels
🏹 arrow concerning arrow 📉 performance Optimization, memory use, etc

Comments


teh-cmc commented Apr 4, 2023

This would allow us to compute the size of DataCells (a very costly operation) on the clients, and therefore:

  • distribute the load to the clients rather than swamping the server
  • do the computation on the batching cold path, where we have all the time in the world
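To illustrate the idea, here is a minimal Rust sketch of client-side size computation. The `DataCell` struct, the `heap_size_bytes` method, and the `rerun.size_bytes` metadata key are all hypothetical stand-ins for Rerun's actual types, not the real API:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cell's underlying Arrow buffers.
struct DataCell {
    buffers: Vec<Vec<u8>>,
}

impl DataCell {
    /// Walking the buffers is the costly part; doing it once on the
    /// client's batching cold path means the server never has to.
    fn heap_size_bytes(&self) -> u64 {
        self.buffers.iter().map(|b| b.len() as u64).sum()
    }
}

/// Attach the precomputed size as untyped string metadata, which is
/// the kind of key-value pairs Arrow metadata supports.
fn attach_size_metadata(cell: &DataCell, metadata: &mut HashMap<String, String>) {
    metadata.insert(
        "rerun.size_bytes".to_owned(),
        cell.heap_size_bytes().to_string(),
    );
}

fn main() {
    let cell = DataCell {
        buffers: vec![vec![0u8; 64], vec![0u8; 128]],
    };
    let mut metadata = HashMap::new();
    attach_size_metadata(&cell, &mut metadata);
    println!("{}", metadata["rerun.size_bytes"]); // prints 192
}
```

The server then only has to parse one small string instead of walking every buffer of every incoming cell.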
@teh-cmc teh-cmc added 🏹 arrow concerning arrow 📉 performance Optimization, memory use, etc labels Apr 4, 2023
@teh-cmc teh-cmc changed the title Support sending a DataCell's size over the wire Support sending a DataCell's size (& other metadata) over the wire Apr 4, 2023

teh-cmc commented Apr 7, 2023

This would also mean that cell sizes get serialized to disk when saving the store to an rrd file, so reloading it later will be much faster.


teh-cmc commented Apr 18, 2023

The size computation now happens on the clients no matter what (we need the value for the `size_bytes` trigger of the batching system), so not sending it over the wire is a literal waste of compute resources.

@teh-cmc teh-cmc self-assigned this May 16, 2024
teh-cmc added a commit that referenced this issue May 31, 2024
A `TransportChunk` is a `Chunk` that is ready for transport and/or
storage.
It is very cheap to go from `Chunk` to a `TransportChunk` and
vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a
stable ABI and can be cheaply sent across process boundaries.
`arrow2` has no `RecordBatch` type; we will get one once we migrate to
`arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_
metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate
Rerun-specific semantics over the wire, e.g. whether some columns are
already properly sorted.

The Arrow metadata system is fairly limited (it's all untyped strings),
but for now that seems good enough. It will be trivial to switch to
something else later, if need be.
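To make the "untyped strings" limitation concrete, here is a small Rust sketch of how a Rerun-specific flag would have to be round-tripped through Arrow-style string metadata. The `rerun.is_sorted` key and the `is_sorted` helper are illustrative assumptions, not the actual keys or API used by the codebase:

```rust
use std::collections::HashMap;

/// Arrow metadata is just string key-value pairs, so any typed
/// semantics (like a "this column is sorted" boolean) must be
/// encoded and decoded by convention on both ends of the wire.
fn is_sorted(metadata: &HashMap<String, String>) -> bool {
    metadata
        .get("rerun.is_sorted") // hypothetical key, for illustration
        .map(|v| v == "true")
        .unwrap_or(false)
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert("rerun.is_sorted".to_owned(), "true".to_owned());
    println!("sorted: {}", is_sorted(&metadata));
}
```

Since both sides agree on the key names and string encodings, swapping in a richer metadata scheme later only means changing these conventions in one place.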

- Fixes #1760
- Fixes #1692
- Fixes #3360 
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441