
Client-side chunks 4: integrations #6441

Merged · 5 commits · May 31, 2024

Conversation

teh-cmc (Member) commented May 27, 2024

Integrate the new chunk batcher in all SDKs, and get rid of the old one.

On the backend, we make sure to deserialize incoming chunks into the old DataTables, so business can continue as usual.
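Conceptually, the backend-side shim is tiny. A toy sketch of its shape (all names here are stand-ins, not the actual `re_chunk`/`re_log_types` API):

```rust
// Toy sketch only: `TransportChunk`, `DataTable` and the conversion helper
// are placeholders for the real types, not the actual API.
struct TransportChunk; // Arrow-encoded chunk fresh off the wire
struct DataTable; // the legacy row-oriented table

/// Decode an incoming chunk back into the legacy representation.
fn data_table_from_chunk(_chunk: &TransportChunk) -> DataTable {
    DataTable
}

fn main() {
    let incoming = TransportChunk;
    // Downstream code keeps consuming `DataTable`s: business as usual.
    let _table = data_table_from_chunk(&incoming);
}
```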

Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷‍♂️:

```
# this branch
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.499 s ±  0.117 s    [User: 5.544 s, System: 1.836 s]
  Range (min … max):    4.226 s …  4.640 s    15 runs

# main
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.407 s ±  0.773 s    [User: 8.423 s, System: 0.880 s]
  Range (min … max):    2.997 s …  6.148 s    15 runs
```

Notice the massive difference in user time: ~5.5 s on this branch vs. ~8.4 s on `main` (and far less run-to-run variance, too).


Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!

To run all checks from main, comment on the PR with @rerun-bot full-check.

teh-cmc (Member, Author) commented May 29, 2024

@rerun-bot full-check

jleibs (Member) left a comment

Nice! This already feels like an improvement without even considering the future benefits.

@teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 7b87a78 to 08999d7 on May 31, 2024 at 07:56
@teh-cmc force-pushed the cmc/dense_chunks_4_integration branch from 88abaec to 1292e9d on May 31, 2024 at 07:57
@teh-cmc removed the do-not-merge label on May 31, 2024
teh-cmc added a commit that referenced this pull request May 31, 2024
This new and improved `re_format_arrow` ™️ brings two major
improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or
batches), i.e. a `Schema` and a `Chunk`.
In particular: chunk-level and field-level schema metadata will now be
rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in,
while making sure to still show just enough data.
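
For reference, "a `Schema` and a `Chunk`" is all there is to an arrow2 dataframe, with untyped string metadata attached at both the schema and field level. A minimal sketch of the two metadata levels the formatter now renders (plain arrow2, not the actual `re_format_arrow` code; the metadata keys are made up for illustration):

```rust
use std::collections::BTreeMap;

use arrow2::{
    array::Int64Array,
    chunk::Chunk,
    datatypes::{DataType, Field, Schema},
};

fn main() {
    // Field-level metadata...
    let mut field_meta = BTreeMap::new();
    field_meta.insert("rerun.kind".to_owned(), "time".to_owned());
    let field = Field::new("frame_nr", DataType::Int64, false).with_metadata(field_meta);

    // ...and chunk-level (i.e. schema-level) metadata.
    let mut schema_meta = BTreeMap::new();
    schema_meta.insert("rerun.id".to_owned(), "adf1e8b".to_owned());
    let schema = Schema::from(vec![field]).with_metadata(schema_meta);

    // The dataframe itself: a `Chunk` of columns matching the schema.
    let chunk = Chunk::new(vec![Int64Array::from_slice([0, 1, 2]).boxed()]);

    // Both metadata levels are what now gets rendered with the table.
    println!("chunk-level metadata: {:?}", schema.metadata);
    for field in &schema.fields {
        println!("field `{}` metadata: {:?}", field.name, field.metadata);
    }
    println!("{} rows x {} columns", chunk.len(), chunk.arrays().len());
}
```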

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix`
example:
```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

before (`main`):

![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3)

and after:

![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741)


---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
teh-cmc added a commit that referenced this pull request May 31, 2024
…6438)

Introduces the new `re_chunk` crate:
> A chunk of Rerun data, encoded using Arrow. Used for logging,
transport, storage and compute.

Specifically, it introduces the `Chunk` type itself, and all methods and
helpers related to sorting.
A `Chunk` is self-describing: it contains all the data _and_ metadata
needed to index it into storage.

There are a lot of things that need to be sorted within a `Chunk`, and
as such we must make sure to keep track of what is or isn't sorted at
all times, to avoid needlessly re-sorting things every time a chunk
changes hands.
This necessitates a bunch of sanity checking all over the place to make
sure we never end up in undefined states.

`Chunk` is not about transport: it's about providing a nice-to-work-with
representation when manipulating a chunk in memory.
Transporting a `Chunk` happens in the next PR.
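
The bookkeeping amounts to carrying cached "is sorted" flags next to the data, short-circuiting sorts when a flag is already set, and asserting flag/data agreement in sanity checks. A toy illustration of the idea (not the actual `Chunk` API, which tracks this per-timeline and with far more checking):

```rust
/// Toy stand-in for one time column of a chunk.
struct TimeColumn {
    times: Vec<i64>,
    /// Cached flag: are `times` already non-decreasing?
    is_sorted: bool,
}

impl TimeColumn {
    fn new(times: Vec<i64>) -> Self {
        let is_sorted = times.windows(2).all(|w| w[0] <= w[1]);
        Self { times, is_sorted }
    }

    /// Sorting short-circuits on the cached flag: this is what avoids
    /// needlessly re-sorting every time the chunk changes hands.
    fn sort_if_needed(&mut self) {
        if !self.is_sorted {
            self.times.sort_unstable();
            self.is_sorted = true;
        }
    }

    /// The "sanity checking all over the place": the cached flag must
    /// never disagree with the actual data.
    fn sanity_check(&self) {
        if self.is_sorted {
            assert!(self.times.windows(2).all(|w| w[0] <= w[1]));
        }
    }
}

fn main() {
    let mut col = TimeColumn::new(vec![3, 1, 2]);
    col.sanity_check(); // fine: the flag correctly reports "unsorted"
    col.sort_if_needed(); // actually sorts
    col.sort_if_needed(); // no-op, thanks to the cached flag
    col.sanity_check();
}
```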

- Fixes #1981

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
teh-cmc added a commit that referenced this pull request May 31, 2024
A `TransportChunk` is a `Chunk` that is ready for transport and/or
storage.
It is very cheap to go from `Chunk` to a `TransportChunk` and
vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a
stable ABI, and can be cheaply sent across process boundaries.
`arrow2` has no `RecordBatch` type; we will get one once we migrate to
`arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_
metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate
Rerun-specific semantics over the wire, e.g. whether some columns are
already properly sorted.

The Arrow metadata system is fairly limited (it's all untyped strings),
but for now that seems good enough. It will be trivial to switch to
something else later, if need be.
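
Concretely, such a wire-level flag is just a string entry in the arrow2 schema metadata map, something along these lines (the key name is illustrative, not necessarily what Rerun actually uses):

```rust
use std::collections::BTreeMap;

use arrow2::datatypes::{DataType, Field, Schema};

fn main() {
    // Untyped string metadata: limited, but good enough for now.
    // (Hypothetical key name, for illustration only.)
    let mut metadata = BTreeMap::new();
    metadata.insert("rerun.is_sorted".to_owned(), "true".to_owned());

    let schema = Schema::from(vec![Field::new("frame_nr", DataType::Int64, false)])
        .with_metadata(metadata);

    // The receiving end reads the flag back to skip redundant re-sorting.
    let is_sorted = schema
        .metadata
        .get("rerun.is_sorted")
        .map_or(false, |v| v == "true");
    assert!(is_sorted);
}
```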

- Fixes #1760
- Fixes #1692
- Fixes #3360 
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
@teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 08999d7 to 22f7e61 on May 31, 2024 at 08:44
Base automatically changed from cmc/dense_chunks_3_batching to main on May 31, 2024 at 08:46
teh-cmc added a commit that referenced this pull request May 31, 2024
This is a fork of the old `DataTable` batcher, and works very similarly.

Like before, this batcher will micro-batch using both space and time
thresholds.
There are two main differences:
- This batcher maintains a dataframe per-entity, as opposed to the old
one which worked globally.
- Once a threshold is reached, this batcher further splits the incoming
batch in order to fulfill these invariants:
  ```rust
  /// In particular, a [`Chunk`] cannot:
  /// * contain data for more than one entity path
  /// * contain rows with different sets of timelines
  /// * use more than one datatype for a given component
  /// * contain more rows than a pre-configured threshold if one or more
  ///   timelines are unsorted
  ```

Most of the code is the same; the really interesting piece is
`PendingRow::many_into_chunks`, as well as the newly added tests.
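
The split step is essentially a group-by over the invariants above. For the "same set of timelines" invariant, for instance, it boils down to something like this simplified sketch (the real logic lives in `PendingRow::many_into_chunks`):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified stand-in for a pending row: only its timeline names matter here.
struct PendingRow {
    timelines: BTreeSet<&'static str>,
}

/// Split a batch of rows so that every sub-batch shares the exact same set
/// of timelines (one of the `Chunk` invariants listed above).
fn split_by_timeline_set(rows: Vec<PendingRow>) -> Vec<Vec<PendingRow>> {
    let mut groups: BTreeMap<_, Vec<PendingRow>> = BTreeMap::new();
    for row in rows {
        groups.entry(row.timelines.clone()).or_default().push(row);
    }
    groups.into_values().collect()
}

fn main() {
    let rows = vec![
        PendingRow { timelines: BTreeSet::from(["frame_nr", "log_time"]) },
        PendingRow { timelines: BTreeSet::from(["log_time"]) },
        PendingRow { timelines: BTreeSet::from(["frame_nr", "log_time"]) },
    ];
    // Two sub-batches: {frame_nr, log_time} twice, {log_time} once.
    assert_eq!(split_by_timeline_set(rows).len(), 2);
}
```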

- Fixes #4431

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
@teh-cmc force-pushed the cmc/dense_chunks_4_integration branch from 1292e9d to acd8cd8 on May 31, 2024 at 08:48
@teh-cmc merged commit 9a86ad5 into main on May 31, 2024 (33 checks passed)
@teh-cmc deleted the cmc/dense_chunks_4_integration branch on May 31, 2024 at 08:51
Labels: 🌊 C++ API · include in changelog · 🐍 Python API · 🦀 Rust API