Client-side chunks 3: micro-batching #6440

teh-cmc · 2024-05-27T15:38:13Z

This is a fork of the old DataTable batcher, and works very similarly.

Like before, this batcher micro-batches using both space and time thresholds.
There are two main differences:

- This batcher maintains a dataframe per entity, whereas the old one worked globally.

- Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants:

/// In particular, a [`Chunk`] cannot:
/// * contain data for more than one entity path
/// * contain rows with different sets of timelines
/// * use more than one datatype for a given component
/// * contain more rows than a pre-configured threshold if one or more timelines are unsorted

Most of the code is the same; the really interesting piece is `PendingRow::many_into_chunks`, as well as the newly added tests.
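To make the splitting step concrete, here is a minimal sketch of how a flushed batch could be divided so that the four invariants hold. It is not the code from this PR: `Row`, `split_into_chunks` and `is_sorted_on_all_timelines` are hypothetical names, and grouping on the full datatype map is a simplification that is stricter than the third invariant requires.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified stand-in for a pending row (not the real `PendingRow`).
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub entity_path: String,
    /// Timeline name -> time value.
    pub timepoint: BTreeMap<String, i64>,
    /// Component name -> datatype name.
    pub datatypes: BTreeMap<String, String>,
}

/// Splits a flushed batch into groups of rows that satisfy the four invariants.
pub fn split_into_chunks(rows: Vec<Row>, max_rows_if_unsorted: usize) -> Vec<Vec<Row>> {
    // Invariants 1-3: one entity path, one set of timelines, one datatype per component.
    type Key = (String, BTreeSet<String>, BTreeMap<String, String>);

    let mut groups: BTreeMap<Key, Vec<Row>> = BTreeMap::new();
    for row in rows {
        let key: Key = (
            row.entity_path.clone(),
            row.timepoint.keys().cloned().collect(),
            row.datatypes.clone(),
        );
        groups.entry(key).or_default().push(row);
    }

    let mut chunks = Vec::new();
    for group in groups.into_values() {
        if is_sorted_on_all_timelines(&group) {
            // Sorted groups are not subject to the row cap.
            chunks.push(group);
        } else {
            // Invariant 4: cap the row count when at least one timeline is unsorted.
            chunks.extend(
                group
                    .chunks(max_rows_if_unsorted.max(1))
                    .map(|piece| piece.to_vec()),
            );
        }
    }
    chunks
}

/// All rows in `rows` share the same timeline set, so indexing by name cannot fail.
fn is_sorted_on_all_timelines(rows: &[Row]) -> bool {
    rows.windows(2).all(|w| {
        w[0].timepoint
            .iter()
            .all(|(timeline, time)| *time <= w[1].timepoint[timeline])
    })
}
```

Rows keep their insertion order within each group, so a group is only cut into smaller pieces when it actually arrived out of order on some timeline.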

Fixes #4431 (Batcher should sort data in addition to batching)

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

jleibs · 2024-05-30T16:16:34Z

crates/re_chunk/src/batcher.rs

+        flush_tick: Duration::MAX,
+        flush_num_bytes: u64::MAX,
+        flush_num_rows: u64::MAX,
+        max_chunk_rows_if_unsorted: 256,


The interaction between this and "never" seems a bit odd. Is this also considered one of the built-in invariants?

teh-cmc · 2024-05-31

Yes -- in general, only the global time and space thresholds trigger batching; everything else just splits the result further down into smaller pieces.
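The distinction between batching thresholds and splitting rules can be sketched as two separate decisions. The four field names and values below come from the diff above; the struct name `BatcherConfig`, the constant name `NEVER`, and both methods are assumptions made for illustration only.

```rust
use std::time::Duration;

/// Hypothetical mirror of the four settings shown in the diff.
#[derive(Debug, Clone, Copy)]
pub struct BatcherConfig {
    pub flush_tick: Duration,
    pub flush_num_bytes: u64,
    pub flush_num_rows: u64,
    pub max_chunk_rows_if_unsorted: u64,
}

impl BatcherConfig {
    /// Same values as the diff: in practice no threshold is ever reached.
    pub const NEVER: Self = Self {
        flush_tick: Duration::MAX,
        flush_num_bytes: u64::MAX,
        flush_num_rows: u64::MAX,
        max_chunk_rows_if_unsorted: 256,
    };

    /// Batching decision: only the global space and time thresholds are consulted.
    pub fn should_flush(&self, rows: u64, bytes: u64, since_last_flush: Duration) -> bool {
        rows >= self.flush_num_rows
            || bytes >= self.flush_num_bytes
            || since_last_flush >= self.flush_tick
    }

    /// Splitting decision: how many chunks a flushed group of `num_rows` rows becomes.
    pub fn num_chunks(&self, num_rows: u64, is_sorted: bool) -> u64 {
        if is_sorted || num_rows == 0 {
            // Sorted data is not capped; an empty group yields no chunk.
            return num_rows.min(1);
        }
        num_rows.div_ceil(self.max_chunk_rows_if_unsorted.max(1))
    }
}
```

Under this reading, `max_chunk_rows_if_unsorted` never causes a flush on its own, even with the "never" settings. It only bounds the size of unsorted chunks whenever a flush does occur.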

jleibs · 2024-05-30T16:25:49Z

crates/re_chunk/src/batcher.rs

+                            config(&acc.pending_rows);
+                        }
+
+                        if acc.pending_rows.len() as u64 >= config.flush_num_rows {


A nice side-effect of this refactor is the potential for per-entity-flushing config in the future.

This new and improved `re_format_arrow` ™️ brings two major improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or batches), i.e. a `Schema` and a `Chunk`. In particular: chunk-level and field-level schema metadata will now be rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in, while making sure to still show just enough data.

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix` example:
```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

before (`main`):
![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3)

and after:
![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741)

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

…6438)

Introduces the new `re_chunk` crate:
> A chunk of Rerun data, encoded using Arrow. Used for logging, transport, storage and compute.

Specifically, it introduces the `Chunk` type itself, and all methods and helpers related to sorting.
A `Chunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

There are a lot of things that need to be sorted within a `Chunk`, and as such we must make sure to keep track of what is or isn't sorted at all times, to avoid needlessly re-sorting things every time a chunk changes hands.
This necessitates a bunch of sanity checking all over the place to make sure we never end up in undefined states.

`Chunk` is not about transport; it's about providing a nice-to-work-with representation when manipulating a chunk in memory.
Transporting a `Chunk` happens in the next PR.

- Fixes #1981
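As an illustration of the "keep track of what is or isn't sorted" idea, here is a minimal sketch of a time column that caches its sortedness, re-sorts only when needed, and can sanity-check the cached flag. `TimeColumn` and all of its methods are hypothetical names, not the `Chunk` API.

```rust
/// Hypothetical, heavily simplified stand-in for a chunk: one time column plus a cached flag.
#[derive(Debug, Clone)]
pub struct TimeColumn {
    times: Vec<i64>,
    is_sorted: bool,
}

impl TimeColumn {
    /// Computes the sortedness flag once, at construction time.
    pub fn new(times: Vec<i64>) -> Self {
        let is_sorted = times.windows(2).all(|w| w[0] <= w[1]);
        Self { times, is_sorted }
    }

    /// Appends a value and updates the cached flag in O(1).
    pub fn push(&mut self, time: i64) {
        if let Some(&last) = self.times.last() {
            self.is_sorted &= last <= time;
        }
        self.times.push(time);
    }

    pub fn is_sorted(&self) -> bool {
        self.is_sorted
    }

    /// Sorts only if the cached flag says it is needed; returns whether any work was done.
    pub fn sort_if_unsorted(&mut self) -> bool {
        if self.is_sorted {
            return false;
        }
        self.times.sort_unstable();
        self.is_sorted = true;
        true
    }

    /// Recomputes sortedness from scratch and compares it with the cached flag.
    pub fn sanity_check(&self) -> Result<(), String> {
        let actual = self.times.windows(2).all(|w| w[0] <= w[1]);
        if actual == self.is_sorted {
            Ok(())
        } else {
            Err(format!("cached is_sorted={} but actual={actual}", self.is_sorted))
        }
    }
}
```

The cached flag is what makes handing the data around cheap: a receiver can call `sort_if_unsorted` unconditionally and pay nothing when the data is already in order.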

A `TransportChunk` is a `Chunk` that is ready for transport and/or storage.
It is very cheap to go from a `Chunk` to a `TransportChunk` and vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a stable ABI, and can be cheaply sent across process boundaries.
`arrow2` has no `RecordBatch` type; we will get one once we migrate to `arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate Rerun-specific semantics over the wire, e.g. whether some columns are already properly sorted.

The Arrow metadata system is fairly limited (it's all untyped strings), but for now that seems good enough. It will be trivial to switch to something else later, if need be.

- Fixes #1760
- Fixes #1692
- Fixes #3360
- Fixes #1696
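Since Arrow metadata is just string key/value pairs, communicating something like sortedness boils down to encoding a flag as a string and parsing it back defensively. The sketch below assumes a plain `BTreeMap<String, String>` as the metadata container; the key name and both helper functions are hypothetical and not taken from the PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical metadata key; the real key names are not given in this thread.
pub const KEY_IS_SORTED: &str = "example.is_sorted";

/// Untyped string metadata, as attached to an Arrow schema or field.
pub type Metadata = BTreeMap<String, String>;

/// Sender side: record whether the column is already sorted.
pub fn encode_is_sorted(metadata: &mut Metadata, is_sorted: bool) {
    metadata.insert(KEY_IS_SORTED.to_owned(), is_sorted.to_string());
}

/// Receiver side: a missing or malformed value is conservatively treated as "unsorted".
pub fn decode_is_sorted(metadata: &Metadata) -> bool {
    metadata
        .get(KEY_IS_SORTED)
        .and_then(|value| value.parse::<bool>().ok())
        .unwrap_or(false)
}
```

Falling back to "unsorted" keeps the receiver correct when the marker is absent, at the cost of a possibly redundant sort.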

Integrate the new chunk batcher in all SDKs, and get rid of the old one.

On the backend, we make sure to deserialize incoming chunks into the old `DataTable`s, so business can continue as usual.

Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷‍♂️:
```bash
# this branch
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.499 s ±  0.117 s    [User: 5.544 s, System: 1.836 s]
  Range (min … max):    4.226 s …  4.640 s    15 runs

# main
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.407 s ±  0.773 s    [User: 8.423 s, System: 0.880 s]
  Range (min … max):    2.997 s …  6.148 s    15 runs
```

Notice the massive difference in user time (5.544 s on this branch vs. 8.423 s on `main`).

teh-cmc added the 📉 performance, do-not-merge, include in changelog, 🔩 data model, and 🪵 Log & send APIs labels May 27, 2024

teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 07fdd69 to 9d0d4bc Compare May 27, 2024 15:39

This was referenced May 27, 2024

Client-side chunks 0: improved arrow chunk formatters #6437

Merged

Client-side chunks 1: introduce Chunk and its suffle/sort routines #6438

Merged

Client-side chunks 2: introduce TransportChunk #6439

Merged

Client-side chunks 4: integrations #6441

Merged

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from ced515c to 8342ebb Compare May 27, 2024 16:45

teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 9d0d4bc to 69895e0 Compare May 27, 2024 16:45

teh-cmc marked this pull request as ready for review May 27, 2024 16:53

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 8342ebb to 4a9b5cd Compare May 29, 2024 07:33

teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 69895e0 to 7b87a78 Compare May 29, 2024 07:35

jleibs approved these changes May 30, 2024

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 4a9b5cd to 05cdde7 Compare May 31, 2024 07:46

teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 7b87a78 to 08999d7 Compare May 31, 2024 07:56

teh-cmc removed the do-not-merge Do not merge this PR label May 31, 2024

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 05cdde7 to 3be1f77 Compare May 31, 2024 08:42

Base automatically changed from cmc/dense_chunks_2_transport to main May 31, 2024 08:42

teh-cmc added 2 commits May 31, 2024 10:44

implement chunk micro-batcher

f551246

review

22f7e61

teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 08999d7 to 22f7e61 Compare May 31, 2024 08:44

teh-cmc merged commit fde4a87 into main May 31, 2024
27 of 28 checks passed

teh-cmc deleted the cmc/dense_chunks_3_batching branch May 31, 2024 08:46
