Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273

Wumpf · 2023-11-20T13:56:43Z

What

Fixes Don't go to arrow IPC and back on C++ -> Rust, use C FFI instead #4255
Depends on Use different C++ build directories for debug/release and roundtrips #4282

So far we've been using the arrow IPC format to hand data from C++ to C/Rust. It turned out that this incurred a major performance cost on our api. Instead, using the C FFI interface (which still isn't exactly free) gives us a major performance boost in all of today's logging benchmarks:

large point cloud: 0.58s -> 0.15s
many points: 17.10 -> 7.52s
large images: 2.90s -> 0.57s
Execute times without prepare, single sample on before, median of three on after, all ran in a single process

for comparison, numbers on main for Rust:

large point clouds: 0.82s
many points: 3.87s
large images: 1.00s
Execute times without prepare via Puffin, single sample, all ran in a single process. As always these comparisons are very tricky. Also there's lots of noise!

There's still some significant discrepancy on many points. One likely source of this is repeated schema transfer. Need to do some more profiling before continuing.

This PR also simplifies the serialization pipeline a little bit by removing SerializedComponentBatch, replacing it in favor of the existing rerun::DataCell in a few places. Left a comment in the code on the next steps towards evolving it towards an interface that's more similar to how things work in Rust.

Checklist

I have read and agree to Contributor Guide and the Code of Conduct
I've included a screenshot or gif (if applicable)
I have tested demo.rerun.io (if applicable)
The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG

crates/rerun_c/src/lib.rs

teh-cmc

LGTM, but can't we statically allocate schemas and pass static pointers around?

rerun_cpp/src/rerun/data_cell.hpp

teh-cmc · 2023-11-21T08:29:20Z

crates/rerun_c/src/rerun.h

+    /// TODO(andreas): Use a schema registry that identifies this and the component name with a unique schema ID.
+    ArrowSchema schema;


Can't we just statically allocate the schema and always pass the same pointer to it in here? Why do we need a registry?

see comment below: I want to also remove the Rust sided cost of FFI schema parsing & building. Also practically speaking we'd still need to allocate it somewhere since we don't get this schema allocated by libarrow, it only fills out a struct (which in turn holds on to allocs)

crates/rerun_c/src/rerun.h

teh-cmc · 2023-11-21T08:34:53Z

crates/rerun_c/src/lib.rs

+    /// TODO(andreas): Use a schema registry.
+    pub schema: arrow2::ffi::ArrowSchema,


Same comment here; I'd expect this to be a static pointer?

I was actually planning to have the registry side over on Rust and give out a handle: Arrow2 doesn't take ownership of the ArrowSchema and instead builds up a new Rust data structure. So I figure if we cache schemas then let's go all the way.
I'll do so in a follow-up PR

Wumpf · 2023-11-21T09:00:58Z

but can't we statically allocate schemas and pass static pointers around?

Yes, I wanted to do that as a follow-up! :)

…atch the desired datatype (#4280) ### What * Prerequisite for #4273 Previously, not knowing about `arrow::MakeBuilder`, we created the hierarchy of arrow array builders manually in a utility method dubbed `new_arrow_array_builder`. While working on C FFI it turned out that this way we end up with an arrow array whose data type doesn't have the correct nullability settings - in other words, on creating the arrow array builders we would have needed to pass the exact data types. This is what I tried at first and it worked, but then I noticed that there's already `arrow::MakeBuilder` which internally uses a bunch of rolled out switch/case statements to create the hierarchy programmatically. It turns out that this in combination with the work done on switching from IPC to C FFI gives another significant speed-up. When tested in isolation however (i.e. still on IPC), the gains seem to be rather modest and within the (large!) measurement noise. ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4280) (if applicable) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG - [PR Build Summary](https://build.rerun.io/pr/4280) - [Docs preview](https://rerun.io/preview/030bafa44afbd77f510d15db7c3d254386f1aae9/docs)  - [Examples preview](https://rerun.io/preview/030bafa44afbd77f510d15db7c3d254386f1aae9/examples)  - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

…ing a component type registry (#4296) ### What * Fixes #4287 * Follow-up to #4273 As expected, not doing the C++ datatype -> C FFI schema -> Rust datatype roundtrip for each log call helps perf quite a bit, especially when we do a lot of smaller log calls. The registry a single RwLock protected Vec (we never deregister) which is exposed via a single c entry point. On the C++ side we use the local `static` variable mechanism for threadsafe lazy registration (slight codegen adjustment). Indicator components had some special handling before and were refactored to fit in this system - in the process I made their arrow array shared across all instantiations, further cutting down on per-log work. --- Benchmark results: * large point cloud: `0.15s` -> `0.14s` * many points: `7.52s` -> `4.52s` * large images: `0.57s` -> `0.51s` Old values from previous PR. New values are median over three runs, single executable run (this makes more and more of a difference with all these registries!), timings without prepare step, same M1 macbook. A quick look over the profiler for running `log_benchmark points3d_many_individual` in isolation tells us that of the actual benchmark running time we spend.. * 35% of the the time in `rr_recording_stream_log` (of which in turn 20%, so 7% overall, is still arrow FFI translation of the array!!) * 30% in the various `to_data_cell` methods * 10% in exporting arrow arrays to C FFI * 6% in setting the time * the rest in various allocations along the way (taken via `Instruments` on my Mac) <img width="969" alt="image" src="https://github.com/rerun-io/rerun/assets/1220815/5632589f-52b1-4e92-b7a0-1482e69528ad"> --- ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4296) (if applicable) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG - [PR Build Summary](https://build.rerun.io/pr/4296) - [Docs preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/docs)  - [Examples preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/examples)  - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

Wumpf added 🚀 performance Optimization, memory use, etc 🌊 C++ API C/C++ API specific include in changelog labels Nov 20, 2023

Wumpf commented Nov 20, 2023

View reviewed changes

crates/rerun_c/src/lib.rs Outdated Show resolved Hide resolved

Wumpf marked this pull request as draft November 20, 2023 14:41

Wumpf mentioned this pull request Nov 20, 2023

Use arrow::MakeBuilder to create arrow array builder that exactly match the desired datatype #4280

Merged

4 tasks

Wumpf force-pushed the andreas/cpp/use-c-ffi branch from ca6a0cf to b54b195 Compare November 20, 2023 17:54

Wumpf changed the base branch from main to andreas/cpp/fix-arrow-array-datatypes November 20, 2023 17:54

Wumpf marked this pull request as ready for review November 20, 2023 17:59

teh-cmc self-requested a review November 21, 2023 08:18

teh-cmc approved these changes Nov 21, 2023

View reviewed changes

Base automatically changed from andreas/cpp/fix-arrow-array-datatypes to main November 21, 2023 09:06

Wumpf added 11 commits November 21, 2023 10:09

Working prototype of C FFI

968c867

remove C++ ipc handling and arrow. Build arrow without ipc

12e0825

Merge SerializedComponentBatch into DataCell

a1f5536

remove datatype from data cell

9fc829b

re-enable disabled tests

1ae5ad8

comment gaps, rename arrow abi

d3c70e8

clang format arrow data interface header

11b9c0d

vscode debug instruction for the minimal example

9374ce4

clarify & document memory ownership of passed arrow C FFI structs

b6cb076

fix fallthrough warning that suddenly popped up (why now?)

2f41a6d

comment fixes

cb476d8

Wumpf force-pushed the andreas/cpp/use-c-ffi branch from e84f1b7 to cb476d8 Compare November 21, 2023 09:10

Wumpf merged commit b5918ff into main Nov 21, 2023
42 checks passed

Wumpf deleted the andreas/cpp/use-c-ffi branch November 21, 2023 09:59

This was referenced Nov 21, 2023

Speed up C++ logging for many individual log calls #4287

Closed

Further improve C++ logging for many individual log calls by introducing a component type registry #4296

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273

Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273

Wumpf commented Nov 20, 2023 •

edited by github-actions bot

Loading

teh-cmc left a comment

teh-cmc Nov 21, 2023

Wumpf Nov 21, 2023

teh-cmc Nov 21, 2023

Wumpf Nov 21, 2023

Wumpf commented Nov 21, 2023

		/// TODO(andreas): Use a schema registry that identifies this and the component name with a unique schema ID.
		ArrowSchema schema;

		/// TODO(andreas): Use a schema registry.
		pub schema: arrow2::ffi::ArrowSchema,

Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273

Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273

Conversation

Wumpf commented Nov 20, 2023 • edited by github-actions bot Loading

What

Checklist

teh-cmc left a comment

Choose a reason for hiding this comment

teh-cmc Nov 21, 2023

Choose a reason for hiding this comment

Wumpf Nov 21, 2023

Choose a reason for hiding this comment

teh-cmc Nov 21, 2023

Choose a reason for hiding this comment

Wumpf Nov 21, 2023

Choose a reason for hiding this comment

Wumpf commented Nov 21, 2023

Wumpf commented Nov 20, 2023 •

edited by github-actions bot

Loading