RFC - Split serialization logic from writing logic to make concurrent writes faster #326
Conversation
Pull Request Overview
This PR adds functionality to support parallel serialization of Avro data by introducing a two-phase write pattern. Users can now serialize data in parallel threads using serialize_ser, then write the pre-serialized buffers sequentially using extend_avro_serialized_buffer.
- Introduces `AvroSerializedBuffer` struct to hold pre-serialized data
- Adds `serialize_ser` method for thread-safe serialization without mutation
- Adds `extend_avro_serialized_buffer` method to write pre-serialized buffers
```rust
pub struct AvroSerializedBuffer {
    buffer: Vec<u8>,
    num_values: usize,
}
```
Copilot AI, Oct 29, 2025
The new public type `AvroSerializedBuffer` should be exported in `avro/src/lib.rs` in the `pub use writer::` block (around line 911) to make it accessible to library consumers. Currently users cannot import this type even though it is returned by the public `serialize_ser` method.
I think this can be useful to have. However, I don't entirely agree with the design. Maybe something more like this?

```rust
pub struct AvroSerializedBuffer<'a> {
    schema: &'a Schema,
    // We don't want this to be a reference to the writer, as that would block any writing,
    // but cloning this can be expensive (there's a Vec inside).
    resolved_schema: ResolvedSchema<'a>,
    codec: Codec,
    num_values: usize,
    // We want to do compression in the buffer, not in the writer, but want to do it as late
    // as possible for maximum compression. So we need to track what we have already compressed.
    compressed_up_to: usize,
    buffer: Vec<u8>,
}
```

Then we can compare the schema, resolved schema, and codec when adding it back to the writer. This setup would also allow one to reset the buffer so the allocation can be reused.
Working on adding a better ability to concurrently process Avro data: this moves the heavy work of serializing and compressing so that it no longer requires a `mut` reference. This allows users to wrap the serialization logic in whatever async/processing runtime they want, without introducing any opinions or bloat.

- `serialize_ser` does the heavy lifting, returning a private struct.
- `extend_avro_serialized_buffer` accepts this struct, writing it directly to the writer and performing all Avro bookkeeping.

Let me know what you think :) It's a little messy factoring-wise, but I think the abstraction is fairly reasonable, effective, and safe.