Conversation

@brent-statsig

Working on adding better support for concurrently processing Avro: this moves the heavy work of serializing and compressing out of code that requires a `&mut` borrow. That lets users wrap the serialization logic in whatever async/processing runtime they want, without the crate introducing any opinions or bloat.

  • serialize_ser does the heavy lifting, returning a private struct
  • extend_avro_serialized_buffer accepts this struct, writing it directly to the writer and performing all avro bookkeeping.

Let me know what you think :) it's a little messy factoring-wise, but I think the abstraction is fairly reasonable, effective, and safe.
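To make the two-phase flow concrete, here's a minimal std-only mock of the pattern described above (the struct and method names mirror the PR, but the actual Avro encoding, compression, and bookkeeping are elided, so treat this as a sketch rather than the real API):

```rust
use std::thread;

// Stand-in for the private struct returned by `serialize_ser`
// (real Avro encoding/compression is skipped in this mock).
struct AvroSerializedBuffer {
    buffer: Vec<u8>,
    num_values: usize,
}

// Phase 1: pure serialization work, no `&mut` writer needed,
// so it can run on any thread or async executor.
fn serialize_batch(records: &[u32]) -> AvroSerializedBuffer {
    let mut buffer = Vec::new();
    for r in records {
        buffer.extend_from_slice(&r.to_le_bytes());
    }
    AvroSerializedBuffer { buffer, num_values: records.len() }
}

// Phase 2: the only step that mutates the writer — append the
// pre-serialized buffers sequentially.
struct MockWriter {
    out: Vec<u8>,
    total_values: usize,
}

impl MockWriter {
    fn extend_avro_serialized_buffer(&mut self, b: AvroSerializedBuffer) {
        self.out.extend_from_slice(&b.buffer);
        self.total_values += b.num_values;
    }
}

fn run() -> (usize, usize) {
    let batches = vec![vec![1u32, 2, 3], vec![4, 5]];
    // Serialize batches in parallel threads.
    let handles: Vec<_> = batches
        .into_iter()
        .map(|b| thread::spawn(move || serialize_batch(&b)))
        .collect();
    // Drain the results into the writer one at a time.
    let mut writer = MockWriter { out: Vec::new(), total_values: 0 };
    for h in handles {
        writer.extend_avro_serialized_buffer(h.join().unwrap());
    }
    (writer.total_values, writer.out.len())
}

fn main() {
    let (values, bytes) = run();
    println!("{values} values, {bytes} bytes");
}
```

The point of the split is that only phase 2 touches the writer, so all the expensive work in phase 1 parallelizes freely.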

@martin-g requested a review from Copilot on October 29, 2025

Copilot AI left a comment


Pull Request Overview

This PR adds functionality to support parallel serialization of Avro data by introducing a two-phase write pattern. Users can now serialize data in parallel threads using serialize_ser, then write the pre-serialized buffers sequentially using extend_avro_serialized_buffer.

  • Introduces AvroSerializedBuffer struct to hold pre-serialized data
  • Adds serialize_ser method for thread-safe serialization without mutation
  • Adds extend_avro_serialized_buffer method to write pre-serialized buffers

Comment on lines +66 to +69
pub struct AvroSerializedBuffer {
buffer: Vec<u8>,
num_values: usize,
}

Copilot AI Oct 29, 2025

The new public type AvroSerializedBuffer should be exported in avro/src/lib.rs in the pub use writer:: block (around line 911) to make it accessible to library consumers. Currently users cannot import this type even though it's returned by the public serialize_ser method.
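A sketch of the suggested re-export (assuming `AvroSerializedBuffer` stays public and the `pub use writer::` block exists as described; the exact line placement in `avro/src/lib.rs` may differ):

```rust
pub use writer::AvroSerializedBuffer;
```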

Kriskras99 commented Oct 29, 2025

I think this can be useful to have. However, I don't entirely agree with the design.
I think the AvroSerializedBuffer should have its own append function. As it stands, it's also relatively easy to create a buffer with one schema and then use it with another. It would also need functions for appending Values.

Maybe more something like this?:

pub struct AvroSerializedBuffer<'a> {
    schema: &'a Schema,
    // We don't want this to be a reference to the writer, as that would block any writing,
    // But cloning this can be expensive (there's a Vec inside)
    resolved_schema: ResolvedSchema<'a>,
    codec: Codec,
    num_values: usize,
    // We want to do compression in the buffer not in the writer, but want to do it as late as possible
    // for maximum compression. So we need to track what we already have compressed.
    compressed_up_to: usize,
    buffer: Vec<u8>,
}

Then we can compare the schema, resolved schema and codec when adding it back to the writer. This setup would also allow one to reset the buffer so the allocation can be reused.
@martin-g what do you think?
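A rough std-only mock of how that proposed design could behave (names like `MockWriter` and `SchemaId` are hypothetical stand-ins for the real `Schema`/`ResolvedSchema`/`Codec` comparison; compression bookkeeping is stubbed out):

```rust
// Stand-in for comparing schemas/codecs; the real crate would
// compare Schema, ResolvedSchema, and Codec instead.
#[derive(PartialEq, Clone, Debug)]
struct SchemaId(u64);

struct AvroSerializedBuffer {
    schema: SchemaId,
    num_values: usize,
    // Tracks how far compression has run (unused in this mock).
    compressed_up_to: usize,
    buffer: Vec<u8>,
}

impl AvroSerializedBuffer {
    fn new(schema: SchemaId) -> Self {
        Self { schema, num_values: 0, compressed_up_to: 0, buffer: Vec::new() }
    }

    // Appending lives on the buffer itself, not the writer.
    fn append(&mut self, encoded: &[u8]) {
        self.buffer.extend_from_slice(encoded);
        self.num_values += 1;
    }

    // Clearing rather than dropping lets the allocation be reused.
    fn reset(&mut self) {
        self.buffer.clear();
        self.num_values = 0;
        self.compressed_up_to = 0;
    }
}

struct MockWriter {
    schema: SchemaId,
    values_written: usize,
}

impl MockWriter {
    // Refuse buffers that were built against a different schema.
    fn extend(&mut self, buf: &mut AvroSerializedBuffer) -> Result<(), String> {
        if buf.schema != self.schema {
            return Err("schema mismatch".to_string());
        }
        self.values_written += buf.num_values;
        buf.reset();
        Ok(())
    }
}

fn run() -> (usize, bool) {
    let mut writer = MockWriter { schema: SchemaId(1), values_written: 0 };

    let mut ok_buf = AvroSerializedBuffer::new(SchemaId(1));
    ok_buf.append(&[1, 2]);
    ok_buf.append(&[3]);
    writer.extend(&mut ok_buf).unwrap();

    // A buffer built for a different schema is rejected.
    let mut bad_buf = AvroSerializedBuffer::new(SchemaId(2));
    bad_buf.append(&[9]);
    let rejected = writer.extend(&mut bad_buf).is_err();

    (writer.values_written, rejected)
}

fn main() {
    let (written, rejected) = run();
    println!("{written} values written, mismatch rejected: {rejected}");
}
```

The schema check on handoff is what closes the "wrong schema" hole, and `reset` is what enables reusing the allocation across blocks.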
