RFC - Split serialization logic from writing logic to make concurrent writes faster #326
Conversation
Pull Request Overview
This PR adds functionality to support parallel serialization of Avro data by introducing a two-phase write pattern. Users can now serialize data in parallel threads using serialize_ser, then write the pre-serialized buffers sequentially using extend_avro_serialized_buffer.
- Introduces `AvroSerializedBuffer` struct to hold pre-serialized data
- Adds `serialize_ser` method for thread-safe serialization without mutation
- Adds `extend_avro_serialized_buffer` method to write pre-serialized buffers
```rust
pub struct AvroSerializedBuffer {
    buffer: Vec<u8>,
    num_values: usize,
}
```
Copilot AI, Oct 29, 2025
The new public type `AvroSerializedBuffer` should be exported in `avro/src/lib.rs` in the `pub use writer::` block (around line 911) to make it accessible to library consumers. Currently users cannot import this type even though it is returned by the public `serialize_ser` method.
I think this can be useful to have. However, I don't entirely agree with the design. Maybe something more like this?

```rust
pub struct AvroSerializedBuffer<'a> {
    schema: &'a Schema,
    // We don't want this to be a reference to the writer, as that would block any writing,
    // but cloning this can be expensive (there's a Vec inside).
    resolved_schema: ResolvedSchema<'a>,
    codec: Codec,
    num_values: usize,
    // We want to do compression in the buffer, not in the writer, but want to do it as late
    // as possible for maximum compression. So we need to track what we have already compressed.
    compressed_up_to: usize,
    buffer: Vec<u8>,
}
```

Then we can compare the schema, resolved schema, and codec when adding it back to the writer. This setup would also allow one to reset the buffer so the allocation can be reused.
Working on adding a better ability to concurrently process Avro data: this moves the heavy work of serializing and compressing so that it no longer requires a `mut` reference. This allows users to wrap the serialization logic in whatever async/processing runtime they want, without introducing any opinions or bloat.

- `serialize_ser` does the heavy lifting, returning a private struct.
- `extend_avro_serialized_buffer` accepts this struct, writing it directly to the writer and performing all Avro bookkeeping.

Let me know what you think :) It's a little messy factoring-wise, but I think the abstraction is fairly reasonable, effective, and safe.