Refactor arrow-avro Decoder to support partial decoding
#8100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
scovich left a comment:
Overall looks nice. Some possible simplifications, but my main question/concern is about zero-length records causing zero bytes consumed. The code seems to keep going back and forth on whether that's allowed/possible?
arrow-avro/src/reader/mod.rs (outdated)
    Ok(n) if n > 0 => {
        self.remaining_capacity -= 1;
        total_consumed += n;
        self.awaiting_body = false;
Is there always a fingerprint after each record? Or just a chance to see a fingerprint?
There's always a magic + fingerprint prefix at the start of each single-object encoded record.
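For reference, per the Avro spec that prefix is the two-byte marker 0xC3 0x01 followed by the 8-byte little-endian CRC-64-AVRO schema fingerprint. A minimal sketch of validating it (the function name and error text here are illustrative, not the PR's code):

    // Sketch: parse the 10-byte single-object prefix (2-byte magic + 8-byte
    // little-endian schema fingerprint). Returns Ok(None) when fewer than 10
    // bytes are buffered, i.e. the input is incomplete rather than invalid.
    const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

    fn parse_prefix(buf: &[u8]) -> Result<Option<(u64, usize)>, String> {
        if buf.len() < 10 {
            return Ok(None); // need more bytes
        }
        if buf[..2] != SINGLE_OBJECT_MAGIC {
            return Err("invalid single-object magic".to_string());
        }
        let fingerprint = u64::from_le_bytes(buf[2..10].try_into().unwrap());
        Ok(Some((fingerprint, 10))) // (fingerprint, bytes consumed)
    }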
arrow-avro/src/reader/mod.rs (outdated)
    return Err(ArrowError::ParseError(
        "Record decoder consumed 0 bytes".into(),
I thought zero-byte records were legal, and we're supposed to keep looping until the output batch is full?
I didn't think they were legal for single object encodings, but sure enough they are. So I'll remove this. Should have re-read the specs lol.
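For anyone following along, one legal zero-byte case is a record schema with no fields (illustrative snippet, not from the PR):

    fn main() {
        // A record schema with no fields: each record body occupies zero bytes
        // in the Avro binary encoding, so a successful decode can consume 0
        // bytes. The decode loop must measure progress in records, not bytes.
        let empty_record_schema = r#"{"type": "record", "name": "Empty", "fields": []}"#;
        println!("{empty_record_schema}");
    }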
    // Decode either the block count or the remaining capacity from `data` (an OCF block payload).
    //
    // Returns the number of bytes consumed from `data` along with the number of records decoded.
    fn decode_block(&mut self, data: &[u8], count: usize) -> Result<(usize, usize), ArrowError> {
Two same-type return values with very different meanings... is it worth defining a struct for it so they have names? Or are there few enough (and always internal) callers to keep track of it?
That occurred to me as well. I decided to leave it that way because the only caller is Reader::read and decode_block is not public.
As an aside, I was unsure of the value a public decode_block method would offer, since block encodings only exist in Object Container Files, which the Reader handles. If there's demand in the future for decoding blocks outside of the Reader, then we'd probably want to refactor the code to support Decoder::decode_block(block: Block, codec: Option<CompressionCodec>) -> Result<DecodeRes, ArrowError> or something along those lines, like you pointed out. I was just concerned this would be a premature optimization.
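For illustration, the named-return-value variant would look something like this (field names are hypothetical, not from the PR):

    // Hypothetical struct giving names to decode_block's two same-typed
    // return values, per the suggestion above.
    struct BlockDecodeResult {
        /// Bytes consumed from the block payload in `data`.
        bytes_consumed: usize,
        /// Records decoded into the in-progress batch.
        records_decoded: usize,
    }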
Ty! Only reason I added that back in was I didn't think zero-byte encodings were legal for single-object encodings, but it seems I was wrong and they are. So I'll push up a change to account for that.
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
@scovich I went ahead and pushed up changes that remove the zero-byte error on single-object encodings and improve the decoding logic. Let me know what you think.
scovich left a comment:
Code is good now, except I'm worried about robustness in the face of invalid input bytes.
arrow-avro/src/reader/mod.rs (outdated)
    fn is_incomplete_data(err: &ArrowError) -> bool {
        matches!(
            err,
            ArrowError::ParseError(msg)
                if msg.contains("Unexpected EOF")
                    || msg.contains("bad varint")
                    || msg.contains("offset overflow")
Hmm, after thinking more about this over the weekend --
Trying to interpret/suppress these errors will almost certainly make the decoder brittle in the face of malformed input bytes that legitimately trigger these errors. For example, we could put the decoder in an infinite loop where it keeps trying to fetch more and more bytes in hopes of eliminating the error, when the error is fully contained in the existing buffer.
That's fair. I planned to clean this up in the Schema Resolution RecordDecoder PR that's coming after #8047 gets merged in.
I ended up modifying is_incomplete_data to this:
    fn is_incomplete_data(err: &ArrowError) -> bool {
        matches!(
            err,
            ArrowError::ParseError(msg)
                if msg.contains("Unexpected EOF")
        )
    }
I double-checked the arrow-avro/src/reader/cursor.rs file, and we should only get the Unexpected EOF error if there are too few bytes.
I'll also improve the logic in arrow-avro/src/reader/cursor.rs to support a more deliberate and less rigid implementation in a future PR. I left a comment in the code calling this out.
Let me know what you think of this approach.
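For context, the call-site shape this enables is roughly the following (a sketch with assumed names, not the exact loop from the PR):

    use arrow_schema::ArrowError;

    // Sketch: an EOF-style parse error means the buffer ended mid-record, so
    // report the bytes consumed so far and wait for more input; any other
    // error is genuinely malformed data and is surfaced immediately.
    fn decode_step(
        decode: impl FnOnce() -> Result<usize, ArrowError>,
        consumed_so_far: usize,
        is_incomplete_data: impl Fn(&ArrowError) -> bool,
    ) -> Result<usize, ArrowError> {
        match decode() {
            Ok(n) => Ok(consumed_so_far + n),
            Err(e) if is_incomplete_data(&e) => Ok(consumed_so_far), // need more bytes
            Err(e) => Err(e),
        }
    }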
arrow-avro/src/reader/mod.rs (outdated)
    }
    let batch = self.active_decoder.flush()?;
    self.remaining_capacity = self.batch_size;
    self.apply_pending_schema();
flush and flush_block are identical except for this call to self.apply_pending_schema?
Is there a way to deduplicate the code? Maybe a flush_internal that takes a boolean argument (which the compiler would aggressively inline away as if it were a generic parameter)?
Or just call self.apply_pending_schema unconditionally, knowing it should be a no-op during block decoding because self.pending_schema is always None?
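For concreteness, that boolean-argument variant might look roughly like this (a sketch built from method names already in this thread, not actual code from the PR):

    // Sketch: one flush path for both flavors; the schema-application step is
    // gated by a flag the compiler can constant-fold at the two call sites.
    fn flush_internal(&mut self, apply_schema: bool) -> Result<Option<RecordBatch>, ArrowError> {
        if self.batch_is_empty() {
            return Ok(None);
        }
        let batch = self.active_decoder.flush()?;
        self.remaining_capacity = self.batch_size;
        if apply_schema {
            self.apply_pending_schema(); // no-op during block decoding
        }
        Ok(Some(batch))
    }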
That's a good call out. I'll create a helper for:

    let batch = self.active_decoder.flush()?;
    self.remaining_capacity = self.batch_size;

I'm wanting to keep the schema change logic completely decoupled from the block decoder/flush path for now, just to avoid confusion for future contributors and to set us up for any future Decoder decomposition efforts.
I just pushed up changes which include a new flush_and_reset method:

    fn flush_and_reset(&mut self) -> Result<Option<RecordBatch>, ArrowError> {
        if self.batch_is_empty() {
            return Ok(None);
        }
        let batch = self.active_decoder.flush()?;
        self.remaining_capacity = self.batch_size;
        Ok(Some(batch))
    }

It abstracted out most of the flush logic, and flush_block is now just a public wrapper for that new method.
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Ty for the fast follow-up review. Let me know what you think of these latest changes!
scovich left a comment:
LGTM!
    }

    /// Returns true if the decoder has not decoded any batches yet.
    pub fn batch_is_empty(&self) -> bool {
nit: Not sure it needs to be pub? But I guess it's plenty well-defined (batch_is_empty is true if flush would return None), so no harm leaving it public?
I didn't see any harm in leaving it pub. Figured it may be useful for someone, plus batch_is_full was already pub as well, so I was trying to keep it consistent.
Ah, I didn't realize we already had a pub batch_is_full method. Makes sense!
Thanks for the review and code @jecsand838 and @scovich -- let's keep the train moving
Which issue does this PR close?
Rationale for this change
Decoding Avro single-object encoded streams was brittle when data arrived in partial chunks (e.g., from async or networked sources). The old implementation relied on ad‑hoc prefix handling and assumed a full record would be available, producing hard errors for otherwise normal “incomplete buffer” situations. Additionally, the Avro OCF (Object Container File) path iterated record‑by‑record through a shared row decoder, adding overhead.
This PR introduces a small state machine for single‑object decoding and a block‑aware path for OCF, making streaming more robust and OCF decoding more efficient while preserving the public API surface.
What changes are included in this PR?
Single‑object decoding (streaming)
- Replaced the ad‑hoc prefix handling (expect_prefix, handle_prefix, handle_fingerprint) with an explicit state machine: enum DecoderState { Magic, Fingerprint, Record, SchemaChange, Finished } (sketched below).
- Decoder now tracks state, bytes_remaining, and a fingerprint_buf to incrementally assemble the fingerprint.
- Added is_incomplete_data(&ArrowError) -> bool to treat "Unexpected EOF", "bad varint", and "offset overflow" as incomplete input instead of fatal errors.
- Reworked Decoder::decode(&[u8]) -> Result<usize, ArrowError> so the returned count reflects bytes actually consumed.
- Updated Decoder::flush() to emit a batch only when rows are ready and to transition the state correctly (including a staged SchemaChange).
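A condensed sketch of the state machine (the DecoderState variants are from this PR; the next() helper and its transition logic are an illustrative simplification, not the actual implementation):

    enum DecoderState {
        Magic,
        Fingerprint,
        Record,
        SchemaChange,
        Finished,
    }

    impl DecoderState {
        // Illustrative transitions once the current state's bytes are consumed.
        fn next(self, fingerprint_changed: bool) -> DecoderState {
            match self {
                // A stream is a sequence of (magic, fingerprint, record) frames.
                DecoderState::Magic => DecoderState::Fingerprint,
                // A new fingerprint stages a schema change, applied at flush.
                DecoderState::Fingerprint if fingerprint_changed => DecoderState::SchemaChange,
                DecoderState::Fingerprint => DecoderState::Record,
                DecoderState::Record => DecoderState::Magic,
                DecoderState::SchemaChange => DecoderState::Record,
                DecoderState::Finished => DecoderState::Finished,
            }
        }
    }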
OCF (Object Container File) decoding

- New block-aware methods on the Decoder used by Reader (see the sketch after this list):
  - decode_block(&[u8], count: usize) -> Result<(consumed, records_decoded), ArrowError>
  - flush_block() -> Result<Option<RecordBatch>, ArrowError>
- Reader now tracks block_count and decodes up to the number of records in the current block, reducing per‑row overhead and improving throughput.
- ReaderBuilder::build initializes the new block_count path.
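Given those signatures, the read loop over a block presumably looks something like this (illustrative glue, not the actual Reader::read; only the (bytes consumed, records decoded) return shape comes from the PR):

    use arrow_schema::ArrowError;

    // Sketch: drain one OCF block through decode_block until the block is
    // exhausted or the batch fills.
    fn drain_block(
        mut decode_block: impl FnMut(&[u8], usize) -> Result<(usize, usize), ArrowError>,
        data: &[u8],
        mut block_count: usize,
    ) -> Result<usize, ArrowError> {
        let mut offset = 0;
        while block_count > 0 {
            let (consumed, decoded) = decode_block(&data[offset..], block_count)?;
            offset += consumed;
            block_count -= decoded;
            if decoded == 0 {
                break; // batch is full; caller calls flush_block and resumes here
            }
        }
        Ok(offset)
    }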
API / struct adjustments

- Removed the expect_prefix flag from Decoder; behavior is driven by the state machine.
- ReaderBuilder::make_decoder_with_parts updated accordingly (no behavior change to public builder methods).
- No breaking changes to the public surface of Reader, Decoder, or ReaderBuilder.

Tests
- New tests: test_two_messages_same_schema, test_two_messages_schema_switch, test_split_message_across_chunks.
- Updated existing tests for the new state transitions (Magic → Fingerprint, etc.) and new error messages.

Are these changes tested?
Yes.
Reader now uses decode_block / flush_block, so the existing OCF tests exercise the new block path.

Are there any user-facing changes?
N/A