
Conversation

@jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

Decoding Avro single-object encoded streams was brittle when data arrived in partial chunks (e.g., from async or networked sources). The old implementation relied on ad‑hoc prefix handling and assumed a full record would be available, producing hard errors for otherwise normal “incomplete buffer” situations. Additionally, the Avro OCF (Object Container File) path iterated record‑by‑record through a shared row decoder, adding overhead.

This PR introduces a small state machine for single‑object decoding and a block‑aware path for OCF, making streaming more robust and OCF decoding more efficient while preserving the public API surface.

What changes are included in this PR?

Single‑object decoding (streaming)

  • Replace ad‑hoc prefix parsing (expect_prefix, handle_prefix, handle_fingerprint) with an explicit state machine:
    • New enum DecoderState { Magic, Fingerprint, Record, SchemaChange, Finished }.
    • Decoder now tracks state, bytes_remaining, and a fingerprint_buf to incrementally assemble the fingerprint.
  • New helper is_incomplete_data(&ArrowError) -> bool to treat “Unexpected EOF”, “bad varint”, and “offset overflow” as incomplete input instead of fatal errors.
  • Reworked Decoder::decode(&[u8]) -> Result<usize, ArrowError>:
    • Consumes data according to the state machine.
    • Cleanly returns when more bytes are needed (no spurious errors for partial chunks).
    • Defers schema switching until after flushing currently decoded rows.
  • Updated Decoder::flush() to emit a batch only when rows are ready and to transition the state correctly (including a staged SchemaChange).
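The state machine described above can be sketched as a tiny stand-alone example. Everything below is a hypothetical simplification for illustration: `MiniDecoder` is not the actual arrow-avro `Decoder`, only the `Magic`/`Fingerprint`/`Record` states are modeled, and body decoding plus the `SchemaChange`/`Finished` states are elided. The key property is that `decode` returns how many bytes it consumed, and returning early simply means "feed me more later":

```rust
#[derive(Debug, PartialEq)]
enum DecoderState {
    Magic,       // waiting for the 2-byte single-object marker 0xC3 0x01
    Fingerprint, // incrementally assembling the 8-byte schema fingerprint
    Record,      // ready to decode the record body (elided here)
}

struct MiniDecoder {
    state: DecoderState,
    fingerprint_buf: Vec<u8>,
}

impl MiniDecoder {
    fn new() -> Self {
        Self {
            state: DecoderState::Magic,
            fingerprint_buf: Vec::new(),
        }
    }

    /// Consume as much of `data` as the current state allows; the caller
    /// retains any unconsumed tail and re-presents it with the next chunk.
    fn decode(&mut self, data: &[u8]) -> usize {
        let mut consumed = 0;
        loop {
            match self.state {
                DecoderState::Magic => {
                    if data.len() - consumed < 2 {
                        return consumed; // not enough bytes yet: no error
                    }
                    // the real decoder would return a ParseError on a bad marker
                    assert_eq!(&data[consumed..consumed + 2], &[0xC3, 0x01]);
                    consumed += 2;
                    self.state = DecoderState::Fingerprint;
                }
                DecoderState::Fingerprint => {
                    let need = 8 - self.fingerprint_buf.len();
                    let take = need.min(data.len() - consumed);
                    self.fingerprint_buf
                        .extend_from_slice(&data[consumed..consumed + take]);
                    consumed += take;
                    if self.fingerprint_buf.len() < 8 {
                        return consumed; // partial fingerprint: resume later
                    }
                    self.state = DecoderState::Record;
                }
                DecoderState::Record => return consumed,
            }
        }
    }
}
```

Because unconsumed bytes stay with the caller, a chunk that ends mid-fingerprint is handled by simply calling `decode` again once more data arrives, with no "incomplete buffer" error in between.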

OCF (Object Container File) decoding

  • Add block‑aware decoding methods on Decoder used by Reader:
    • decode_block(&[u8], count: usize) -> Result<(consumed, records_decoded), ArrowError>
    • flush_block() -> Result<Option<RecordBatch>, ArrowError>
  • Reader now tracks block_count and decodes up to the number of records in the current block, reducing per‑row overhead and improving throughput.
  • ReaderBuilder::build initializes the new block_count path.
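For context on why a block-aware path can skip per-row bookkeeping: per the Avro spec, each OCF block is framed as two zigzag-varint longs (record count, then payload size), followed by the payload and a 16-byte sync marker, so the record count consumed by `decode_block` is available up front. A hypothetical header parser (not the arrow-avro API) illustrating that framing:

```rust
// Decode one zigzag-encoded varint long, returning (value, bytes_used).
fn read_zigzag_varint(data: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    for (i, &b) in data.iter().enumerate().take(10) {
        value |= ((b & 0x7F) as u64) << (7 * i);
        if b & 0x80 == 0 {
            // zigzag decode: (n >> 1) ^ -(n & 1)
            let decoded = ((value >> 1) as i64) ^ -((value & 1) as i64);
            return Some((decoded, i + 1));
        }
    }
    None // continuation bit still set: incomplete input, not a fatal error
}

// Parse an OCF block header: (record_count, payload_size, header_bytes).
fn read_block_header(data: &[u8]) -> Option<(i64, i64, usize)> {
    let (count, n1) = read_zigzag_varint(data)?;
    let (size, n2) = read_zigzag_varint(&data[n1..])?;
    Some((count, size, n1 + n2))
}
```

With the count known in advance, the `Reader` can decode exactly that many records against the row decoder in one pass, rather than probing record-by-record.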

API / struct adjustments

  • Remove internal expect_prefix flag from Decoder; behavior is driven by the state machine.
  • ReaderBuilder::make_decoder_with_parts updated accordingly (no behavior change to public builder methods).
  • No public API signature changes for Reader, Decoder, or ReaderBuilder.

Tests

  • Add targeted streaming tests:
    • test_two_messages_same_schema
    • test_two_messages_schema_switch
    • test_split_message_across_chunks
  • Update prefix‑handling tests to validate state transitions (Magic → Fingerprint, etc.) and new error messages.
  • Retain and exercise existing suites (types, lists, nested structures, decimals, enums, strict mode) with minimal adjustments.

Are these changes tested?

Yes.

  • New unit tests cover:
    • Multi‑message streams with/without schema switches
    • Messages split across chunk boundaries
    • Incremental prefix/fingerprint parsing
  • Existing tests continue to cover OCF reading, compression, complex/nested types, strict mode, etc.
  • The new OCF path is exercised by the unchanged OCF tests since Reader now uses decode_block/flush_block.

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 9, 2025
@jecsand838 jecsand838 changed the title Refactor Avro Decoder to support partial decoding and improve decod… Refactor arrow-avro Decoder to support partial decoding Aug 9, 2025
@jecsand838
Contributor Author

jecsand838 commented Aug 9, 2025

@scovich @alamb Here's the follow-up PR with the partial decoding enhancements along with the different paths for file and single object decoding.

@jecsand838 jecsand838 force-pushed the avro-reader-chunking branch 2 times, most recently from c3cd755 to de7fa16 Compare August 9, 2025 22:21
@jecsand838 jecsand838 force-pushed the avro-reader-chunking branch from de7fa16 to adfcb1c Compare August 9, 2025 22:23
Contributor

@scovich scovich left a comment


Overall looks nice. Some possible simplifications, but my main question/concern is about zero-length records causing zero bytes consumed? The code seems to keep going back and forth on whether it's allowed/possible?

Ok(n) if n > 0 => {
    self.remaining_capacity -= 1;
    total_consumed += n;
    self.awaiting_body = false;

Contributor

Is there always a fingerprint after each record? Or just a chance to see a fingerprint?

Contributor Author

There's always a magic + fingerprint prefix at the start of each single object encoded record.
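Concretely, per the Avro specification the single-object prefix is the two-byte marker `0xC3 0x01` followed by the 8-byte little-endian CRC-64-AVRO schema fingerprint, then the Avro-binary record body. A hypothetical framer (not part of arrow-avro) showing the layout:

```rust
// Build a single-object-encoded message: marker + fingerprint + body.
fn frame_single_object(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);              // magic marker
    out.extend_from_slice(&fingerprint.to_le_bytes()); // 8-byte fingerprint
    out.extend_from_slice(body);                       // Avro-binary record
    out
}
```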

Comment on lines 198 to 199
return Err(ArrowError::ParseError(
    "Record decoder consumed 0 bytes".into(),

Contributor

I thought zero-byte records were legal, and we're supposed to keep looping until the output batch is full?

Contributor Author

I didn't think they were legal for single object encodings, but sure enough they are. So I'll remove this. Should have re-read the specs lol.

// Decode up to `count` records (or the remaining batch capacity, whichever is smaller) from `data` (an OCF block payload).
//
// Returns the number of bytes consumed from `data` along with the number of records decoded.
fn decode_block(&mut self, data: &[u8], count: usize) -> Result<(usize, usize), ArrowError> {

Contributor

Two same-type return values with very different meanings... is it worth defining a struct for it so they have names? Or are there few enough (and always internal) callers to keep track of it?

Contributor Author

That occurred to me as well. I decided to leave it that way because the only caller is Reader::read and decode_block is not public.

As an aside, I was unsure of the value a public decode_block method would offer, since block encodings only exist in Object Container Files, which the Reader handles. If there's demand in the future for decoding blocks outside of the Reader, we'd probably want to refactor the code to support something like Decoder::decode_block(block: Block, codec: Option<CompressionCodec>) -> Result<DecodeRes, ArrowError>, as you pointed out. I was just concerned this would be a premature optimization.

@jecsand838
Contributor Author

jecsand838 commented Aug 10, 2025

Overall looks nice. Some possible simplifications, but my main question/concern is about zero-length records causing zero bytes consumed? The code seems to keep going back and forth on whether it's allowed/possible?

Ty!

The only reason I added that back in was that I didn't think zero-byte encodings were legal for single-object encodings, but it seems I was wrong and they are. So I'll push up a change to account for that.

jecsand838 and others added 6 commits August 10, 2025 14:04
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
@jecsand838 jecsand838 force-pushed the avro-reader-chunking branch from 40c7e34 to 664d344 Compare August 10, 2025 19:48
@jecsand838
Contributor Author

@scovich I went ahead and pushed up changes that remove the zero byte error on single object encodings and improve the Decoder::decode method's readability.

Let me know what you think.

@jecsand838 jecsand838 requested a review from scovich August 10, 2025 19:53

Contributor

@scovich scovich left a comment

Code is good now, except I'm worried about robustness in the face of invalid input bytes.

Comment on lines 132 to 138
fn is_incomplete_data(err: &ArrowError) -> bool {
    matches!(
        err,
        ArrowError::ParseError(msg)
            if msg.contains("Unexpected EOF")
                || msg.contains("bad varint")
                || msg.contains("offset overflow")

Contributor

Hmm, after thinking more about this over the weekend --

Trying to interpret/suppress these errors will almost certainly make the decoder brittle in the face of malformed input bytes that legitimately trigger these errors. For example, we could put the decoder in an infinite loop where it keeps trying to fetch more and more bytes in hopes of eliminating the error, when the error is fully contained in the existing buffer.
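A toy model of this failure mode (hypothetical names, not the arrow-avro code): when a genuinely malformed prefix produces the same message as real truncation, a fetch-more loop never makes progress.

```rust
// Hypothetical decode stub: a malformed 0xFF prefix is (wrongly) reported
// with the same "Unexpected EOF" message as genuine truncation.
fn decode(buf: &[u8]) -> Result<usize, String> {
    match buf.first() {
        None => Err("Unexpected EOF".into()),        // truly incomplete
        Some(&0xFF) => Err("Unexpected EOF".into()), // malformed, same message
        Some(_) => Ok(buf.len()),
    }
}

fn is_incomplete_data(msg: &str) -> bool {
    msg.contains("Unexpected EOF")
}

// Bounded driver standing in for the real read loop: on "incomplete" it
// fetches one more byte. With a malformed prefix it never makes progress,
// which in an unbounded loop would spin forever.
fn attempts_before_progress(mut buf: Vec<u8>, max_rounds: usize) -> usize {
    for round in 0..max_rounds {
        match decode(&buf) {
            Ok(_) => return round,
            Err(msg) if is_incomplete_data(&msg) => buf.push(0x00),
            Err(_) => return round,
        }
    }
    max_rounds
}
```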

Contributor Author

@jecsand838 jecsand838 Aug 11, 2025

That's fair. I planned to clean this up in the Schema Resolution RecordDecoder PR that's coming after #8047 gets merged in.

Contributor Author

I ended up modifying is_incomplete_data to this:

fn is_incomplete_data(err: &ArrowError) -> bool {
    matches!(
        err,
        ArrowError::ParseError(msg)
            if msg.contains("Unexpected EOF")
    )
}

I double checked the arrow-avro/src/reader/cursor.rs file and we should only get the Unexpected EOF error if there's too few bytes.

I'll also improve the logic in arrow-avro/src/reader/cursor.rs to support a more deliberate and less rigid implementation in a future PR. I left a comment in the code calling this out.

Let me know what you think of this approach.

}
let batch = self.active_decoder.flush()?;
self.remaining_capacity = self.batch_size;
self.apply_pending_schema();

Contributor

flush and flush_block are identical except this call to self.apply_pending_schema?
Is there a way to deduplicate the code? Maybe a flush_internal that takes a boolean argument (which the compiler would aggressively inline away as if it were a generic parameter)?

Or just call self.apply_pending_schema unconditionally, knowing it should be a no-op during block decoding because self.pending_schema is always None?

Contributor Author

That's a good call out. I'll create a helper for:

        let batch = self.active_decoder.flush()?;
        self.remaining_capacity = self.batch_size;

I want to keep the schema change logic completely decoupled from the block decode/flush path for now, both to avoid confusion for future contributors and to set us up for any future Decoder decomposition efforts.

Contributor Author

I just pushed up changes which include a new flush_and_reset method:

    fn flush_and_reset(&mut self) -> Result<Option<RecordBatch>, ArrowError> {
        if self.batch_is_empty() {
            return Ok(None);
        }
        let batch = self.active_decoder.flush()?;
        self.remaining_capacity = self.batch_size;
        Ok(Some(batch))
    }

It abstracted out most of the flush logic and flush_block is now just a public wrapper for that new method.

jecsand838 and others added 2 commits August 11, 2025 12:22
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
@jecsand838 jecsand838 force-pushed the avro-reader-chunking branch from 7096c0e to 9456a11 Compare August 11, 2025 20:22
@jecsand838 jecsand838 force-pushed the avro-reader-chunking branch from 9456a11 to 14303c3 Compare August 11, 2025 20:25
@jecsand838
Contributor Author

Code is good now, except I'm worried about robustness in the face of invalid input bytes.

Ty for the fast follow-up review. Let me know what you think of these latest changes!

@jecsand838 jecsand838 requested a review from scovich August 11, 2025 20:31

Contributor

@scovich scovich left a comment

LGTM!

}

/// Returns true if the decoder has not decoded any batches yet.
pub fn batch_is_empty(&self) -> bool {

Contributor

nit: Not sure it needs to be pub? But I guess it's plenty well-defined (batch_is_empty is true if flush would return None), so no harm leaving it public?

Contributor Author

I didn't see any harm in leaving it pub. Figured it may be useful for someone, plus batch_is_full was already pub, so I was trying to keep it consistent.

Contributor

Ah, I didn't realize we already had a pub batch_is_full method. Makes sense!

@alamb
Contributor

alamb commented Aug 12, 2025

Thanks for the review and code @jecsand838 and @scovich -- let's keep the train moving

@alamb alamb merged commit 97c0e7c into apache:main Aug 12, 2025
23 checks passed