Skip to content

Conversation

@jecsand838
Copy link
Contributor

@jecsand838 jecsand838 commented Jul 19, 2025

This adds a new zero_byte.avro file which will be used to test zero byte records in arrow-avro

Here's the python script used to generate the new avro file: https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395

Part of apache/arrow-rs#4886

alamb pushed a commit to apache/arrow-rs that referenced this pull request Jul 22, 2025
# Which issue does this PR close?

- Part of #4886

- Follow up to #7834

# Rationale for this change

The initial Avro reader implementation contained an under-developed and
temporary safeguard to prevent infinite loops when processing records
that consumed zero bytes from the input buffer.

When the `Decoder` reported that zero bytes were consumed, the `Reader`
would advance it's cursor to the end of the current data block. While
this successfully prevented an infinite loop, it had the critical side
effect of silently discarding any remaining data in that block, leading
to potential data loss.

This change enhances the decoding logic to handle these zero-byte values
correctly, ensuring that the `Reader` makes proper progress without
dropping data and without risking an infinite loop.

# What changes are included in this PR?

- **Refined Decoder Logic**: The `Decoder` has been updated to
accurately track and report the number of bytes consumed for all values,
including valid zero-length records like `null` or empty `bytes`. This
ensures the decoder always makes forward progress.
- **Removal of Data-Skipping Safeguard**: The logic in the `Reader` that
previously advanced to the end of a block on a zero-byte read has been
removed. The reader now relies on the decoder to report accurate
consumption and advances its cursor incrementally and safely.
- * New integration test using a temporary `zero_byte.avro` file created
via this python script:
https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395

# Are these changes tested?

Yes, a new `test_read_zero_byte_avro_file` test was added that reads the
new `zero_byte.avro` file and confirms the update.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. PR to update `test_read_zero_byte_avro_file` once
apache/arrow-testing#109 is merged in.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thanks @jecsand838

@alamb alamb merged commit f8eb6e3 into apache:master Sep 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants