Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPC code writes files which do not include a valid stream #6311

Open
bkietz opened this issue Aug 26, 2024 · 3 comments · May be fixed by #6321
Open

IPC code writes files which do not include a valid stream #6311

bkietz opened this issue Aug 26, 2024 · 3 comments · May be fixed by #6321
Labels

Comments

@bkietz
Copy link
Member

bkietz commented Aug 26, 2024

Describe the bug

The format guarantees that each IPC file embeds a valid IPC stream in order to allow readers to ignore the Footer, skip the file's leading magic, and reuse a stream reader.

However, when writing IPC files arrow-rs aligns the encapsulated flatbuffers Messages to 64 byte boundaries instead of 8 bytes. This can leave gaps of padding bytes between the Messages which a stream reader would not know to skip.

To Reproduce

apache/arrow#43834 adds validation of the embedded stream to arrow-c++'s arrow-json-integration-test. Running this against IPC files written by arrow-rs raises an error

$ arrow-json-integration-test -arrow datetime.arrow -json datetime.json -integration -mode VALIDATE
Error message: Invalid: Tried reading schema message, was null or length 0

Expected behavior

A stream reader should be able to read an IPC file by skipping the first 8 bytes.

Additional context

This was originally introduced in eddef43 which added alignment to address new requirements around i128. However the alignment should not be applied to flatbuffers Messages; apart from the above issue I think there's no SIMD or other advantage to aligning those to more than 8 bytes. Body buffers can of course still be padded and aligned freely.

This was discovered while adding IPC file reading to nanoarrow; we were trying to defer reading Footers for a follow up and discovered that the go, rust, and javascript implementations don't embed a valid stream. Most readers have not noticed because offsets and schemas are more efficiently read from a Footer, and once acquired obviate sequential stream-style reading.

@bkietz bkietz added the bug label Aug 26, 2024
@bkietz
Copy link
Member Author

bkietz commented Aug 26, 2024

I tried adding validation to arrow-rs as with apache/arrow#43834 with validate-stream.zip

@bkietz
Copy link
Member Author

bkietz commented Aug 26, 2024

@westonpace

@westonpace
Copy link
Member

Ben and I spoke about this a bit externally. As I understand it the main benefit is that it makes it simpler to consume the IPC format (you can consume both file and streaming formats with a single streaming implementation if you don't need the random access).

I don't believe there are any drawbacks to restricting the alignment of the bytes following the magic header. I also don't envision any backwards compatibility concerns (readers that are using the flatbuffers footers can continue to accept whatever header alignment the producer is creating, the change is just to stop writing certain alignments so that readers that don't use the footer can start working)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@bkietz @westonpace and others