ARROW-8676: [Rust] IPC RecordBatch body compression #9137
Creates body compression, with LZ4_FRAME supported. This depends on ARROW-10299 being merged first.
For anyone who understands how LZ4 works, I need some help. The error that I'm getting is:

To reproduce the error, please:

```python
import pyarrow as pa

file = pa.ipc.open_file("$ARROW_ROOT/rust/arrow/target/debug/testdata/primitive_lz4.arrow_file")
batches = file.read_all()
```

Thanks
The error originates at https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc lines 283-285: it seems that `input_len` remains non-zero even though decompression has completed.
The logic of how the buffers are compressed isn't explained in a beginner-friendly way, so I'm likely implementing the compression incorrectly. I used the Java implementation in #8949 as a reference; I couldn't quite understand the C++ implementation.
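For what it's worth, the per-buffer framing described in the IPC format can be sketched in a stdlib-only way. This is an illustration, not code from this PR: `frame_buffer` is a hypothetical helper, and the `compress` closure stands in for a real LZ4-frame encoder. Each buffer is prefixed with an 8-byte little-endian i64 holding the *uncompressed* length, where -1 signals that the bytes that follow are stored uncompressed.

```rust
// Sketch of Arrow IPC body-compression framing for one buffer:
// [i64 LE uncompressed length | compressed bytes], or
// [-1 as i64 LE | raw bytes] when compression is skipped.
fn frame_buffer(input: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    let mut out = Vec::new();
    let body = compress(input);
    if body.len() < input.len() {
        // Prefix with the ORIGINAL (uncompressed) length, little-endian.
        out.extend_from_slice(&(input.len() as i64).to_le_bytes());
        out.extend_from_slice(&body);
    } else {
        // Compression didn't help: store raw, with a -1 sentinel prefix.
        out.extend_from_slice(&(-1i64).to_le_bytes());
        out.extend_from_slice(input);
    }
    out
}

fn main() {
    // Identity "compressor", just for illustration; it never shrinks the
    // input, so the -1 (uncompressed) path is taken.
    let framed = frame_buffer(&[0u8; 16], |b| b.to_vec());
    assert_eq!(&framed[..8], &(-1i64).to_le_bytes());
    assert_eq!(framed.len(), 8 + 16);
    println!("ok");
}
```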
```rust
output_buf.write_all(&(input_buf.len() as i64).to_le_bytes())?;
let mut encoder = lz4::EncoderBuilder::new().build(output_buf)?;
let mut from = 0;
loop {
```
This loop can be omitted.
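To illustrate why the loop is unnecessary: `Write::write_all` already loops internally until every byte has been written. A stdlib-only sketch, using a `Vec<u8>` as the sink (the lz4 `Encoder` in the diff above also implements `Write`, so the same one-liner would apply there):

```rust
use std::io::Write;

// Write all of `input` into a sink in a single call; `write_all` retries
// internally on partial writes, so no manual chunking loop is needed.
fn copy_all(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut sink: Vec<u8> = Vec::new();
    sink.write_all(input)?;
    Ok(sink)
}

fn main() {
    let out = copy_all(&[7u8; 1000]).unwrap();
    assert_eq!(out.len(), 1000);
    println!("ok");
}
```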
@nevi-me would you please have a look at
Did you notice something with it? I might need a bit more context :)
Perhaps you missed
I missed it at the dictionary write, but not when writing a plain record batch, so it's not that. If I hadn't written the compression details completely, the C++ implementation wouldn't have known that the message is compressed, or that it uses LZ4.
@nevi-me I pulled your branch locally, found I'm installing
@nevi-me no luck installing pyarrow 2.0.0 due to various dependency errors, but I found something: it's a bit complicated to implement with C bindings as what FYI:
C++ has build flags for each codec. Perhaps we should also add a Cargo feature for this, because:
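As a sketch of what codec gating might look like (the feature and dependency names below are hypothetical, not from this PR), optional codecs in `Cargo.toml` could mirror the C++ build flags:

```toml
[features]
default = []
# Hypothetical opt-in codec features, mirroring the C++ codec flags.
ipc_compression_lz4 = ["lz4"]
ipc_compression_zstd = ["zstd"]

[dependencies]
lz4 = { version = "1", optional = true }
zstd = { version = "0", optional = true }
```

This would keep the compression crates out of the default build for users who never read or write compressed IPC streams.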
mqy left a comment
Perhaps the arg `total_len` of `buffers.push(ipc::Buffer::new(...))` should not account for `pad_len`.
```rust
let pad_len = pad_to_8(len as u32);
let total_len: i64 = (len + pad_len) as i64;
// assert_eq!(len % 8, 0, "Buffer width not a multiple of 8 bytes");
buffers.push(ipc::Buffer::new(offset, total_len));
```
@nevi-me perhaps the arg `total_len` of `buffers.push(ipc::Buffer::new(...))` should not account for `pad_len`. After fixing this possible bug, my pyarrow test passed and `cargo test` shows no failures, but I doubt whether this is the root cause, because:
- I changed quite a lot based on your PR.
- This line was last updated in PR ARROW-518 a year ago.
I took quite some time to compare with the C++ code; it looks like C++ sets the buffer size to the actual memory size.
The IPC spec recordbatch-message section states that:

> The size field of Buffer is not required to account for padding bytes. Since this metadata can be used to communicate in-memory pointer addresses between libraries, it is recommended to set size to the actual memory size rather than the padded size.
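The suggested fix can be illustrated with the padding arithmetic. A small sketch (this `pad_to_8` is a plausible implementation of the helper referenced in the diff, not necessarily identical to it): the Buffer's `size` should be the unpadded length, while the write offset still advances by the padded length so the next buffer stays 8-byte aligned.

```rust
// Padding needed to round `len` up to the next multiple of 8 bytes.
fn pad_to_8(len: u32) -> u32 {
    (8 - (len % 8)) % 8
}

fn main() {
    let len: u32 = 13;
    let pad_len = pad_to_8(len);
    assert_eq!(pad_len, 3);

    // Buffer metadata records the actual (unpadded) length...
    let buffer_size = len as i64; // not (len + pad_len)
    // ...while the offset advances by the padded length for alignment.
    let next_offset = (len + pad_len) as i64;

    assert_eq!(buffer_size, 13);
    assert_eq!(next_offset, 16);
    println!("ok");
}
```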
FYI: 68a558b at master...mqy:nevi-me_ARROW-8676 based on your PR.
I'll come back to this in the coming weeks
Adds RecordBatch body compression, which compresses the buffers that make up arrays (e.g. offsets, null buffer).
I've restricted the write side to only work with v5 of the metadata. We can expand on this later, as I think the non-legacy v4 supports the `BodyCompression` method implemented here. Reading should be fine if the compression info is specified. This PR is built on top of ARROW-10299 (#9122).
I have not yet implemented ZSTD compression, but I expect it shouldn't be too much work, so can still be done as part of this PR.