
Conversation

@nevi-me (Contributor) commented Jan 8, 2021

Adds RecordBatch body compression, which compresses the buffers that make up arrays (e.g. offset and null buffers).
I've restricted the write side to only work with v5 of the metadata. We can expand on this later, as I think the non-legacy v4 also supports the BodyCompression method implemented here. Reading should be fine as long as the compression info is specified.

This PR is built on top of ARROW-10299 (#9122).

I have not yet implemented ZSTD compression, but I expect it shouldn't be too much work, so it can still be done as part of this PR.
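
For context, IPC body compression compresses each constituent buffer independently, and each compressed buffer is prefixed with its uncompressed length. A minimal sketch of that layout, assuming the lz4 crate for the frame-format body (the helper name is mine, not from this PR):

use std::io::Write;

// Hypothetical helper showing the on-wire layout of one compressed buffer:
// an int64 uncompressed-length prefix, then the codec-framed bytes.
fn write_compressed_buffer(out: &mut Vec<u8>, input: &[u8]) -> std::io::Result<()> {
    // 8-byte little-endian uncompressed length
    out.write_all(&(input.len() as i64).to_le_bytes())?;
    // compressed body follows, here as a single LZ4 frame
    let mut encoder = lz4::EncoderBuilder::new().build(out)?;
    encoder.write_all(input)?;
    encoder.finish().1
}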

@nevi-me (Contributor, Author) commented Jan 8, 2021

For anyone who understands how LZ4 works, I need some help.
In Rust I'm able to read LZ4-compressed data written by Python, but pyarrow gives an error when reading a file written by the Rust writer.

The error that I'm getting is:

OSError: Lz4 compressed input contains more than one frame

To reproduce the error, please:

  1. Run the arrow unit tests, so that a compressed file is created.
  2. Run the snippet below to read the file in pyarrow:
import pyarrow as pa
file = pa.ipc.open_file("$ARROW_ROOT/rust/arrow/target/debug/testdata/primitive_lz4.arrow_file")
batches = file.read_all()

Thanks

@mqy (Contributor) commented Jan 8, 2021

The error comes from https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc, lines 283-285:

if (input_len != 0) {
  return Status::IOError("Lz4 compressed input contains more than one frame");
}

It seems that input_len remains non-zero even though the decompression completed.


@nevi-me (Contributor, Author) commented Jan 8, 2021

It seems that input_len remains non-zero even though the decompression completed.

The logic of how the buffers are compressed isn't explained in a beginner-friendly way, so I'm likely implementing the compression incorrectly. I used the Java implementation in #8949 as a reference; I couldn't quite understand the C++ implementation.

output_buf.write_all(&(input_buf.len() as i64).to_le_bytes())?;
let mut encoder = lz4::EncoderBuilder::new().build(output_buf)?;
let mut from = 0;
loop {
    // loop body truncated in the diff view; it appears to write input_buf
    // into the encoder in chunks

Review comment from a Contributor on the snippet above:

This loop can be omitted.
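
A sketch of what that suggestion could look like: the lz4 crate's Encoder implements std::io::Write, so the whole input can be written in one call (variable names follow the snippet above):

output_buf.write_all(&(input_buf.len() as i64).to_le_bytes())?;
let mut encoder = lz4::EncoderBuilder::new().build(output_buf)?;
// no chunking loop: write the whole buffer in one call
encoder.write_all(input_buf)?;
// finish() flushes the frame and surfaces any pending error
let (_output_buf, result) = encoder.finish();
result?;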

@mqy (Contributor) commented Jan 9, 2021

@nevi-me would you please have a look at BodyCompressionBuilder from ipc/gen/Message.rs?

@nevi-me (Contributor, Author) commented Jan 9, 2021

@nevi-me would you please have a look at BodyCompressionBuilder from ipc/gen/Message.rs?

Did you notice something with it? I might need a bit more context :)

@mqy (Contributor) commented Jan 9, 2021

Did you notice something with it? I might need a bit more context :)

pub struct BodyCompressionArgs {
    pub codec: CompressionType,
    pub method: BodyCompressionMethod,
}

Perhaps you missed RecordBatchBuilder::add_compression()?
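
For reference, a minimal sketch of attaching the compression metadata with the flatbuffers-generated builders; the exact calls are my reading of ipc/gen/Message.rs, so treat them as assumptions:

let mut fbb = flatbuffers::FlatBufferBuilder::new();

// declare how the body is compressed
let compression = {
    let mut c = ipc::BodyCompressionBuilder::new(&mut fbb);
    c.add_method(ipc::BodyCompressionMethod::BUFFER);
    c.add_codec(ipc::CompressionType::LZ4_FRAME);
    c.finish()
};

// attach it to the RecordBatch message so readers know to decompress
let mut batch = ipc::RecordBatchBuilder::new(&mut fbb);
batch.add_compression(compression);
// ... also add_length, add_nodes, add_buffers before batch.finish()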

@nevi-me (Contributor, Author) commented Jan 9, 2021

Perhaps you missed RecordBatchBuilder::add_compression()?

I missed it at the dictionary write, but not when writing a plain RecordBatch, so that's not it. If I hadn't written the compression details at all, the C++ implementation wouldn't have known that the message is compressed, or that it uses LZ4.
I'll debug when I have sufficient time.

@mqy (Contributor) commented Jan 10, 2021

Perhaps you missed RecordBatchBuilder::add_compression()?

I missed it at the dictionary write, but not when writing a plain RecordBatch, so that's not it. If I hadn't written the compression details at all, the C++ implementation wouldn't have known that the message is compressed, or that it uses LZ4.
I'll debug when I have sufficient time.

@nevi-me I pulled your branch locally and found add_compression in writer.rs. I'm sorry I failed to find add_compression on https://github.com/apache/arrow/pull/9137/files because writer.rs is not loaded in the diff view.

I'm installing pyarrow; after that I'll run the test test_write_file_v5_compressed.

@mqy (Contributor) commented Jan 10, 2021

@nevi-me no luck installing pyarrow 2.0.0 due to various dependency errors, but I found something:
Message.rs requires the LZ4 frame format, not the block format.

  // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
  // thereof. Not to be confused with "raw" (also called "block") format
  // provided by lz4.h

It's a bit complicated to implement with C bindings the way frameCompress.c does. I also found another crate named lzzzz for the LZ4 frame format: https://docs.rs/lzzzz/0.8.0/lzzzz/lz4f/index.html. Both compress_to_vec and decompress_to_vec look quite simple, and I got the unit test passing with these APIs.
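
For example, a round-trip sketch against the lzzzz 0.8 lz4f API linked above:

use lzzzz::lz4f::{compress_to_vec, decompress_to_vec, Preferences};

fn main() -> lzzzz::Result<()> {
    let input = b"buffer bytes to round-trip through the LZ4 frame format";

    // compress_to_vec appends one complete LZ4 frame to `compressed`
    let mut compressed = Vec::new();
    compress_to_vec(input, &mut compressed, &Preferences::default())?;

    // decompress_to_vec appends the decoded bytes to `decompressed`
    let mut decompressed = Vec::new();
    decompress_to_vec(&compressed, &mut decompressed)?;

    assert_eq!(&decompressed[..], &input[..]);
    Ok(())
}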


@mqy (Contributor) commented Jan 11, 2021

C++ has compile-time flags for the codecs. Perhaps we should also add a Cargo feature for this (see the sketch after this list), because:

  1. Compression may not be a very commonly used feature; ARROW-11188: [Rust] Support crypto functions from PostgreSQL dialect #9139 has a similar discussion.
  2. The lz4 crate has unsafe code that depends on liblz4.
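
A sketch of what the feature gate could look like; the feature name ipc_compression is hypothetical, and the optional lz4 dependency would be declared in Cargo.toml:

// compiled only when the hypothetical `ipc_compression` feature is enabled,
// keeping the unsafe liblz4 bindings out of default builds
#[cfg(feature = "ipc_compression")]
fn compress_body(input: &[u8], output: &mut Vec<u8>) -> std::io::Result<()> {
    use std::io::Write;
    let mut encoder = lz4::EncoderBuilder::new().build(output)?;
    encoder.write_all(input)?;
    encoder.finish().1
}

// without the feature, writing a compressed batch is an explicit error
// rather than a silent fallback
#[cfg(not(feature = "ipc_compression"))]
fn compress_body(_input: &[u8], _output: &mut Vec<u8>) -> std::io::Result<()> {
    Err(std::io::Error::new(
        std::io::ErrorKind::InvalidInput,
        "the ipc_compression feature is not enabled",
    ))
}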

@mqy (Contributor) left a review comment

Perhaps the total_len argument of buffers.push(ipc::Buffer::new(...)) should not account for pad_len.

let pad_len = pad_to_8(len as u32);
let total_len: i64 = (len + pad_len) as i64;
// assert_eq!(len % 8, 0, "Buffer width not a multiple of 8 bytes");
buffers.push(ipc::Buffer::new(offset, total_len));

Follow-up review comment from @mqy:


@nevi-me perhaps the total_len argument of buffers.push(ipc::Buffer::new(...)) should not account for pad_len. After fixing this possible bug, my test in pyarrow passed and there were no failures with cargo test, but I'm not sure this is the root cause, because:

  • I changed quite a lot based on your PR.
  • This line was last updated in PR ARROW-518 a year ago.

I took quite some time to compare with the C++ code; it looks like C++ sets the buffer size to the actual memory size.
The IPC spec (recordbatch-message section) states:

The size field of Buffer is not required to account for padding bytes. Since this metadata can be used to communicate in-memory pointer addresses between libraries, it is recommended to set size to the actual memory size rather than the padded size.

FYI: 68a558b at master...mqy:nevi-me_ARROW-8676 based on your PR.
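
A sketch of the suggested change, reusing the names from the snippet under review:

let pad_len = pad_to_8(len as u32);
// record the actual (unpadded) size in the Buffer metadata, per the spec...
buffers.push(ipc::Buffer::new(offset, len as i64));
// ...while padding still counts toward the offset of the next buffer
offset += (len + pad_len) as i64;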

@nevi-me (Contributor, Author) commented Jan 25, 2021

I'll come back to this in the coming weeks
