Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added optional support for LZ4 via LZ4-flex crate (thus enabling wasm) #124

Merged
merged 2 commits into from
Apr 15, 2022

Conversation

jorgecarleitao
Copy link
Owner

This PR was done together with @kylebarron and adds an optional dependency lz4-flex by @PSeitz as a LZ4 compressor/decompressor.

This implementation a bit slower but uses no unsafe and is written in native Rust, therefore supporting being compiled to wasm.

@jorgecarleitao jorgecarleitao added the feature A new feature label Apr 12, 2022
@kylebarron
Copy link
Contributor

I tried to test this out via this branch on parquet-wasm. But I got some compilation errors regarding types. I figured I was mixing arrow2 and parquet2 versions incorrectly, but I don't have a ton of time to debug this this morning.

Errors
kbarron at kbarron in ~/github/rust/parquet-wasm on kyle/parquet2-lz4-flex ✗                              [f2cfcfb]  10:36
> cargo run --bin read_parquet2 --features "debug arrow2 reader writer parquet2_supported_compressions" -- --input-file tests/data/1-partition-gzip.parquet --output-file test.arrow
   Compiling parquet2 v0.10.3 (https://github.com/jorgecarleitao/parquet2?rev=20c4d96eff5c472cc91784e43111ac92f7187059#20c4d96e)
   Compiling arrow2 v0.10.1 (https://github.com/kylebarron/arrow2?rev=41858ed0924cca67873851a647b9b28602039784#41858ed0)
error[E0432]: unresolved imports `parquet2::schema::types::LogicalType`, `parquet2::schema::types::TimestampType`
 --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/read/deserialize/simple.rs:5:9
  |
5 |         LogicalType, ParquetType, PhysicalType, TimeUnit as ParquetTimeUnit, TimestampType,
  |         ^^^^^^^^^^^ no `LogicalType` in `schema::types`                      ^^^^^^^^^^^^^ no `TimestampType` in `schema::types`

error[E0432]: unresolved imports `parquet2::schema::types::BasicTypeInfo`, `parquet2::schema::types::LogicalType`, `parquet2::schema::types::TimestampType`
 --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/read/schema/convert.rs:4:9
  |
4 |         BasicTypeInfo, GroupConvertedType, LogicalType, ParquetType, PhysicalType,
  |         ^^^^^^^^^^^^^                      ^^^^^^^^^^^ no `LogicalType` in `schema::types`
  |         |
  |         no `BasicTypeInfo` in `schema::types`
5 |         PrimitiveConvertedType, TimeUnit as ParquetTimeUnit, TimestampType,
  |                                                              ^^^^^^^^^^^^^ no `TimestampType` in `schema::types`

error[E0432]: unresolved imports `parquet2::schema::types::LogicalType`, `parquet2::schema::types::TimestampType`
 --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/read/statistics/primitive.rs:4:5
  |
4 |     LogicalType, ParquetType, TimeUnit as ParquetTimeUnit, TimestampType,
  |     ^^^^^^^^^^^ no `LogicalType` in `schema::types`        ^^^^^^^^^^^^^ no `TimestampType` in `schema::types`

error[E0432]: unresolved imports `parquet2::error::ParquetError`, `parquet2::schema::types::LogicalType`, `parquet2::schema::types::TimestampType`
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/read/mod.rs:14:5
   |
14 |     error::ParquetError,
   |     ^^^^^^^^^^^^^^^^^^^ no `ParquetError` in `error`
...
25 |         LogicalType, ParquetType, PhysicalType, PrimitiveConvertedType,
   |         ^^^^^^^^^^^ no `LogicalType` in `schema::types`
26 |         TimeUnit as ParquetTimeUnit, TimestampType,
   |                                      ^^^^^^^^^^^^^ no `TimestampType` in `schema::types`

error[E0432]: unresolved imports `parquet2::schema::types::DecimalType`, `parquet2::schema::types::IntType`, `parquet2::schema::types::LogicalType`, `parquet2::schema::types::TimeType`, `parquet2::schema::types::TimestampType`
 --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/write/schema.rs:5:13
  |
5 |             DecimalType, IntType, LogicalType, ParquetType, PhysicalType, PrimitiveConvertedType,
  |             ^^^^^^^^^^^  ^^^^^^^  ^^^^^^^^^^^ no `LogicalType` in `schema::types`
  |             |            |
  |             |            no `IntType` in `schema::types`
  |             no `DecimalType` in `schema::types`
6 |             TimeType, TimeUnit as ParquetTimeUnit, TimestampType,
  |             ^^^^^^^^                               ^^^^^^^^^^^^^ no `TimestampType` in `schema::types`
  |             |
  |             no `TimeType` in `schema::types`

error[E0433]: failed to resolve: could not find `ParquetError` in `error`
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/mod.rs:12:30
   |
12 |             parquet2::error::ParquetError::FeatureNotActive(_, _) => {
   |                              ^^^^^^^^^^^^ could not find `ParquetError` in `error`

error[E0433]: failed to resolve: could not find `ParquetError` in `error`
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/mod.rs:25:26
   |
25 |         parquet2::error::ParquetError::General(error.to_string())
   |                          ^^^^^^^^^^^^ could not find `ParquetError` in `error`

error[E0412]: cannot find type `ParquetError` in module `parquet2::error`
 --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/mod.rs:9:28
  |
9 | impl From<parquet2::error::ParquetError> for ArrowError {
  |                            ^^^^^^^^^^^^ not found in `parquet2::error`

error[E0412]: cannot find type `ParquetError` in module `parquet2::error`
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/mod.rs:10:37
   |
10 |     fn from(error: parquet2::error::ParquetError) -> Self {
   |                                     ^^^^^^^^^^^^ not found in `parquet2::error`

error[E0412]: cannot find type `ParquetError` in module `parquet2::error`
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/mod.rs:23:44
   |
23 | impl From<ArrowError> for parquet2::error::ParquetError {
   |                                            ^^^^^^^^^^^^ not found in `parquet2::error`

error[E0782]: trait objects must include the `dyn` keyword
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/read/deserialize/mod.rs:30:13
   |
30 | ) -> Result<PageIterator<R>> {
   |             ^^^^^^^^^^^^^^^
   |
help: add `dyn` keyword before this trait
   |
30 - ) -> Result<PageIterator<R>> {
30 + ) -> Result<dyn PageIterator<R>> {
   |

error[E0107]: this trait takes 0 generic arguments but 1 generic argument was supplied
  --> /Users/kbarron/.cargo/git/checkouts/arrow2-dbdaa0e42eec6fa2/41858ed/src/io/parquet/read/deserialize/mod.rs:30:13
   |
30 | ) -> Result<PageIterator<R>> {
   |             ^^^^^^^^^^^^--- help: remove these generics
   |             |
   |             expected 0 generic arguments
   |
note: trait defined here, with 0 generic parameters
  --> /Users/kbarron/.cargo/git/checkouts/parquet2-016c4444b5011f07/20c4d96/src/read/page/mod.rs:11:11
   |
11 | pub trait PageIterator: Iterator<Item = Result<CompressedDataPage, Error>> {
   |           ^^^^^^^^^^^^

Some errors have detailed explanations: E0107, E0412, E0432, E0433, E0782.
For more information about an error, try `rustc --explain E0107`.
error: could not compile `arrow2` due to 12 previous errors
warning: build failed, waiting for other jobs to finish...
error: build failed

@jorgecarleitao
Copy link
Owner Author

Thanks a lot for the feedback! You need to use jorgecarleitao/arrow2#923, which is waiting for this PR to land and a new release of parquet2 to depend on it :)

@kylebarron
Copy link
Contributor

I updated this branch to test, pointing to a fork off of jorgecarleitao/arrow2#923 so that I could point io_parquet_compression to parquet2/lz4_flex.

When trying to load an lz4-compressed file written by pyarrow, it still fails. I'm guessing that this is the difference between the two lz4 Parquet implementations. Pyarrow apparently writes files that advertise Compression::Lz4, not Compression:Lz4Raw

> cargo run --bin read_parquet2 --features "debug arrow2 reader writer parquet2_supported_compressions" -- --input-file tests/data/1-partition-lz4.parquet --output-file test_lz4.arrow
    Finished dev [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/read_parquet2 --input-file tests/data/1-partition-lz4.parquet --output-file test_lz4.arrow`
Could not read parquet file: External format error: Compression Lz4 is not yet supported

Note that this is running locally, not within wasm. It uses this entry point. (I figured I needed a way to debug files outside of wasm, exactly for cases like this 😄 )

The lz4 parquet test file is here, made by this python script.

@jorgecarleitao
Copy link
Owner Author

@kylebarron , I am able to read it:

import pyarrow as pa  # pyarrow==7.0.0
import pyarrow.parquet


path = "bla.parquet"

t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
)

pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)

(using parquet2_migrate branch in arrow2)

cargo run --example parquet_read --features io_parquet,io_parquet_compression -- bla.parquet

note that the file can't be read by spark, though:

import pyspark.sql  # pyspark==3.1.2

spark = pyspark.sql.SparkSession.builder.getOrCreate()

result = spark.read.parquet("bla.parquet").select("int64").collect()

fails

I think that there is a big problem with LZ4 in the ecosystem - I can see at least 3 implementations of LZ4 around: https://github.com/apache/arrow/blob/bf18e6e4b5bb6180706b1ba0d597a65a4ce5ca48/cpp/src/arrow/util/compression_lz4.cc

@kylebarron
Copy link
Contributor

Aha! I had pyarrow v6 (I think) installed before, and upgrading to pyarrow v7 did indeed produce different files. Diffing the generated python files, it looks like the older file had a footer with the text parquet-cpp version 1.5.1-SNAPSHOT while the new file has a footer with the string parquet-cpp-arrow version 7.0.0.

This command (cargo run --example parquet_read --features io_parquet,io_parquet_compression -- bla.parquet) does work with the new files, both with parquet2/lz4 and parquet2/lz4_flex.

Thanks for the PR!

1 similar comment
@kylebarron
Copy link
Contributor

Aha! I had pyarrow v6 (I think) installed before, and upgrading to pyarrow v7 did indeed produce different files. Diffing the generated python files, it looks like the older file had a footer with the text parquet-cpp version 1.5.1-SNAPSHOT while the new file has a footer with the string parquet-cpp-arrow version 7.0.0.

This command (cargo run --example parquet_read --features io_parquet,io_parquet_compression -- bla.parquet) does work with the new files, both with parquet2/lz4 and parquet2/lz4_flex.

Thanks for the PR!

@jorgecarleitao jorgecarleitao merged commit efc2a91 into main Apr 15, 2022
@jorgecarleitao jorgecarleitao mentioned this pull request Apr 15, 2022
@jorgecarleitao jorgecarleitao changed the title Add optional LZ4 flex Added optional support for LZ4 via LZ4-flex crate Apr 15, 2022
@jorgecarleitao jorgecarleitao changed the title Added optional support for LZ4 via LZ4-flex crate Added optional support for LZ4 via LZ4-flex crate (thus enabling wasm) Apr 15, 2022
@jorgecarleitao jorgecarleitao deleted the bla branch April 15, 2022 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants