Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Use lz4_flex for wasm32 #91

Closed
wants to merge 7 commits into from

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Mar 7, 2022

Disclaimer: I'm new to Rust but motivated to get LZ4 Parquet decompression working in wasm. 🙂

I'm working on wasm bindings to arrow2/parquet2. So far it appears that all other compressions other than LZ4 are working. I tried to switch to lz4_flex on this branch but I'm not sure I gated the wasm target correctly. Tests seem to pass on this branch but when I try to load a Parquet file with LZ4 encoding on the web I get an error Uncaught (in promise) External format error: underlying IO error: WrongMagicNumber. Maybe I'm using the wrong lz4_flex APIs.

Ref #85

@jorgecarleitao
Copy link
Owner

Thanks a lot for the PR!

Would it be possible to use a feature flag instead of a conditional on the target? Folks not in wasm may want to use lz4_flex regardless of wasm, so maybe a feature could make more sense here?

This also allows us to add to the CI to run the tests with and without this flag, thereby covering both branches without having to run the tests under wasm.

@codecov-commenter
Copy link

codecov-commenter commented Mar 10, 2022

Codecov Report

Merging #91 (db03d9e) into main (aaef5fe) will not change coverage.
The diff coverage is n/a.

❗ Current head db03d9e differs from pull request most recent head 5cb086d. Consider uploading reports for the commit 5cb086d to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##             main      #91   +/-   ##
=======================================
  Coverage   67.62%   67.62%           
=======================================
  Files          69       69           
  Lines        3830     3830           
=======================================
  Hits         2590     2590           
  Misses       1240     1240           
Impacted Files Coverage Δ
src/compression.rs 89.70% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aaef5fe...5cb086d. Read the comment docs.

@kylebarron
Copy link
Contributor Author

Would it be possible to use a feature flag instead of a conditional on the target?
This also allows us to add to the CI to run the tests with and without this flag

Yes that makes a ton of sense. I've tried to update this to use a feature flag instead, where if lz4 and lz4_flex features are both specified, then lz4 takes precedence.

I'm not sure the optimal way to add a test for this; I added a CI unit test call with no default features and with lz4_flex added instead of lz4.

I'm also still not sure I'm using the lz4_flex API correctly 😄 .

@jorgecarleitao
Copy link
Owner

Sorry for the delay, I am focused on releasing arrow2 0.10, so this is a bit delayed on my end. What I would do here is to have a test where we have a pair (decompressed, compressed): (Vec<u8>, Vec<u8>) obtained from any of the packages, and test that we can roundtrip it with both feature flags.

This requires 1 test (with the pair) and a new entry in the ci / github workflows with the wasm/lz4_flex feature active, to show that the test runs on both cases. Does it make sense?

Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed the test. It looks good to me. Thanks again!

@kylebarron
Copy link
Contributor Author

kylebarron commented Mar 14, 2022

I did previously add the line in test.yml to run the existing tests with lz4_flex, so maybe that resolved your question from above.

But I'm still getting issues when trying to actually test reading a minimal parquet file. And it appears to not work either with the lz4 feature or with the lz4_flex feature.

I created a simple pure-rust script here (https://github.com/kylebarron/parquet-wasm/pull/36/files#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc). When I run with the lz4 feature, I get:

> cargo run --no-default-features --features arrow2 --features parquet_compression_lz4
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ExternalFormat("underlying IO error: LZ4 error: ERROR_frameType_unknown")', src/main.rs:22:33

When I run with the lz4_flex feature, I get:

> cargo run --no-default-features --features arrow2 --features parquet_compression_lz4_flex
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ExternalFormat("underlying IO error: WrongMagicNumber")', src/main.rs:22:33

That's when reading this Parquet file created by this Python script. And it's readable in Pyarrow:

In [2]: import pyarrow.parquet as pq

In [3]: pq.ParquetFile('1-partition-lz4.parquet').read()
Out[3]:
pyarrow.Table
str: string
uint8: uint8
int32: int32
bool: bool

@jorgecarleitao
Copy link
Owner

Ok, main lz4 has been fixed and has tests for lz4 (against pyarrow). This should unblock us here :)

@jorgecarleitao
Copy link
Owner

let me know if you need any help or guidance. We may also ping the developer of lz4_flex for help here.

@kylebarron
Copy link
Contributor Author

kylebarron commented Mar 18, 2022

I haven't had time to look at this this week. That said, even if the CI were green, I'd want to test this against files created with the the original LZ4 compression (and from pyarrow), to make sure we're using the API correctly in both cases. For example, the failing CI test looks to be a simple round trip of compression and decompression, and while the LZ4 flex docs have a simple example of block format roundtrip, using those functions might satisfy this CI test, but not the actual Parquet spec.

I guess it would be ideal to run the pyarrow integration test both with lz4 and with lz4_flex.

@jorgecarleitao
Copy link
Owner

Rebased against main and fixed the issue here #124 . Could you take a look @kylebarron and see if it addresses your concerns?

@jorgecarleitao
Copy link
Owner

Closed (done as part of #124 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants