Parquet does not support wasm32-unknown-unknown target #180

Closed
alamb opened this issue Apr 26, 2021 · 9 comments · Fixed by #2896
Labels
parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11593

The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm), but the Parquet crate currently does not support the wasm32-unknown-unknown target.

Try out the repository at domoritz/parquet-wasm@e877f9a. The problem seems to be in liblz4, even if I do not include lz4 in the feature flags.

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Dominik Moritz(domoritz) @ 2021-02-12T21:59:13.949+0000:

If lz4 is the issue, maybe we could switch to https://github.com/PSeitz/lz4_flex, which compiles to WASM. 

Comment from Andy Redhead(AndyRedhead1974) @ 2021-03-04T21:51:18.454+0000:

A WebAssembly compatible Rust library that can read data from Parquet files would be very useful to anyone who would like to do "browser based" data processing/visualisation (better still if that library is in a family that includes efficient in-memory "data structures"). 

Comment from David Roher(droher) @ 2021-04-12T02:03:38.512+0000:

I just got a version of DataFusion working on wasm32-unknown-unknown – it required disabling both the LZ4 and ZSTD features on Parquet and tweaking the hash function: https://github.com/apache/arrow/compare/master...droher:master

To add to AndyRedhead1974's point above, it would also be useful in a serverless context – for instance, Cloudflare Workers Unbound is in beta now and will allow WASM functions to run at unlimited CPU usage. In this context, DataFusion could be a serverless data lake engine like AWS Athena. Maybe it could even be useful as a Ballista worker.

Comment from Dominik Moritz(domoritz) @ 2021-04-12T04:38:44.388+0000:

That's awesome. Do you want to add a note to https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11615, which tracks DataFusion support for wasm?

@jorgecarleitao jorgecarleitao added parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 26, 2021
@kylebarron
Contributor

Just wondering if there's anything actionable here or if it's open-ended without a clear solution. I'm just starting to learn Rust but would love for this to work in wasm.

@domoritz
Member

domoritz commented Dec 9, 2021

I think someone needs to go through the dependencies and replace them with ones that work in wasm. I think this is pretty doable.

@kylebarron
Contributor

In the original issue you mentioned:

"The problem seems to be in liblz4, even if I do not include lz4 in the feature flags."

Testing locally, it appears that both the arrow (v6.4.0) and arrow2/parquet2 (v0.8.1) crates compile fine to wasm (using wasm-pack build) as long as I avoid both the lz4 and zstd dependencies. My goal is to first get a Parquet -> Arrow reader working in wasm, and then circle back to evaluate new lz4 and zstd implementations. But for now I'm still muddling through Rust 😅 .
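As a rough illustration of "avoiding lz4 and zstd on wasm", here is a minimal sketch of gating codecs on the compilation target. The `Codec` enum and `codec_available` function are hypothetical names made up for this example and are not part of the parquet crate; the sketch also assumes that lz4 and zstd are the only codecs that fail to build for wasm32.

```rust
/// Hypothetical codec list for illustration; not the parquet crate's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Uncompressed,
    Snappy,
    Lz4,
    Zstd,
}

/// Returns true when the codec is assumed to be usable on the current target.
pub fn codec_available(codec: Codec) -> bool {
    match codec {
        // Assumption taken from the thread: the lz4 and zstd dependencies
        // do not build for wasm32, so they are reported as unavailable there.
        Codec::Lz4 | Codec::Zstd => !cfg!(target_arch = "wasm32"),
        Codec::Uncompressed | Codec::Snappy => true,
    }
}

fn main() {
    // Print the availability of each codec for the target being compiled.
    for codec in [Codec::Uncompressed, Codec::Snappy, Codec::Lz4, Codec::Zstd] {
        println!("{:?}: available = {}", codec, codec_available(codec));
    }
}
```

In a real crate the same decision would more likely be made with optional Cargo features or `#[cfg(...)]` attributes on the codec modules, so that the problematic dependencies are never compiled for wasm32 at all.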

@andyredhead

I had a play with getting a minimalist Apache (v1) arrow & parquet to compile to wasm32-unknown-unknown back in July/August (2021) when the released version of arrow was ~5.0. It worked OK; I vaguely remember having to do something around adding an annotation on the hashcode for one of the structs to point rustc at an implementation that worked in wasm (I'm very new to Rust and definitely still in the "muddling through" phase).

My tinkering has been on the "back burner" since September (for a number of reasons, not least the hard disk on my personal laptop dying). I got the chance over the Christmas break just gone to recover what I can from the dead disk and get started again :)

The apache arrow-rs v6.5.0 crate builds without any modifications :)

I've put the (very basic) results of my tinkering into a git repo. It's based on the "Rust and WebAssembly" example project, uses the parquet crate to read a Parquet file, and uses the JavaScript arrow library to read values out of the result.

I recently stumbled across a reference to the arrow2/parquet2 projects, the design goals seem sensible but I haven't had chance to look at them yet.

@alamb
Contributor Author

alamb commented Jan 4, 2022

Yeah -- we now test to ensure that Arrow builds on wasm as part of all the CI runs:

https://github.com/apache/arrow-rs/runs/4685124037?check_suite_focus=true

@kylebarron
Contributor

I made another effort on top of @andyredhead's helpful repo to create a minimal JS parser from Parquet to Arrow. So far it seems to work with Snappy and uncompressed Parquet files, though the generated Arrow IPC files seem to be occasionally malformed (errors such as "Expected to read 1264832 metadata bytes, but only read 700"). To support gzip, it looks like we could try using flate2 when building for wasm: rust-lang/flate2-rs#161.

@kylebarron
Contributor

kylebarron commented Mar 4, 2022

Another update on the compression codecs:

In terms of Arrow IPC files being malformed, I switched from arrow::ipc::writer::StreamWriter to arrow::ipc::writer::FileWriter. Now all the Arrow files generated by parquet-wasm from my Parquet test files are readable in Python using pa.ipc.open_file, so presumably the JS errors arising from arrow.tableFromIPC(tableBytes) are issues with the JS library and its IPC parser. (It looks like there are known issues with the IPC support in JS: ARROW-15642, ARROW-13818, ARROW-8674.)
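For context on the StreamWriter vs. FileWriter difference: the Arrow IPC file format wraps the stream of messages in the magic bytes "ARROW1" at both the start and the end of the buffer, while the streaming format has no such wrapper. The following is a minimal std-only sketch for telling the two apart by inspecting raw bytes. `looks_like_ipc_file` is a hypothetical helper written for this example, not an arrow crate API, and it checks only the magic bytes rather than validating the footer or any messages.

```rust
/// Magic bytes that open and close an Arrow IPC file.
const ARROW_MAGIC: &[u8] = b"ARROW1";

/// Heuristic check: true when the buffer both starts and ends with the
/// Arrow file magic. This does not validate the footer or the messages.
pub fn looks_like_ipc_file(bytes: &[u8]) -> bool {
    bytes.len() >= 2 * ARROW_MAGIC.len()
        && bytes.starts_with(ARROW_MAGIC)
        && bytes.ends_with(ARROW_MAGIC)
}

fn main() {
    // Fake "file" buffer: magic plus padding, placeholder body, trailing magic.
    let mut file_like = Vec::new();
    file_like.extend_from_slice(b"ARROW1\0\0");
    file_like.extend_from_slice(&[0u8; 16]); // placeholder, not real messages
    file_like.extend_from_slice(b"ARROW1");
    assert!(looks_like_ipc_file(&file_like));

    // Fake "stream" buffer: begins with the 0xFFFFFFFF continuation marker
    // used by the streaming format, with no magic bytes.
    let stream_like: [u8; 8] = [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0];
    assert!(!looks_like_ipc_file(&stream_like));

    println!("magic-byte checks passed");
}
```

A check like this can help confirm which format a writer actually produced before handing the bytes to a reader that expects one or the other.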

@alamb
Contributor Author

alamb commented Mar 5, 2022

Thank you for the update @kylebarron
