-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WASM UDFs #9326
Comments
BTW #9332 might be related (API to allow functions to be registered externally) |
Example exposing JVM as UDF with JNI https://github.com/milenkovicm/adhesive using #9332 From my experience WASM is going to be a bit more complicated as there are many things missing. |
Could you elaborate this? What things / building blocks / features / ... are missing? Is it WASM stuff, or Arrow/DataFusion stuff? |
Ive done few pocs some time ago, they were focused on rust calling rust generated wasm. I found that documentation and/or examples missing for anything more complex than simple usage. Due to all this I personally found it very complex to grasp, and not fit for my use case. Also, from this perspective, I believe I failed to understand importance of wasm bindgen as well. I gave up before reaching stage to add arrow. Now, don't get discouraged, looks like there is progress in wasm community. Personally it looks wasm edge may be the way to go and I'll try to replace JVM implementation with wasm in near future, from JVM experience it should not take more than few days work to get basic example running. I hope this helps |
I totally agree that mapping Arrow through the WASM boundary isn't a trivial task, which is also why I've opened this ticket and I wanna have one impl. and SDKs / blueprints for different UDF "guest" languages, so people don't have to suffer through this pain on their own. |
My understanding is that any interaction with wasm would involve data copy across boundary. Do I miss something? On that note, what would be the simplest way to convert |
It's very tied to JS at the moment, but you can use https://github.com/kylebarron/arrow-js-ffi to view Arrow data across the Wasm boundary in the cases where you want to persist the data in Wasm. I'm not sure how this would work with server-side wasm though. It seems like you could use the Arrow C Data interface across the Wasm boundary directly, though? The main work in Arrow JS FFI is just implementing the C Data Interface in JS, but it already exists in Rust.
This (and a related/successor project I've been working on https://github.com/kylebarron/arrow-wasm) are both quite tied to wasm-bindgen and browser use cases. It's not clear to me how they'd work in WASI. |
I just read your article @kylebarron https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly 👍🏻 few minutes ago
https://lib.rs/crates/wasmedge-bindgen does not look tided up to browser/js |
It might be relevant #9834 |
Fresh off the press, a quick WASM UDF poc built on top of WasmEdge library https://github.com/milenkovicm/wasaffi It is working, but far from being production ready. Data needs to be copied to wasm VM ( arrow ipc-ed to local buffer then buffer copied across using wasmedge bindgen, and same for function result ). Most of the complexity can be hidden with a simple macro. One note, ipc buffer contains schema definition as well, can this be avoided if both parties know the schema? |
That was my hope. I think we should make sure though that the FFI WASM API is well defined and that other guest languages (like C++ or Python) could also implement that boilerplate w/o needing to go through a Rust layer within the guest (otherwise the WASM payloads get rather large).
Yes. This is for example what Arrow Flight (SQL) also uses. The client can query a server w/o knowing the schema a priori. |
IMHO, I dont think there is too many problems on datafusion/arrow side, WASM side FFI interface is still left to be desired, and in constant state of flux. |
Is your feature request related to a problem or challenge?
Databases / ETL solutions built on top of DataFusion can use UDFs (in their various forms) to extend the functionality of DataFusion, e.g. to add new scalar or aggregation functions. This extensibility however does NOT automatically extend to their users, since they cannot and (for security reasons) should not add code to the running system. So the U in UDF currently stands for "DataFusion API User", not for "End-User".
WASM provides a way to run user code in a secure sandbox under an "unknown" host (i.e. the user does not need to know about the operating system or CPU architecture). A DataFusion-based solution can use that to implement UDFs. However, since the calling convention from the WASM payload to the UDF are solution-defined, the end user is likely to have a hard time with it, and there is likely only a non-existing/small ecosystem for tooling to develop UDFs.
Defining the UDF WASM interface in DataFusion -- potentially in collaboration with the Arrow (since we need to get Arrow data across the WASM memory boundary) -- would likely facilitate a wider ecosystem and a more streamlined solution. Prior art to this is Arrow Flight, which is now being integrated into more and more server and client implementations.
Describe the solution you'd like
Describe alternatives you've considered
Additional context
Projects that might be helpful:
The text was updated successfully, but these errors were encountered: