JavaScript GeoArrow Module Proposal
The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around GeoArrow in JavaScript fit together really well.
This is a corollary to the Python GeoArrow Module Proposal but focused on GeoArrow interoperability in JavaScript and WebAssembly. I don't know anyone doing GeoArrow-Wasm stuff in C, so this will focus on my efforts in Rust and TypeScript. Unlike in Python, there aren't other people currently working on JavaScript GeoArrow infrastructure, so this is a manifesto to solidify my ideas.
WebAssembly limitations
WebAssembly is sandboxed, which means that Wasm code can only access and modify memory within its own memory space. So Wasm code cannot access JavaScript objects directly.
This also means that two Wasm modules can't share memory. So if you have one Wasm-based NPM library that loads GeoParquet to GeoArrow and another Wasm-based NPM library that implements spatial operations on GeoArrow, there must be a copy from the first module's memory space into JavaScript and then into the second module's memory space.
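As a rough illustration of that copy, the sketch below uses two standalone `WebAssembly.Memory` objects as stand-ins for the linear memories of two separately compiled modules. The helper name `copyBetweenMemories` and the byte offsets are made up for the example; a real module would hand out pointers from its own allocator.

```typescript
// Two independent linear memories stand in for two separately compiled Wasm modules.
const ioMemory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const opsMemory = new WebAssembly.Memory({ initial: 1 });

// Pretend the I/O module wrote four doubles at byte offset 0 (hypothetical offset).
const produced = new Float64Array(ioMemory.buffer, 0, 4);
produced.set([1.5, 2.5, 3.5, 4.5]);

// Neither module can read the other's memory, so JS has to move the bytes.
function copyBetweenMemories(
  src: WebAssembly.Memory,
  srcPtr: number,
  dst: WebAssembly.Memory,
  dstPtr: number,
  byteLength: number,
): void {
  // Viewing the source memory from JS does not copy anything...
  const srcView = new Uint8Array(src.buffer, srcPtr, byteLength);
  // ...but writing into the destination memory does.
  new Uint8Array(dst.buffer, dstPtr, byteLength).set(srcView);
}

// 1024 is an arbitrary, 8-byte-aligned destination offset (hypothetical).
copyBetweenMemories(ioMemory, 0, opsMemory, 1024, produced.byteLength);
const received = new Float64Array(opsMemory.buffer, 1024, 4);
console.log(Array.from(received));
```

After the copy the two memories hold independent bytes: a later write to `produced` is not visible through `received`.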
This means that grouping Wasm functionality together into a single module is more performant, as I/O and operations can be done in a single memory space. This runs up against bundle size: JavaScript bundlers are able to tree-shake JavaScript code, but they can't tree-shake a prebuilt Wasm binary. Instead, the original Rust would have to be recompiled, excluding unwanted functions.
The solution I'm gravitating towards is to have a variety of NPM libraries, described in this document, where I/O or operations are distributed both as their own libraries and in a "kitchen sink" build, which contains everything at the cost of a larger bundle size. Advanced users can compile custom Wasm binaries from the Rust source, with only the desired functionality.
Goals
Similar goals to the Python module proposal:
- Modular: the user can install what they need and choose which dependencies they want, with the goal of somewhat fine-tuned control of bundle size.
- Interoperable: the user can use WebAssembly-based and pure-JavaScript GeoArrow libraries together smoothly.
- Extensible: future developers can develop on top of `geoarrow-wasm` and largely reuse its JS bindings without having to create ones from scratch.
- Strongly typed: a method like `convex_hull` should always return a `PolygonArray` instead of a generic `GeometryArray` that the user can't "see into" statically.
- Static typing: full typing support and IDE autocompletion.
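To make the "strongly typed" goal concrete, here is a minimal sketch of the difference between a narrow and a generic return type. The classes below are simplified, hypothetical stand-ins, not the actual `geoarrow-wasm` bindings, and `isPolygonArray` is an invented method that only exists to show what the type checker can see.

```typescript
// Hypothetical, simplified stand-ins for the geometry array classes.
class GeometryArray {
  readonly geometryType: string;
  readonly length: number;
  constructor(geometryType: string, length: number) {
    this.geometryType = geometryType;
    this.length = length;
  }
}

class PointArray extends GeometryArray {
  constructor(length: number) {
    super("point", length);
  }
}

class PolygonArray extends GeometryArray {
  constructor(length: number) {
    super("polygon", length);
  }
  // Invented polygon-only method: callable only when the static type is PolygonArray.
  isPolygonArray(): true {
    return true;
  }
}

// Generic return type: the caller has to narrow at runtime.
function convexHullUntyped(input: GeometryArray): GeometryArray {
  return new PolygonArray(input.length);
}

// Narrow return type: the compiler and the IDE know the result is polygons.
// (The geometry computation itself is omitted; only the typing is the point here.)
function convexHull(input: GeometryArray): PolygonArray {
  return new PolygonArray(input.length);
}

const points = new PointArray(3);
const hulls = convexHull(points);
hulls.isPolygonArray(); // type-checks without a cast

const loose = convexHullUntyped(points);
// loose.isPolygonArray(); // would be a compile error: not a member of GeometryArray
if (loose instanceof PolygonArray) {
  loose.isPolygonArray(); // only available after a runtime check
}
```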
Data Movement
In contrast to Python, which is able to share the same memory space with native code, data movement between Wasm and JS is not always free, because they occupy two separate memory spaces. JS can see into Wasm memory but not the opposite. This means that data movement from Wasm -> JS can be zero-copy, but JS -> Wasm requires a copy.
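The asymmetry can be sketched with a bare `WebAssembly.Memory`, with no compiled module involved. The helper names `viewFromWasm` and `copyIntoWasm` and the pointer value `64` are assumptions for illustration; in practice the pointer would come from the Wasm module's allocator.

```typescript
const memory = new WebAssembly.Memory({ initial: 1 });

// Wasm -> JS: a typed array view over linear memory; no bytes are copied.
function viewFromWasm(ptr: number, length: number): Float64Array {
  return new Float64Array(memory.buffer, ptr, length);
}

// JS -> Wasm: the values have to be copied into linear memory.
function copyIntoWasm(values: Float64Array, ptr: number): void {
  new Float64Array(memory.buffer, ptr, values.length).set(values);
}

const jsOwned = new Float64Array([10, 20, 30]);
copyIntoWasm(jsOwned, 64); // 64 is a hypothetical, 8-byte-aligned pointer

const view = viewFromWasm(64, 3);

// Changing the JS-owned array afterwards does not touch the copy inside Wasm memory.
jsOwned[0] = -1;

// Simulate the Wasm side writing to element 1 (byte offset 64 is f64 index 8).
new Float64Array(memory.buffer)[8 + 1] = 99;

// The existing view sees the write because it aliases the same memory.
console.log(Array.from(view));
```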
The easiest data movement in JS is to use Arrow IPC buffers to move serialized data between JS and Wasm, but this has a number of drawbacks:
- Significant memory overhead: when constructing the IPC buffer, all `Data` chunks need to be copied into a new `ArrayBuffer`, a full copy of the dataset, before the copy into/out of Wasm.
- All `Data` chunks in JS memory are references onto the same backing `ArrayBuffer` (from the original IPC buffer), which means a `Data` instance can't be transferred to a WebWorker without a copy.
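The second drawback can be sketched with plain typed arrays standing in for Arrow JS `Data` buffers (an assumption made to keep the example dependency-free). Transferring an `ArrayBuffer` to a worker moves the whole buffer, so a column that is only a view onto a shared buffer has to be copied out first.

```typescript
// One ArrayBuffer stands in for a decoded IPC message holding two columns.
const ipcBuffer = new ArrayBuffer(64);
const xs = new Float64Array(ipcBuffer, 0, 4); // first column: bytes 0..31
const ys = new Float64Array(ipcBuffer, 32, 4); // second column: bytes 32..63
xs.set([1, 2, 3, 4]);
ys.set([5, 6, 7, 8]);

// Both columns are views onto the same backing buffer.
const sharesBacking = xs.buffer === ys.buffer;

// Transferring xs.buffer would take ys along with it, so to hand only xs to a
// worker it first has to be copied into a buffer of its own.
const xsOwned = xs.slice(); // slice() allocates a new ArrayBuffer and copies

console.log(sharesBacking, xsOwned.buffer.byteLength);
```

Buffers viewed or copied directly out of Wasm memory per array do not have this coupling, because each one can own its backing `ArrayBuffer`.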
The most performant data movement in JS is to directly view data from Wasm memory and conversely for JS to write array data directly into the Wasm memory space. I've been working on this in arrow-js-ffi and it's a crucial part of Arrow interoperability in Wasm. This solves both of the downsides of Arrow IPC, as it avoids an extra data copy and the Data instances in JS have a backing buffer not shared with any other Data.
Module hierarchy
Here's a quick (messy) picture of the dependency graph. An arrow points to the library it depends on, so here geoarrow-wasm depends on geoarrow-rs.
The most important part is that there are no dependency cycles.
Rust Core (non-Wasm)
geoarrow-rs is the Rust core with all core GeoArrow functionality. All algorithms, core I/O, etc. are implemented in this crate so that as much as possible can be shared among pure-Rust, JS, and Python.
This crate does not on its own have any JS bindings. All JS functionality is exported in separate crates/packages below.
- Rust crate name: `geoarrow`
Arrow-Wasm Core
Shared arrow definitions and FFI functionality to/from Arrow JS.
- Rust crate name: `arrow-wasm`
- JS package name: None? It's unclear whether this should even be published to NPM, as it's not useful on its own; it's useful as a building block for other libraries.
- Dependencies:
  - Only the `arrow` crate.
- Defines common abstractions in Rust with JS-facing APIs for `Table`, `Vector`, `Data`, `DataType`.
- Enables zero-copy (or one-copy, but serialization-free) interop with Arrow JS.
Computational library
Standalone library for spatial operations on GeoArrow arrays, without any I/O except for Arrow IPC and FFI. This is the `slim` compilation feature of `geoarrow-wasm`.
- Rust crate name: `geoarrow-wasm`
- JS package name: `@geoarrow/geoarrow-wasm-slim`
- Dependencies:
  - `geoarrow-rs` for computational algorithms to wrap for JS
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
  - Other dependencies in the graph are only used with the `full` compilation feature, described below under "Kitchen Sink"
- Algorithms to operate on GeoArrow memory
- All operations that have a pure-Rust core and can be compiled seamlessly to Wasm
- For now, includes all algorithms. Maybe in the future, we could have different NPM packages for different sets of libraries, but that sounds like a lot of work.
I/O Wasm libraries
There should exist standalone libraries with a minimal bundle size to read and write various file formats to/from GeoArrow.
parquet-wasm
Standalone library to read and write Parquet files in Wasm.
- Rust crate name: `parquet-wasm`
- JS package name: `parquet-wasm`
- Dependencies:
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
geoparquet-wasm
Standalone library to read and write GeoParquet files in Wasm.
- Rust crate name: `geoparquet-wasm`
- JS package name: `@geoarrow/geoparquet-wasm`
- Dependencies:
  - `parquet-wasm` for JS bindings to read/write Parquet
  - `geoarrow-rs` to encode/decode WKB geometries to/from GeoArrow
- Functional API:
  - `readGeoParquet`: wraps `parquet-wasm`'s `readParquet`, converting WKB column to GeoArrow before returning an `arrow-wasm` `Table` instance
  - `writeGeoParquet`: wraps `parquet-wasm`'s `writeParquet`, converting GeoArrow in the `Table` input to WKB before passing on to `writeParquet`.
  - `readGeoParquetStream`: wraps `parquet-wasm`'s `readParquetStream`
  - TODO: more async APIs
flatgeobuf-wasm
Standalone library to read and write FlatGeobuf files in Wasm.
- Rust crate name: `flatgeobuf-wasm`
- JS package name: `@geoarrow/flatgeobuf-wasm`
- Dependencies:
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
  - `geoarrow-rs` to read/write FlatGeobuf to/from GeoArrow
- Functional API:
  - `readFlatGeobuf`: parses FlatGeobuf buffer, returning an `arrow-wasm` `Table` instance
  - `writeFlatGeobuf`: creates a FlatGeobuf buffer from an `arrow-wasm` `Table` instance.
  - Future: `readFlatGeobufStream`: generates an async iterable of `arrow-wasm` `RecordBatch` from a remote FlatGeobuf file
  - Future: read data by bounding-box from a remote file
The kitchen sink
This is the `full` compilation feature of `geoarrow-wasm`, bundling the computational library together with all of the I/O libraries above.
- Rust crate name: `geoarrow-wasm`
- JS package name: `@geoarrow/geoarrow-wasm`
- Dependencies:
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
  - `geoparquet-wasm` for JS bindings for GeoParquet
  - `flatgeobuf-wasm` for JS bindings for FlatGeobuf
  - `geoarrow-rs` for algorithms
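The dependency lists in the sections above can be written down as a small graph and checked for cycles. The sketch below is only a sanity check of the proposal as described here, not tooling that exists in any of these projects; the function name `buildOrder` is made up, and the external `arrow` crate is left out of the graph.

```typescript
// Dependencies as listed in this proposal (full "kitchen sink" build of geoarrow-wasm).
const dependencies: Record<string, string[]> = {
  "geoarrow-rs": [],
  "arrow-wasm": [], // only depends on the external `arrow` crate
  "parquet-wasm": ["arrow-wasm"],
  "geoparquet-wasm": ["parquet-wasm", "geoarrow-rs"],
  "flatgeobuf-wasm": ["arrow-wasm", "geoarrow-rs"],
  "geoarrow-wasm": ["arrow-wasm", "geoparquet-wasm", "flatgeobuf-wasm", "geoarrow-rs"],
};

// Depth-first topological sort; throws if a dependency cycle is found.
function buildOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  function visit(name: string, path: string[]): void {
    const current = state.get(name);
    if (current === "done") return;
    if (current === "visiting") {
      throw new Error(`dependency cycle: ${[...path, name].join(" -> ")}`);
    }
    state.set(name, "visiting");
    for (const dep of graph[name] ?? []) {
      visit(dep, [...path, name]);
    }
    state.set(name, "done");
    order.push(name); // a package is emitted only after all of its dependencies
  }

  for (const name of Object.keys(graph)) {
    visit(name, []);
  }
  return order;
}

const order = buildOrder(dependencies);
console.log(order.join(" -> "));
```

Every package appears after the packages it depends on, so the kitchen-sink `geoarrow-wasm` build comes last.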
Pure JS Interop
This is designed to smoothly interop with pure-JavaScript Arrow libraries.
Arrow JS
The canonical implementation of Arrow in JS. It only supports IPC for data I/O.
Arrow JS FFI
A library to read/write Arrow data across the Wasm boundary. This interops with the core arrow-wasm crate above.
GeoArrow JS
A pure-JavaScript (TypeScript) implementation of GeoArrow. This uses the exact same memory layout as GeoArrow in Rust, so it should be possible to mix and match between pure-JS and Wasm-based algorithms without changing data representations.
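As a rough picture of what a shared memory layout means in practice, the sketch below models a LineString array the way the GeoArrow specification describes it for interleaved coordinates: one offsets buffer plus one flat coordinate buffer. The interface and function names are hypothetical and are not the GeoArrow JS API; the point is only that any code reading these two buffers, whether JS or Wasm, sees the same geometries.

```typescript
// Hypothetical minimal representation of a 2D LineString array (interleaved xy).
interface LineStringBuffers {
  // n + 1 offsets, counted in vertices; linestring i spans [offsets[i], offsets[i + 1])
  geomOffsets: Int32Array;
  // x0, y0, x1, y1, ... for all vertices of all linestrings
  coords: Float64Array;
}

// Returns a zero-copy view of the coordinates of linestring i.
function lineStringCoords(array: LineStringBuffers, i: number): Float64Array {
  const start = array.geomOffsets[i];
  const end = array.geomOffsets[i + 1];
  return array.coords.subarray(start * 2, end * 2);
}

// Two linestrings: the first has 2 vertices, the second has 3.
const lines: LineStringBuffers = {
  geomOffsets: new Int32Array([0, 2, 5]),
  coords: new Float64Array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4]),
};

const second = lineStringCoords(lines, 1);
console.log(Array.from(second));
```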
