Motivations
The IO layer is a critical part of a table storage framework. However, the current IO implementation suffers from several drawbacks:
The IO stack is tightly coupled with Spark, making it difficult to adapt to other computing frameworks, including both SQL engines and AI engines (Flink, Presto, Pandas, Dask, PyTorch, etc.).
The file IO still relies on the Hadoop FileSystem, which is inefficient on high-latency storage such as S3.
The lack of expression evaluation capability makes it difficult to implement a compute-engine-independent MergeInto SQL with merge on read.
Goals
Compute engine neutral. The native IO layer implements self-contained IO logic such as merge on read and provides interfaces for compute engines to invoke.
Native. Compute engines do not live only in the Java world. We would also like to support popular Python data processing frameworks such as Pandas/Dask, and AI frameworks such as PyTorch with C++ at their core.
Fast. IO on object stores usually has high latency, lacks full file semantics, and drags down the overall execution pipeline. Things get worse when there are multiple files to merge. We would like the IO layer to enable concurrency and asynchrony on both the read and write paths.
Feature-rich. The native IO layer should support commonly required data source features such as predicate pushdown, index filtering, and atomic writes. We would also like the reader to support MergeInto SQL within the IO layer so that the merge logic is transparent to the compute engines.
Easy to use (embed). The native IO layer is itself a library that exposes its functionality via C interfaces with Java/Python wrappers. It should be easy to embed into compute engines.
NonGoals
Native IO layer is NOT a distributed execution framework. The native IO layer is a library inside a single execution unit and is not itself aware of any distributed execution context. It is up to the higher-level engines whether and how to read/write data in parallel (e.g., partitions in Spark, splits in Presto).
Native IO layer is NOT a SQL engine. The native IO layer is not designed to be a SQL execution engine. Although it has some expression evaluation capability, it primarily aims to provide table data read and write on data lakes. It acts as a data source for compute engines and should be used together with LakeSoul's meta layer.
Design
We use Arrow (arrow-rs) + DataFusion to implement the new IO layer for the following reasons:
Arrow-rs provides asynchronous IO for Parquet read and write, and the pipeline executes asynchronously, which is in line with our design goals;
DataFusion brings a relatively complete implementation of physical plans and expressions, and can easily support MergeInto SQL with merge on read;
Rust is efficient and memory-safe, with compiled native vectorized execution, and makes it easy to provide bindings in other languages.
According to this design idea, the overall modules and execution logic are divided as follows:
The diagram above shows the logical hierarchy of the native IO layer. It has (roughly) the following parts, from the bottom up:
IO through DataFusion's object store abstraction, with async reader/writer traits.
File format support built on the object store, with async read and write.
Merge on read execution plan. The execution plan combines DataFusion's built-in hash join or sort merge join with customized projection/filter plans to support MergeInto SQL. A typical plan with multiple files to merge would take the following form:
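As an illustrative sketch only (operator names follow DataFusion's physical operators; the exact plan shape is an assumption, not the actual plan), merging three files sorted by primary key might produce:

ProjectionExec: merge expressions over the joined columns
  SortMergeJoinExec: full outer join on primary key
    SortMergeJoinExec: full outer join on primary key
      ParquetExec: file_1.parquet (sorted by primary key)
      ParquetExec: file_2.parquet (sorted by primary key)
    ParquetExec: file_3.parquet (sorted by primary key)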
Reader interface in Rust and C. This provides a simple interface to iterate over the merged Arrow record batches asynchronously. The Rust interface could look like the following:
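A minimal sketch of what this could look like, assuming a reader struct named LakeSoulReader with a next_batch method (both names are illustrative, not the final API):

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

pub struct LakeSoulReader { /* holds the merge-on-read plan and its stream */ }

impl LakeSoulReader {
    /// Start executing the underlying merge-on-read plan.
    pub async fn start(&mut self) -> Result<(), ArrowError> {
        unimplemented!()
    }

    /// Asynchronously fetch the next merged record batch;
    /// `None` signals the end of the stream.
    pub async fn next_batch(&mut self) -> Option<Result<RecordBatch, ArrowError>> {
        unimplemented!()
    }
}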
We could also expose an extern "C" interface with a callback to support async IO.
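For illustration only, such an entry point might look like the sketch below (the symbol and parameter names are assumptions):

use std::os::raw::c_void;

/// Callback invoked when the next batch is ready;
/// `has_next == false` signals the end of the stream.
pub type BatchCallback = extern "C" fn(has_next: bool);

#[no_mangle]
pub extern "C" fn next_record_batch(
    reader: *mut c_void, // opaque reader handle created by another C function
    schema_addr: usize,  // address of an FFI ArrowSchema to be filled
    array_addr: usize,   // address of an FFI ArrowArray to be filled
    callback: BatchCallback,
) {
    // A real implementation would poll the async stream on a runtime, export
    // the next batch into (schema_addr, array_addr) via the Arrow C data
    // interface, and then invoke callback(true), or callback(false) at the
    // end of the stream.
    let _ = (reader, schema_addr, array_addr);
    callback(false);
}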
JNI/Python wrapper. In the JNI wrapper, we could provide a native method that accepts Arrow schema and array pointers together with a callback object, like the following:
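On the Rust side, a hypothetical JNI binding could look like the sketch below (it uses the jni crate; the class, method, and parameter names are all assumptions for illustration):

use jni::objects::{JClass, JObject};
use jni::sys::jlong;
use jni::JNIEnv;

// Hypothetical binding for a Scala/Java declaration such as:
//   @native def nextBatch(callback: Boolean => Unit, schemaAddr: Long, arrayAddr: Long): Unit
#[no_mangle]
pub extern "system" fn Java_com_lakesoul_NativeReader_nextBatch(
    env: JNIEnv,
    _class: JClass,
    callback: JObject,  // e.g. a scala.Function1[java.lang.Boolean, Unit]
    schema_addr: jlong, // ArrowSchema pointer passed from the Java side
    array_addr: jlong,  // ArrowArray pointer passed from the Java side
) {
    // Sketch: drive the async reader, export the next batch through the
    // Arrow C data interface at (schema_addr, array_addr), then invoke the
    // callback with true, or with false at the end of the stream.
    let _ = (env, callback, schema_addr, array_addr);
}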
Here the native implementation (asynchronously) iterates the native stream, gets the next available Arrow record batch, populates the Arrow C data structs through their pointers, and calls the callback with a boolean argument indicating whether a batch is available (false signals the end of the stream). The expected usage pattern would look like the following (in Scala):
val p = Promise[Option[VectorSchemaRoot]]()
tryWithResource(ArrowSchema.allocateNew(allocator)) { consumerSchema =>
  tryWithResource(ArrowArray.allocateNew(allocator)) { consumerArray =>
    val schemaPtr: Long = consumerSchema.memoryAddress
    val arrayPtr: Long = consumerArray.memoryAddress
    reader.nextBatch((hasNext) => {
      if (hasNext) {
        val root: VectorSchemaRoot =
          Data.importVectorSchemaRoot(allocator, consumerArray, consumerSchema, provider)
        p.success(Some(root))
      } else {
        p.success(None)
      }
    })
  }
}
val fut: Future[Option[VectorSchemaRoot]] = p.future
// get the record batch from the future, either synchronously or asynchronously
Compute engine adapters. For example, in Spark we could implement a vectorized reader based on the above interface and implement the DataSource V2 interfaces.
Plan
We plan to first implement a default overwrite merge logic reader, tracked under this issue. Further support for MergeInto SQL will be tracked in a separate issue.
Development Branch
develop/native_io